License: arXiv.org perpetual non-exclusive license
arXiv:2401.03913v2 [cs.CE] 09 Jan 2024

A Wasserstein graph distance based on distributions of probabilistic node embeddings

Abstract

Distance measures between graphs are important primitives for a variety of learning tasks. In this work, we describe an unsupervised, optimal transport based approach to define a distance between graphs. Our idea is to derive representations of graphs as Gaussian mixture models, fitted to distributions of sampled node embeddings over the same space. The Wasserstein distance between these Gaussian mixture distributions then yields an interpretable and easily computable distance measure, which can further be tailored for the comparison at hand by choosing appropriate embeddings. We propose two embeddings for this framework and show that under certain assumptions about the shape of the resulting Gaussian mixture components, further computational improvements of this Wasserstein distance can be achieved. An empirical validation of our findings on synthetic data and real-world Functional Brain Connectivity networks shows promising performance compared to existing embedding methods.

Index Terms—  Optimal Transport, graph distance, graph similarity, node embedding, functional brain connectivity

1 Introduction

Graphs and networks have become an almost ubiquitous abstraction in domains like biology, medicine or social sciences to represent a large range of complex systems [1]. For instance, protein interactions, brain connections or social dynamics are frequently modelled as networks and studied from this perspective [2, 3, 4]. Due to this increasing abundance of network data, the classical problem of quantifying (dis-)similarities between graphs has seen a surge of research interest recently. Indeed, a graph distance measure to compare the structure of various systems is crucial to enable an exploratory, comparative analysis of (sets of) graphs in many application contexts. However, for most applications it is typically not only important to quantify the difference between two graphs on a global level, but to identify the lower-level, structural differences that contribute to this difference. Accordingly, optimal transport based graph distances, which not only provide a distance measure between two graphs based on a probabilistic matching but also a transport plan that highlights where changes occur, have recently gained significant attention [5, 6, 7].

In general, graph similarity measures may be classified as either supervised or unsupervised. Supervised approaches aim at learning a distance function that effectively distinguishes between differently labeled networks. These include approaches for graph similarity of human brain fMRI data using Graph Neural Networks [8] or protein-protein interactions using Genetic Programming [9]. Unsupervised approaches, on the other hand, are concerned with finding distances between networks without having access to labels. They are particularly useful for the exploratory study of cluster differences beyond known classes. Some approaches leverage the powerful but computationally expensive Graph Edit Distance [10, 11]. Other methods first compute a vector representation of the network, which is then used to define a distance metric [12, 13]. Recently, there have been approaches that use Optimal Transport (OT) to define a distance between networks, based on the Gromov-Wasserstein distance [14, 15]. Fused Gromov-Wasserstein [16] is an extension for attributed graphs where, in addition to the graph structure, node attributes can also influence the distance between two graphs. Closely related to our approach are OT-based methods that leverage Wasserstein distances on graphs [5, 6, 7]. These approaches define the distance between two graphs as the distance between the distributions of the corresponding systems excited by Gaussian noise. In contrast, we propose specific non-Gaussian node embeddings that highlight distinct structural aspects of the graph.

Contribution. In this paper, we propose a novel unsupervised approach for computing the distance between two graphs based on Optimal Transport. This provides us not only with an alignment between the node sets of the two graphs, but also with a measure of the quality of this alignment (the actual distance between the graphs). Our approach is efficient and thus scalable to large data sets. Furthermore, it can even be used to compare graphs of different sizes. We show that, as we increase the number of samples, our approach defines a pseudometric on the space of all graphs. Further, we evaluate our approach on a range of synthetic data and apply it to Functional Brain Connectivity networks of mice, where we can recover meaningful patterns in the data.

2 Notation and Preliminaries

Notation. A graph $G=(V,E)$ consists of a node set $V$ and an edge set $E=\{uv \mid u,v\in V\}$. Given a graph $G=(V,E)$, we identify the node set $V$ with $\{1,\ldots,n\}$. We allow for self-loops $vv\in E$ and positive edge weights $w:E\rightarrow\mathbb{R}_{+}$ in our graphs. For a matrix $M$, $M_{i,j}$ is the component in the $i$-th row and $j$-th column. We use $M_{i,\_}$ to denote the $i$-th row vector of $M$ and $M_{\_,j}$ to denote the $j$-th column vector. An adjacency matrix of a given graph is a matrix $A$ with entries $A_{u,v}=0$ if $uv\notin E$ and $A_{u,v}=w(uv)$ otherwise, where we set $w(uv)=1$ for all $uv\in E$ in unweighted graphs.
For two vectors $x,y$ we write $x \mathbin{\|} y$ for the concatenation and $\mathbin{\|}_{i=0}^{n} x_i$ for the concatenation over a sequence of vectors $x_0,\ldots,x_n$. We denote the two-norm of a vector $x$ by $\|x\|$.

Optimal Transport. Optimal transport (OT) is a framework for computing distances between probability distributions. In this paper, we leverage the so-called Wasserstein distance ($W_2^2$), also referred to as Earth Mover's Distance. For two probability distributions $\mathcal{X},\mathcal{Y}$ on some metric space $\mathcal{S}$, the Wasserstein distance can be computed by solving the following optimization problem:

$$W_2^2(\mathcal{X},\mathcal{Y})=\min_{\pi\in\Pi(\mathcal{X},\mathcal{Y})}\int_{\mathcal{X}\times\mathcal{Y}}\|x-y\|^2\,d\pi(x,y) \tag{1}$$

where $\Pi(\mathcal{X},\mathcal{Y})$ is the set of all admissible couplings $\pi$ on $\mathcal{S}\times\mathcal{S}$ whose marginals are $\mathcal{X}$ and $\mathcal{Y}$, with $\pi(x,y)$ being the mass moved from $x$ to $y$.

When both distributions considered are multivariate Normal distributions, i.e., $\mathcal{X}=\mathcal{N}(\mu_1,\Sigma_1)$, $\mathcal{Y}=\mathcal{N}(\mu_2,\Sigma_2)$, with mean vectors $\mu_1,\mu_2$ and covariance matrices $\Sigma_1,\Sigma_2$, respectively, the Wasserstein distance has a closed-form expression given by

$$W_2^2(\mathcal{X},\mathcal{Y})=\|\mu_1-\mu_2\|^2+\operatorname{tr}(\Sigma_1)+\operatorname{tr}(\Sigma_2)-2\operatorname{tr}\left(\left(\Sigma_1^{\frac{1}{2}}\Sigma_2\Sigma_1^{\frac{1}{2}}\right)^{\frac{1}{2}}\right)$$
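The closed-form expression for two Gaussians can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `psd_sqrt` and `gaussian_w2_squared` are hypothetical, the matrix square root is taken through an eigendecomposition (which assumes symmetric positive semi-definite covariances), and the mean difference enters as a squared norm.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2_squared(mu1, S1, mu2, S2):
    """Squared 2-Wasserstein distance between N(mu1, S1) and N(mu2, S2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)
    return float(mean_term + cov_term)
```

When both covariances are equal, the covariance term vanishes and only the squared Euclidean distance between the means remains.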

Reference [17] appropriately generalizes this to Gaussian Mixtures $\mathcal{M},\hat{\mathcal{M}}$, which are central to our approach. Here we consider uniformly weighted GMMs $\mathcal{M}=\frac{1}{n}(\mathcal{N}_1+\ldots+\mathcal{N}_n)$ with Gaussian distributions $\mathcal{N}_i=\mathcal{N}(\mu_i,\Sigma_i)$, called Gaussian components, having equal weight $\frac{1}{n}$. In this case the optimal transport distance can be computed by considering the optimization problem

$$MW_2^2(\mathcal{M},\hat{\mathcal{M}})=\min_{\pi\in\Pi(\mathcal{M},\hat{\mathcal{M}})}\sum_{i,j}W_2^2(\mathcal{N}_i,\hat{\mathcal{N}}_j)\,\pi_{i,j} \tag{2}$$

where $\mathcal{N}_i$, $\hat{\mathcal{N}}_j$ are the $i$-th resp. $j$-th component of the Gaussian mixture distributions $\mathcal{M},\hat{\mathcal{M}}$ and $\pi_{i,j}$ is the mass moved from the Gaussian component $\mathcal{N}_i$ to the Gaussian component $\hat{\mathcal{N}}_j$.
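Equation 2 is a finite linear program over the coupling matrix, so it can be handed to a generic LP solver. The sketch below assumes that the pairwise component costs $W_2^2(\mathcal{N}_i,\hat{\mathcal{N}}_j)$ are already available as a matrix and uses `scipy.optimize.linprog`. The function name `mixture_w2_squared` is hypothetical, and a dedicated OT solver would likely be preferable for large mixtures.

```python
import numpy as np
from scipy.optimize import linprog

def mixture_w2_squared(cost, a=None, b=None):
    """Solve the discrete OT problem between two mixtures.

    cost[i, j] is the squared Wasserstein distance between component i of the
    first mixture and component j of the second. a and b are the component
    weights (uniform if omitted). Returns (distance, transport plan).
    """
    cost = np.asarray(cost, float)
    n, m = cost.shape
    a = np.full(n, 1.0 / n) if a is None else np.asarray(a, float)
    b = np.full(m, 1.0 / m) if b is None else np.asarray(b, float)
    # Marginal constraints acting on the row-major flattened plan.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(cost.ravel(),
                  A_eq=np.vstack([row_sums, col_sums]),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun), res.x.reshape(n, m)
```

Because the weights are passed explicitly, the same routine also covers mixtures with different numbers of components.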

The OT framework can also be used to compare metric spaces (or distributions of points defined in different spaces) by means of the Gromov-Wasserstein distance ($GW_2^2$) [14, 15]. For two probability distributions $\mathcal{X},\mathcal{Y}$ supported in different spaces, the associated optimization problem then becomes [18]:

$$GW_2^2(\mathcal{X},\mathcal{Y})=\min_{\pi\in\Pi(\mathcal{X},\mathcal{Y})}\sum_{i,j,k,l}\left\|W_2^2(x_i,x_k)-W_2^2(y_j,y_l)\right\|^2\pi_{i,j}\,\pi_{k,l}$$

where $x_i,x_k\sim\mathcal{X}$ and $y_j,y_l\sim\mathcal{Y}$.
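To make the role of the two index pairs concrete, the following sketch evaluates the Gromov-Wasserstein objective for a fixed coupling. It does not perform the minimisation over couplings. The function names are hypothetical, and the intra-space costs are assumed to be squared Euclidean distances between points.

```python
import numpy as np

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gw_objective(C1, C2, plan):
    """Gromov-Wasserstein objective for a fixed coupling.

    C1[i, k] and C2[j, l] are intra-space costs, plan[i, j] is the coupling.
    """
    # Broadcast to shape (n, m, n, m) with index order i, j, k, l.
    diff = C1[:, None, :, None] - C2[None, :, None, :]
    return float(np.einsum("ijkl,ij,kl->", diff ** 2, plan, plan))
```

Only distances within each space enter the objective, so the two point sets may live in spaces of different dimension, for example a triangle in the plane compared with a shifted copy embedded in three dimensions.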

Intuitively, the Gromov-Wasserstein formulation seeks to map points onto each other such that the distances between all pairs of points are preserved as much as possible. Hence, in contrast to Equation 2, the Wasserstein distances $W_2^2$ are only computed between elements of the same distribution, $\mathcal{X}$ or $\mathcal{Y}$, respectively. Consequently, the Gromov-Wasserstein distance $GW_2^2$ can be computed for distributions supported on different spaces. While this additional flexibility can be advantageous for applications, in this paper we argue that it can be worthwhile to compute vectorial graph representations that are supported in the same space. This enables us to leverage the Wasserstein distance as a graph distance measure. Our experiments show that our proposed embedding methods CCB and CNP are suited to produce such graph representations.

3 Proposed Method

In this section, we establish our approach for computing the distance between two graphs using OT. The high-level approach is as follows: We compute multiple randomly initialised i.i.d. node embeddings for each node. Subsequently, fitting a Gaussian to the sampled embeddings of each node represents the graph as a Gaussian Mixture. By computing the optimal transport plan between the Gaussian Mixtures of two graphs, we obtain a node alignment with the corresponding cost. In the following, we present two node embeddings that can be used in the above framework and that highlight different properties of the network.

Node Embedding. Our approach hinges on the fact that the embedding we create for each node is dependent on some random variable. If this is not the case, then the (co-)variance is jointly $0$ for all nodes, which reduces the Wasserstein distance to the squared Euclidean distance between the means. We propose two node embeddings that fulfill this requirement: CCB and CNP. The proposed Colored Cooper Barahona embedding (CCB) is an extension of the Cooper Barahona embedding [19]. The original embedding embeds a node as the concatenation of the rows in the matrix power $A^{\delta}\mathbb{1}$ of the adjacency matrix $A$. This captures not only the degree of a node but also the connections of length up to $\delta < d$. We adapt the embedding by using colors, which we use to combine the nodes into groups. We thus obtain a more expressive yet still low-dimensional embedding.

The CCB embedding works as follows: For a number of colors $k$ and $n$ nodes, we sample $k-1$ cuts $(c_2,\ldots,c_k)$ uniformly at random from $\{1,\ldots,n-1\}$ without replacement, sort them such that $c_i < c_j$ for $i < j$, and define $c_1=0$, $c_{k+1}=n$. We then construct a block matrix $H\in\{0,1\}^{k\times n}$ where $H_{i,j}=1$ if $c_j\leq i\leq c_{j+1}$ and $H_{i,j}=0$ otherwise for $1\leq j\leq k$. We then simply compute the embedding as:

$$\bar{\varphi}_{\text{CCB}}(v,H,d)=\Big\|_{i=0}^{d}\frac{1}{\|A\|^{i}}A^{i}_{v,\_}H.$$

The CCB embedding thus embeds the nodes with an embedding of size $k\cdot d$. However, the ordering of the nodes is paramount for the expressivity of the embedding. A new ordering of the nodes could lead to vastly different embeddings.
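The CCB construction can be sketched as follows. Several choices here are assumptions made for illustration and are not confirmed by the text: $H$ is stored with one row per node (shape $n\times k$) so that the product $A^{i}_{v,\_}H$ is defined, node $i$ receives color $j$ when $c_j < i \leq c_{j+1}$ so that every node has exactly one color, and $\|A\|$ is taken to be the spectral norm. The function names are hypothetical. With the index range $i=0,\ldots,d$ as written in the formula, the sketch returns $k(d+1)$ coordinates per node.

```python
import numpy as np

def sample_block_coloring(n, k, rng):
    """Sample k-1 sorted cuts and return the n x k block indicator matrix H."""
    cuts = np.sort(rng.choice(np.arange(1, n), size=k - 1, replace=False))
    bounds = np.concatenate([[0], cuts, [n]])
    H = np.zeros((n, k))
    for j in range(k):
        # Consecutive nodes between two cuts share color j.
        H[bounds[j]:bounds[j + 1], j] = 1.0
    return H

def ccb_embedding(A, H, d):
    """Unnormalised CCB-style embedding; row v is the embedding of node v."""
    norm = np.linalg.norm(A, 2)  # assumption: spectral norm of A
    blocks, power = [], np.eye(A.shape[0])
    for i in range(d + 1):
        blocks.append(power @ H / norm ** i)  # rows are A^i_{v,_} H / ||A||^i
        power = power @ A
    return np.hstack(blocks)
```

Drawing a fresh `H` for every sample is what makes the embedding of each node a random variable.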

In contrast, the Colored Neighborhood Propagation (CNP) embedding is invariant under reordering of the nodes. It is an extension of the SNP embedding used in [20]. Again, we adapt the original using random colors; the sampling procedure, however, is different: We first randomly assign one of $k$ colors to each node using an indicator matrix $H$, where $H_{v,c}=1 \iff c(v)=c$ and $H_{v,c}=0$ otherwise. As opposed to the CCB embedding, this matrix has no block structure. For each distance $0\leq\delta\leq d$ and for each node $v$, we count the number of nodes of each color that are reachable in $\delta$ steps, and store them in a matrix of size $(d+1)\cdot k$. We then sort the columns of this matrix lexicographically. This leads to the following definition of the CNP embedding:

$$\bar{\varphi}_{\text{CNP}}(v,H,d)=\Big\|_{i=1}^{d+1}M_{i,\_}\quad\text{with}\quad M=\text{lex-sort}\left(\begin{bmatrix}\frac{1}{\|A\|}A^{0}_{v,\_}H\\ \vdots\\ \frac{1}{\|A\|^{d}}A^{d}_{v,\_}H\end{bmatrix}\right)$$

We also normalize each embedding: $\varphi(v,H,d)=\frac{1}{\|\bar{\varphi}(v,H,d)\|}\bar{\varphi}(v,H,d)$.

Due to the sorting, this embedding is invariant under isomorphism, that is, the probability of sampling a certain embedding is independent of the node ordering of the graph.
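A possible NumPy sketch of the CNP construction, including the lexicographic column sort and the final normalisation, is given below. The function names are hypothetical. Two details are assumptions: $\|A\|$ is taken to be the spectral norm, and row $i$ of $M$ is scaled by $\|A\|^{i}$ as in the CCB formula.

```python
import numpy as np

def sample_random_coloring(n, k, rng):
    """Assign one of k colors to each node; returns the n x k indicator H."""
    colors = rng.integers(0, k, size=n)
    H = np.zeros((n, k))
    H[np.arange(n), colors] = 1.0
    return H

def cnp_embedding(A, H, d):
    """Normalised CNP-style embedding; row v is the embedding of node v."""
    n, k = H.shape
    norm = np.linalg.norm(A, 2)  # assumption: spectral norm of A
    rows, power = [], np.eye(n)
    for i in range(d + 1):
        rows.append(power @ H / norm ** i)  # assumption: scale by ||A||^i
        power = power @ A
    stacked = np.stack(rows, axis=1)  # shape (n, d + 1, k): one matrix M per node
    out = np.empty((n, (d + 1) * k))
    for v in range(n):
        M = stacked[v]
        # np.lexsort treats the last key as primary, so reverse the rows
        # to make the first row of M the primary sort key for the columns.
        order = np.lexsort(M[::-1])
        out[v] = M[:, order].ravel()  # concatenate the rows of the sorted M
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Sorting the columns removes the dependence on how the colors are labelled, and relabelling the nodes only permutes the rows of the result.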

Input: $G_1=(V_1,E_1)$, $G_2=(V_2,E_2)$
for $v\in V_1\cup V_2$ do
    for $i\leq s$ do
        sample assignment of nodes to colors $H^{(i)}$
        $\varphi^{(i)}(v)=\varphi_{\text{X}}(v,H^{(i)},d)$
    Fit Gaussian $\mathcal{N}(\mu_v,\Sigma_v)$ on $\varphi^{(1)}(v),\ldots,\varphi^{(s)}(v)$
Compute Gaussian Mixture $\mathcal{M}_x=\sum_{v\in V_x}\mathcal{N}(\mu_v,\Sigma_v)$
return $MW_2^2(\mathcal{M}_1,\mathcal{M}_2)$
Algorithm 1 Compute the distance between $G_1$ and $G_2$
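The Gaussian-fitting step of Algorithm 1 can be vectorised over all nodes at once. The sketch below assumes that the $s$ sampled embeddings are stacked into an array of shape `(s, n, p)` (samples, nodes, embedding dimension). The function name `fit_node_gaussians` is hypothetical, and the covariance is the maximum likelihood estimate, that is, normalised by the number of samples.

```python
import numpy as np

def fit_node_gaussians(samples):
    """Fit one Gaussian per node from sampled embeddings.

    samples has shape (s, n, p). Returns means of shape (n, p) and
    maximum likelihood covariances of shape (n, p, p).
    """
    s = samples.shape[0]
    means = samples.mean(axis=0)
    centered = samples - means
    # Sum of outer products over the sample axis, divided by s.
    covs = np.einsum("snp,snq->npq", centered, centered) / s
    return means, covs
```

The resulting means and covariances are the components of the uniformly weighted mixture that represents the graph.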

Optimal Transport of Gaussian Mixtures. For each node $v$ we compute $s$ embeddings $\varphi^{(1)},\ldots,\varphi^{(s)}$. We now fit a Gaussian using the maximum likelihood estimate $\mathcal{N}_v=\mathcal{N}\left(\hat{\mu},\frac{1}{n}\sum_i(x_i-\hat{\mu})(x_i-\hat{\mu})^{\top}\right)$ on this collection $\{\ldots,x_i,\ldots\}$ of embedding points, where $\hat{\mu}=\frac{1}{n}\sum_i x_i$ (here $n$ denotes the number of points in the collection, i.e., $n=s$). The entire graph is then encoded as a Gaussian Mixture $\mathcal{M}(G)=\frac{1}{|V|}\sum_{v\in V}\mathcal{N}_v$ of the Gaussians extracted from each node.
To compute the distance between two graphs, we can then compute the Wasserstein distance between the two Gaussian Mixtures $\mathcal{M}(G_1),\mathcal{M}(G_2)$ (see Equation 2). The whole procedure can be found in Algorithm 1. This leads to relevant properties of the extracted distances for both CCB and CNP embeddings:

Proposition 1

For the sample size $s\rightarrow\infty$, CNP defines a pseudometric on the space of all graphs and CCB defines a pseudometric on the space of all adjacency matrices.

The distinction here is related to the isomorphism invariance of the embeddings. While CNP converges to the same expectation regardless of the node ordering, CCB is dependent on the node ordering and will assign a non-zero distance to isomorphic graphs.

The following proposition states that our distance measures can be simplified if we assume additional conditions on the covariances of the distributions.

Proposition 2

Let $D_i=\operatorname{diag}(d^i_1,\ldots,d^i_n)$, $D_j=\operatorname{diag}(d^j_1,\ldots,d^j_n)$. Assume the Mixture components $\mathcal{N}_i=\mathcal{N}(\mu_i,\Sigma_i)$ share scaled covariances: $D_i\Sigma_i=D_j\Sigma_j=\Sigma$. Let $\lambda_x$ be the eigenvalues of $\Sigma$. Then, the Wasserstein distance between two components is equal to:

$$W_2^2(\mathcal{N}^G_i,\mathcal{N}^{\hat{G}}_j)=\|\mu_i-\mu_j\|_2^2+\sum_{x=1}^{n}\left(\frac{\lambda_x}{d^i_x}+\frac{\lambda_x}{d^j_x}-\frac{2\lambda_x}{\sqrt{d^i_x d^j_x}}\right)$$

This substantially speeds up the computation, as we do not have to compute a matrix square root. While the assumptions on the covariance may not always be fulfilled, we can use the above formula as an approximation. In the following, we use three different approaches to compute the distance between two Gaussians, which differ only in the approximation of the Wasserstein distance used: the full Wasserstein distance; the scaled Wasserstein distance, where we adjust the covariances of the Gaussian components such that the assumption of Proposition 2 holds; and the tied Wasserstein distance, where we assume $\Sigma_i=\Sigma_j=\Sigma$, which further simplifies the Wasserstein distance to only the squared Euclidean distance between the means.
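The three variants can be compared numerically with a minimal sketch. This is not the authors' implementation: the function names and toy values are hypothetical, the matrix square root is computed via an eigendecomposition, and the example assumes a diagonal $\Sigma$ so that the eigenvalues $\lambda_x$ line up with the diagonal entries of $D_i$ and $D_j$. Since the formula contains the squared norm of the mean difference, the sketch treats all three quantities as squared distances.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def full_w2_squared(mu_i, cov_i, mu_j, cov_j):
    """Closed-form squared 2-Wasserstein distance between two Gaussians."""
    root_j = psd_sqrt(cov_j)
    cross = psd_sqrt(root_j @ cov_i @ root_j)
    return float(np.sum((mu_i - mu_j) ** 2) + np.trace(cov_i + cov_j - 2.0 * cross))


def scaled_w2_squared(mu_i, mu_j, lam, d_i, d_j):
    """Scaled variant: uses only lam (eigenvalues of Sigma) and the diagonals of D_i, D_j."""
    spread = lam / d_i + lam / d_j - 2.0 * lam / np.sqrt(d_i * d_j)
    return float(np.sum((mu_i - mu_j) ** 2) + np.sum(spread))


def tied_w2_squared(mu_i, mu_j):
    """Tied variant: identical covariances, so only the means contribute."""
    return float(np.sum((mu_i - mu_j) ** 2))


# Hypothetical toy values; Sigma = diag(lam) is assumed diagonal here.
mu_i, mu_j = np.array([0.0, 1.0]), np.array([1.0, 3.0])
lam = np.array([4.0, 1.0])
d_i, d_j = np.array([1.0, 4.0]), np.array([4.0, 1.0])
cov_i, cov_j = np.diag(lam / d_i), np.diag(lam / d_j)  # from D_i Sigma_i = Sigma

full_value = full_w2_squared(mu_i, cov_i, mu_j, cov_j)
scaled_value = scaled_w2_squared(mu_i, mu_j, lam, d_i, d_j)
tied_value = tied_w2_squared(mu_i, mu_j)
```

Under the diagonal assumption the full and scaled values agree, while the tied value drops the covariance contribution entirely. For a non-diagonal $\Sigma$ the eigenvalue ordering would have to be matched to the entries of $D_i$ and $D_j$, which this sketch does not attempt.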

Properties of our approach. We remark that with our computations, we obtain not only a distance between two graphs, but also a (probabilistic) mapping between the nodes via the computed transport plan. For applications, this alignment can be very useful. Furthermore, one can use OT to compute the distance between two Mixtures with different numbers of components, meaning that we can compare graphs of different sizes. One can even define unbalanced transport plans, such that only similar nodes are mapped to each other. Moreover, our approach using the tied Wasserstein distance is very efficient, making it applicable to (sparse) graphs of size $|V|\approx 10{,}000$. For even larger graphs, reducing the number of Mixture components [21, 22] can be used to speed up computations even further.
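To illustrate how a transport plan between Mixtures with different numbers of components yields both a distance and a soft node mapping, consider the following minimal sketch. It swaps in entropy-regularised OT (Sinkhorn iterations) as a stand-in for an exact solver, uses the tied cost (squared Euclidean distance between component means), and assumes uniform component weights; the function name, the regularisation strength and the toy means are all hypothetical.

```python
import numpy as np


def sinkhorn_plan(weights_a, weights_b, cost, reg=1.0, n_iter=1000):
    """Entropy-regularised OT plan between two discrete distributions."""
    kernel = np.exp(-cost / reg)
    u = np.ones_like(weights_a)
    v = np.ones_like(weights_b)
    for _ in range(n_iter):
        u = weights_a / (kernel @ v)      # rescale rows towards weights_a
        v = weights_b / (kernel.T @ u)    # rescale columns towards weights_b
    return u[:, None] * kernel * v[None, :]


# Hypothetical component means of two Mixtures with 3 and 2 components.
means_g = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
means_h = np.array([[0.1, 0.0], [0.0, 1.9]])
weights_g = np.full(3, 1.0 / 3.0)  # assumed: one equally weighted component per node
weights_h = np.full(2, 1.0 / 2.0)

# Tied variant: the cost between components is the squared distance of their means.
cost = ((means_g[:, None, :] - means_h[None, :, :]) ** 2).sum(axis=-1)
plan = sinkhorn_plan(weights_g, weights_h, cost)
mixture_distance = float((plan * cost).sum())
```

Row `i` of `plan` shows how the mass of node `i` in the first graph is distributed over the nodes of the second graph, which is the probabilistic alignment referred to above. The regularisation blurs the plan compared to an exact solution, so the resulting value is only an approximation of the unregularised transport cost.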

4 Evaluation

We now present two experiments on both synthetic and real-world biological data to evaluate the performance of our OT-based graph distances in comparison to other common distance measures. First, we qualitatively show the meaningfulness of clusters produced from our distance measures and the capability to retain these clusters under 2D projections, commonly used in biological domains to visualize population differences. Second, a classification task is presented to provide a quantitative comparison of our approaches with other graph distance methods. It should be noted that although a supervised classification task is presented, all methods are unsupervised and do not use the labels in any way apart from the evaluation. We benchmark against the Euclidean distance between the degree (Degree) distributions, the dominant eigenvector (EV) and the Graph2Vec [23] embedding of the graphs. We also compare against the Node2Vec [24] and Role2Vec [25] embeddings using Gromov-Wasserstein as a distance measure, as well as the GOT [5] distance. All code and data used for the experiments are available at git.rwth-aachen.de/netsci/wasserstein-graph-dist-prob-embeddings/.

Synthetic Networks. This dataset consists of random networks generated with four common network models: Erdős-Rényi (ER), Watts-Strogatz (WS), Barabási-Albert (BA) and Configuration Models (CF) [26, 27, 28, 29]. From each of the four generative models we sample 20 networks with $n\in\{10,200\}$ nodes. The other parameters such as edge probability or degree distribution are chosen such that nodes in the resulting networks have an expected degree of 6.
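How an expected degree of 6 translates into model parameters can be sketched as follows. The parameter choices are standard identities rather than the authors' reported settings: an ER graph has expected degree $p(n-1)$, a WS ring with $k$ neighbours keeps mean degree $k$ under rewiring, and a BA graph attaching $m$ edges per new node approaches mean degree $2m$ for large $n$. The function names are hypothetical, and only the ER sampler is implemented, in plain Python.

```python
import random


def expected_degree_parameters(n, mean_degree=6):
    """Model parameters that target a mean degree of `mean_degree` (illustrative)."""
    return {
        "er_edge_probability": mean_degree / (n - 1),
        "ws_ring_neighbours": mean_degree,
        "ba_edges_per_new_node": mean_degree // 2,
    }


def sample_er_graph(n, p, rng):
    """Sample an undirected Erdős-Rényi graph as a set of edges (u, v) with u < v."""
    return {(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p}


rng = random.Random(0)
n = 200
params = expected_degree_parameters(n)
graphs = [sample_er_graph(n, params["er_edge_probability"], rng) for _ in range(20)]

# Each edge contributes to the degree of two nodes.
mean_degree = sum(2 * len(edges) / n for edges in graphs) / len(graphs)
```

Averaged over the 20 sampled graphs, `mean_degree` lies close to the target of 6, with the remaining deviation due to sampling noise.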

Functional Brain Connectivity Networks. This dataset consists of Functional Brain Connectivity (FC) networks [30, 31] calculated as the Pearson correlation between the neural activity traces of different brain regions defined by the Allen Brain Atlas [32]. This results in complete, weighted graphs with $n=64$ nodes. The neural activity is recorded through Widefield Calcium Imaging [33, 34, 35, 36] while the mice perform a virtual maze experiment under a two-alternative forced choice paradigm [37, 38, 39, 40]. A trial in this experiment consists of the mice perceiving a uni- or multisensory stimulus and responding accordingly after a short delay to get rewarded. The FC networks we use are grouped into 5 classes: Default Mode Network (DMN), Left- & Right Stimulus (LS&RS) and Left- & Right Response (LR&RR). While the DMN corresponds to the baseline neural connectivity between the trials, the stimuli and response classes contain the FC networks from the respective phases within the trial. Each class contains 200 FC networks from 3 different experimental sessions of the same subject.
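The construction of one FC network from activity traces can be sketched as below. The traces are synthetic random data standing in for recorded activity, the number of time points is hypothetical, and removing the diagonal (self-correlations) is an assumption of this sketch rather than a step stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_timepoints = 64, 500  # 64 regions as in the text; 500 is hypothetical

# Synthetic stand-in for the activity trace of each brain region.
traces = rng.standard_normal((n_regions, n_timepoints))

# np.corrcoef treats each row as one variable, giving a 64 x 64 Pearson matrix.
fc = np.corrcoef(traces)

# Assumed post-processing: drop self-correlations so the graph has no self-loops.
adjacency = fc.copy()
np.fill_diagonal(adjacency, 0.0)
```

The resulting matrix is symmetric and dense, which corresponds to the complete, weighted graph on 64 nodes described above.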

[Figure 1 panels: (a) Synthetic Networks Distances; (b) CCB-TiedW UMAP Projection; (c) FC Network Distances; (d) CCB-TiedW UMAP Projection]
Fig. 1: CCB-TiedW distances for the synthetic networks (a) and the functional connectivity networks (c). The networks are ordered according to the hierarchical clustering dendrogram where small heights correspond to small cluster distances. These distances can also be projected into a 2D space using UMAP for visualization purposes (b,d). Class memberships are indicated by the same color scheme in both corresponding plots.

Setup. On both the synthetic and the real-world dataset, we compute the pairwise distances as defined by CCB and CNP between all graphs. For both experiments, the parameters chosen for our embedding methods on the real-world data were $s=1000$ samples, $k=10$ colors and depth $d=5$. Larger values for these parameters generally did not decrease performance, but only increased computation times. Both embeddings therefore do not appear sensitive to the specific parameter selection. We aim to show this more rigorously as part of a sensitivity analysis in future work. To ensure a fair comparison, all competing node embedding methods, like Node2Vec and Role2Vec, were computed for the same number of total dimensions, while Graph2Vec was allowed larger dimensions as it computes only one embedding per graph. The resulting distance matrices are depicted as a hierarchically clustered heatmap in Figure 1. In the same figure, we also show a 2D projection of the distance landscape using UMAP [41]. As a quantitative comparison, a $k$-Nearest Neighbor (kNN) classification [42] is performed on both datasets based on the precomputed distances. This provides a measure for how proximities in these distances reflect the true class membership of the graphs. For this purpose, a weighted kNN classifier ($k=5$) is used, which weights points by the inverse of their distance. This gives a higher importance to closer neighbors. To validate the generalizability of the computed distances, a k-Fold Cross-validation scheme is deployed. This means that the neighbors the classification is based on are from a training subset of the graphs, while the evaluation of the actual classification is done on a separate test set containing unseen graphs. We test on 20% of the data in each of the 20 splits. The mean accuracy and its standard deviation over the splits are shown in Table 1.
We also report the silhouette score as a measure of cluster density. Computation times are given as an average over all pairwise computed distances.
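The classification protocol on a precomputed distance matrix can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the 20 splits with a 20% test set are read here as repeated random splits, and the toy distance matrix is built from points on a line instead of graph distances.

```python
import numpy as np


def weighted_knn_predict(dist_test_train, train_labels, k=5, eps=1e-12):
    """Predict labels from a (n_test, n_train) distance matrix with inverse-distance votes."""
    classes = np.unique(train_labels)
    predictions = []
    for row in dist_test_train:
        nearest = np.argsort(row)[:k]
        weights = 1.0 / (row[nearest] + eps)  # eps avoids division by zero
        votes = [weights[train_labels[nearest] == c].sum() for c in classes]
        predictions.append(classes[int(np.argmax(votes))])
    return np.array(predictions)


def repeated_split_accuracy(dist, labels, n_splits=20, test_size=0.2, k=5, seed=0):
    """Mean and standard deviation of the accuracy over random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    n_test = int(round(test_size * n))
    scores = []
    for _ in range(n_splits):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        # Neighbours are searched among training graphs only.
        pred = weighted_knn_predict(dist[np.ix_(test_idx, train_idx)], labels[train_idx], k=k)
        scores.append(np.mean(pred == labels[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))


# Hypothetical toy data: two well-separated groups of "graphs" on a line.
points = np.concatenate([np.linspace(0.0, 1.0, 15), np.linspace(10.0, 11.0, 15)])
labels = np.array([0] * 15 + [1] * 15)
dist = np.abs(points[:, None] - points[None, :])
mean_acc, std_acc = repeated_split_accuracy(dist, labels)
```

Because the classifier only ever sees the distance matrix, any of the compared graph distances could be plugged in for `dist` without changing the protocol.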
Discussion. On the synthetic graphs we can see that the clusters are generally well separated with small inner cluster distances and large distances between clusters. Additionally, the hierarchical clustering shows that these clusters can be found by relatively simple algorithms given our precomputed distances. This stands in stark contrast to the distances computed by other approaches that did not recover any meaningful clusters. Table 1 also shows this trend as classification and silhouette scores are high for our approaches and considerably worse for the competition. The corresponding figures for all considered methods can be found in the supplementary material.
In the real-world data, we can see the presence of noise, with some networks not clearly distinguished according to their classes. However, this is to be expected due to the nature of behavioral experimental data where observed behaviors are not always caused by the activation of the hypothesized neural pathways. More specifically, while we have only included trials where the mouse responded correctly to the presented stimuli, this behavior might also happen by random chance if the mouse is disengaged. Despite the presence of noise, CCB-TiedW successfully finds meaningful clusters w.r.t. the defined classes. Most trials from the DMN, LR and RR classes are clearly clustered together in the heatmap and projection in Figure 1. Interestingly, trials from the LS and RS classes both form two well-separated sub-clusters with similar structures, which cannot be explained by the stimulus type as visual and tactile trials were present in both clusters. This could hint at trials where the mouse was not engaged and that are thus further away from the responses and closer to the default mode network. Such exemplary findings of unexpected but consistent within-population differences beyond known labels illustrate how unsupervised methods can help explore real-world data and provide directions for further investigation.
The run times of CNP and CCB in Table 1 show that the tied and scaled Wasserstein formulations provide a significant speedup without a loss in performance. Competing OT-based methods are on average around 3 to 10 times slower, while Euclidean-distance-based methods are faster but fail to differentiate the graph classes.

Method     | OT | Random Graphs                | Functional Connectivity
           |    | KNN       | Silh. | t (ms)   | KNN       | Silh. | t (ms)
Degree     |    | 0.25±0.1  | -.082 | <0.01    | 0.53±0.04 | -.074 | <0.01
EV         |    | 0.59±0.09 | .02   | 0.05     | 0.44±0.03 | -.047 | 0.01
Graph2Vec  |    | 0.51±0.14 | .01   | 0.08     | 0.33±0.02 | -.168 | 0.01
Node2Vec   | GW | 0.61±0.10 | -.003 | 390      | 0.76±0.03 | .133  | 14.74
Role2Vec   | GW | 0.71±0.10 | -.014 | 109      | 0.78±0.03 | .032  | 9.67
GOT        | W  | –         | –     | –        | 0.68±0.03 | -.209 | 24.30
CNP-Tied   | W  | 0.90±0.06 | .550  | 40       | 0.59±0.03 | -.169 | 2.85
CCB-Tied   | W  | 0.91±0.06 | .353  | 36       | 0.82±0.02 | -.019 | 2.60
CNP-Scaled | W  | 0.93±0.07 | .512  | 57       | 0.58±0.03 | -.170 | 14.22
CCB-Scaled | W  | 0.90±0.06 | .385  | 52       | 0.81±0.03 | -.021 | 14.11
CNP-Full   | W  | 0.93±0.05 | .528  | 178      | 0.59±0.03 | -.167 | 36.57
CCB-Full   | W  | 0.92±0.05 | .358  | 170      | 0.83±0.02 | -.015 | 50.54
Table 1: Weighted k-NN ($k=5$) classification scores on synthetic and real-world data given as $\mu\pm\sigma$. Classification is performed under a 20-fold cross validation with a relative test set size of 20%. OT-based methods are grouped into Gromov-Wasserstein (GW) and Wasserstein (W) distances. Computation times t are averaged over all pairwise computed distances. – indicates that the provided method is not implemented for graphs of different sizes.

5 Conclusion

We introduced an Optimal Transport framework that represents each graph as a Gaussian Mixture of probabilistic node embeddings. This enabled the use of the Wasserstein distance instead of the widely used Gromov-Wasserstein distance. We introduced two probabilistic node embeddings that fulfill the requirements of the framework and highlight different properties of the graph. Further, we derived theoretical properties of the resulting graph distances and showed their efficiency and performance on both synthetic and real-world data.

6 References

  • [1] Steven H Strogatz “Exploring complex networks” In Nature 410.6825 Nature Publishing Group UK London, 2001, pp. 268–276
  • [2] Damian Szklarczyk et al. “STRING v10: protein–protein interaction networks, integrated over the tree of life” In Nucleic acids research 43.D1 Oxford University Press, 2015, pp. D447–D452
  • [3] Mikail Rubinov and Olaf Sporns “Complex network measures of brain connectivity: uses and interpretations” In Neuroimage 52.3 Elsevier, 2010, pp. 1059–1069
  • [4] Stephen P Borgatti, Ajay Mehra, Daniel J Brass and Giuseppe Labianca “Network analysis in the social sciences” In Science 323.5916 American Association for the Advancement of Science, 2009, pp. 892–895
  • [5] Hermina Petric Maretic, Mireille El Gheche, Giovanni Chierchia and Pascal Frossard “GOT: an optimal transport framework for graph comparison” In Neurips 32, 2019
  • [6] Amélie Barbe et al. “Graph diffusion wasserstein distances” In ECML PKDD, 2020, pp. 577–592 Springer
  • [7] Hermina Petric Maretic, Mireille El Gheche, Giovanni Chierchia and Pascal Frossard “FGOT: Graph distances based on filters and optimal transport” In AAAI 36.7, 2022
  • [8] Guixiang Ma et al. “Deep graph similarity learning for brain data analysis” In Proceedings of the 28th ACM CIKM, 2019
  • [9] Rita T Sousa, Sara Silva and Catia Pesquita “Evolving knowledge graph similarity for supervised learning in complex biomedical domains” In BMC bioinformatics 21 Springer, 2020, pp. 1–19
  • [10] Somesh Mohapatra, Joyce An and Rafael Gómez-Bombarelli “Chemistry-informed macromolecule graph representation for similarity computation, unsupervised and supervised learning” In Machine Learning: Science and Technology 3.1 IOP Publishing, 2022, pp. 015028
  • [11] Xinbo Gao, Bing Xiao, Dacheng Tao and Xuelong Li “A survey of graph edit distance” In Pattern Analysis and applications 13 Springer, 2010, pp. 113–129
  • [12] Yujie Mo et al. “Simple unsupervised graph representation learning” In AAAI 36, 2022, pp. 7797–7805
  • [13] Pau Riba, Andreas Fischer, Josep Lladós and Alicia Fornés “Learning graph edit distance by graph neural networks” In Pattern Recognition 120 Elsevier, 2021, pp. 108132
  • [14] Facundo Mémoli “Gromov–Wasserstein distances and the metric approach to object matching” In Foundations of computational mathematics 11 Springer, 2011, pp. 417–487
  • [15] Gabriel Peyré, Marco Cuturi and Justin Solomon “Gromov-wasserstein averaging of kernel and distance matrices” In ICML, 2016 PMLR
  • [16] Vayer Titouan, Nicolas Courty, Romain Tavenard and Rémi Flamary “Optimal transport for structured data with application on graphs” In ICML, 2019 PMLR
  • [17] Julie Delon and Agnes Desolneux “A Wasserstein-type distance in the space of Gaussian mixture models” In SIAM Journal on Imaging Sciences 13.2 SIAM, 2020, pp. 936–970
  • [18] Antoine Salmona, Julie Delon and Agnès Desolneux “Gromov-Wasserstein distances between Gaussian distributions” In arXiv:2104.07970, 2021
  • [19] Kathryn Cooper and Mauricio Barahona “Role-based similarity in directed networks” In arXiv:1012.2726, 2010
  • [20] Michael Scholkemper and Michael T Schaub “Local, global and scale-dependent node roles” In 2021 IEEE ICAS, 2021, pp. 1–5 IEEE
  • [21] David F Crouse, Peter Willett, Krishna Pattipati and Lennart Svensson “A look at Gaussian mixture reduction algorithms” In FUSION, 2011, pp. 1–8 IEEE
  • [22] Akbar Assa and Konstantinos N Plataniotis “Wasserstein-distance-based Gaussian mixture reduction” In IEEE Signal Processing Letters 25.10 IEEE, 2018, pp. 1465–1469
  • [23] Annamalai Narayanan et al. “graph2vec: Learning distributed representations of graphs”, 2017
  • [24] Aditya Grover and Jure Leskovec “node2vec: Scalable feature learning for networks” In ACM SIGKDD, 2016, pp. 855–864
  • [25] Nesreen K Ahmed et al. “role2vec: Role-based network embeddings” In Proc. DLG KDD, 2019, pp. 1–7
  • [26] Paul Erdős and Alfréd Rényi “On the evolution of random graphs” In Publ. math. inst. hung. acad. sci 5.1, 1960, pp. 17–60
  • [27] Duncan J Watts and Steven H Strogatz “Collective dynamics of ‘small-world’ networks” In Nature 393.6684 Nature Publishing Group, 1998
  • [28] Albert-László Barabási and Réka Albert “Emergence of scaling in random networks” In Science 286.5439 American Association for the Advancement of Science, 1999, pp. 509–512
  • [29] Mark EJ Newman, Steven H Strogatz and Duncan J Watts “Random graphs with arbitrary degree distributions and their applications” In Physical review E 64.2 APS, 2001, pp. 026118
  • [30] Bharat Biswal, F Zerrin Yetkin, Victor M Haughton and James S Hyde “Functional connectivity in the motor cortex of resting human brain using echo-planar MRI” In Magnetic resonance in medicine 34.4 Wiley Online Library, 1995, pp. 537–541
  • [31] Karl J Friston “Functional and effective connectivity: a review” In Brain connectivity 1.1, 2011, pp. 13–36
  • [32] Susan M Sunkin et al. “Allen Brain Atlas: an integrated spatio-temporal portal for exploring the central nervous system” In Nucleic acids research 41.D1 Oxford University Press, 2012
  • [33] Ryota Homma et al. “Wide-field and two-photon imaging of brain activity with voltage and calcium-sensitive dyes” In Dynamic Brain Imaging: Multi-Modal Methods and In Vivo Applications Springer, 2009, pp. 43–79
  • [34] Benjamin B Scott et al. “Imaging cortical dynamics in GCaMP transgenic rats with a head-mounted widefield macroscope” In Neuron 100.5 Elsevier, 2018, pp. 1045–1058
  • [35] Julia V Cramer et al. “In vivo widefield calcium imaging of the mouse cortex for analysis of network connectivity in health and brain disease” In Neuroimage 199 Elsevier, 2019
  • [36] Joseph B Wekselblatt, Erik D Flister, Denise M Piscopo and Cristopher M Niell “Large-scale imaging of cortical dynamics during sensory perception and behavior” In Journal of neurophysiology 115.6 American Physiological Society Bethesda, MD, 2016, pp. 2852–2866
  • [37] Marcus Leinweber et al. “Two-photon calcium imaging in mice navigating a virtual reality environment” In JoVE, 2014, pp. e50885
  • [38] Lucas Pinto et al. “An accumulation-of-evidence task using visual pulses for mice navigating in virtual reality” In Frontiers in behavioral neuroscience 12 Frontiers Media SA, 2018, pp. 36
  • [39] Johannes M Mayrhofer et al. “Novel two-alternative forced choice paradigm for bilateral vibrotactile whisker frequency discrimination in head-fixed mice and rats” In Journal of neurophysiology 109.1 American Physiological Society Bethesda, MD, 2013, pp. 273–284
  • [40] Benjamin B Scott et al. “Sources of noise during accumulation of evidence in unrestrained and voluntarily head-restrained rats” In Elife 4 eLife Sciences Publications, Ltd, 2015, pp. e11308
  • [41] Leland McInnes, John Healy and James Melville “Umap: Uniform manifold approximation and projection for dimension reduction”, 2018
  • [42] Thomas Cover and Peter Hart “Nearest neighbor pattern classification” In IEEE transactions on information theory 13.1 IEEE, 1967, pp. 21–27
  • [43] Roman Vershynin “High-dimensional probability” In University of California, Irvine, 2020
  • [44] Joel A Tropp “An introduction to matrix concentration inequalities” In Foundations and Trends® in Machine Learning 8.1-2 Now Publishers, Inc., 2015, pp. 1–230

Appendix A Plots and Tables

Appendix B Proofs.

See 1. Proof. Toward proving this claim given a graph $G$, we will first show that the Gaussian Mixture $\mathcal{M}(G)$ (see Algorithm 1) converges to a Gaussian Mixture $\overline{\mathcal{M}}(G):=\lim_{s\rightarrow\infty}\mathcal{M}(G)$. For the moment, assume this to be true. Then, we can use the following result:

Lemma 1 (Proposition 5, [17])

The distance $MW^2_2$ between Gaussian Mixtures (eq. 2) defines a metric on the space of Gaussian Mixtures.

For graphs $G_1,G_2,G_3$, this means that:

\begin{align}
\operatorname{dist}(G_1,G_1) &= MW^2_2(\overline{\mathcal{M}}(G_1),\overline{\mathcal{M}}(G_1)) = 0 \tag{3}\\
\operatorname{dist}(G_1,G_2) &= MW^2_2(\overline{\mathcal{M}}(G_1),\overline{\mathcal{M}}(G_2))\\
&= MW^2_2(\overline{\mathcal{M}}(G_2),\overline{\mathcal{M}}(G_1)) = \operatorname{dist}(G_2,G_1)\\
\operatorname{dist}(G_1,G_3) &= MW^2_2(\overline{\mathcal{M}}(G_1),\overline{\mathcal{M}}(G_3))\\
&\leq MW^2_2(\overline{\mathcal{M}}(G_1),\overline{\mathcal{M}}(G_2)) + MW^2_2(\overline{\mathcal{M}}(G_2),\overline{\mathcal{M}}(G_3))\\
&= \operatorname{dist}(G_1,G_2)+\operatorname{dist}(G_2,G_3)
\end{align}

proving that the distance is a pseudometric. To show convergence, consider the embedding $\varphi^{(i)}(v)$. Each instance is bounded by $\|\varphi^{(i)}(v)\|_{\infty}\leq\delta^{d}$, where $d$ is the number of adjacency matrix powers used and $\delta$ is the maximum degree in the graph. Since we only allow non-negative edge weights, each component is bounded by $0\leq\varphi^{(i)}_{j}(v)\leq\delta^{d}$. We apply Hoeffding's inequality [43]:

$$\Pr\left(\left|\frac{1}{s}\sum_{i=1}^{s}\varphi^{(i)}_{j}(v)-\mathbb{E}[\varphi_{j}(v)]\right|\geq\epsilon\right)\leq 2\exp\left(-\frac{2s\epsilon^{2}}{\delta^{2d}}\right)$$

We now union bound over all $kd$ components of the embedding $\varphi^{(i)}(v)$:

$$\Pr\left(\left\|\frac{1}{s}\sum_{i=1}^{s}\varphi^{(i)}(v)-\mathbb{E}[\varphi(v)]\right\|_{\infty}\geq\epsilon\right)\leq 2kd\exp\left(-\frac{2s\epsilon^{2}}{\delta^{2d}}\right)$$
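To get a feeling for the union bound above, one can solve it for the number of samples $s$ that pushes the failure probability below a chosen level $\eta$, which gives $s\geq\delta^{2d}\ln(2kd/\eta)/(2\epsilon^{2})$. The sketch below evaluates this; the symbol $\eta$, the function names and the example values are hypothetical and not taken from the experiments.

```python
import math


def hoeffding_union_bound(s, eps, delta, d, k):
    """Right-hand side of the union bound: 2*k*d*exp(-2*s*eps^2 / delta^(2d))."""
    return 2 * k * d * math.exp(-2 * s * eps ** 2 / delta ** (2 * d))


def samples_needed(eps, eta, delta, d, k):
    """Smallest integer s for which the bound is at most eta."""
    return math.ceil(delta ** (2 * d) * math.log(2 * k * d / eta) / (2 * eps ** 2))


# Hypothetical values: maximum degree 2, depth 2, 10 colors, tolerance 0.5, level 5%.
s_min = samples_needed(eps=0.5, eta=0.05, delta=2, d=2, k=10)
bound_at_s_min = hoeffding_union_bound(s_min, eps=0.5, delta=2, d=2, k=10)
```

The factor $\delta^{2d}$ makes the required number of samples grow quickly with the maximum degree and the depth, so the bound is mainly useful as a qualitative convergence guarantee.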

This proves that the maximum likelihood estimator $\frac{1}{s}\sum_{i=1}^{s}\varphi^{(i)}(v)$ converges to the expectation $\mathbb{E}[\varphi(v)]$ as $s\rightarrow\infty$. For the covariance, we apply the matrix Bernstein inequality (Corollary 6.2.1, [44]) to our maximum likelihood covariance estimator $\frac{1}{s}\sum_{i=1}^{s}(\varphi^{(i)}-\mu)(\varphi^{(i)}-\mu)^{\top}$. Let $x_{i}=\left(\varphi^{(i)}(v)-\frac{1}{s}\sum_{i=1}^{s}\varphi^{(i)}(v)\right)$, then:

$$\Pr\left(\left\|\frac{1}{s}\sum_{i=1}^{s}x_{i}x_{i}^{\top}-\mathbb{E}[xx^{\top}]\right\|_{\infty}\geq\epsilon\right)\leq 2kd\exp\left(-\frac{s\epsilon^{2}/2}{kd\,\delta^{2}\left(\frac{2}{3}+\delta^{2}\right)}\right)$$

Again, this proves that the maximum likelihood estimator of the covariance converges to the expected covariance as $s\rightarrow\infty$. Combining the two results, we can see that the Gaussian component $\mathcal{N}_{v}$ representing a node in the graph converges to the expected Gaussian $\overline{\mathcal{N}}_{v}=\mathcal{N}(\mathbb{E}[\varphi(v)],\mathbb{E}[xx^{\top}])$. One final union bound yields that the whole Gaussian Mixture converges to a Gaussian Mixture $\overline{\mathcal{M}}(G)=\sum_{v\in V}\overline{\mathcal{N}}_{v}$ as $s\rightarrow\infty$. Finally, to show that the CNP is a pseudometric on the space of graphs, we can use the same argument as above. Additionally, we need to show that CNP converges to the same Gaussian Mixture for two isomorphic graphs $G\simeq G^{\prime}$. Let $A,A^{\prime}$ be the adjacency matrices of $G,G^{\prime}$ respectively, then $PAP^{\top}=A^{\prime}$ for some permutation matrix $P$.
Recall the definition of CNP:

φ¯CNP(v,H,d)=missingi=1d+1Mi,_ with M=lex-sort([1AAv,_0H1AdAv,_dH])\bar{\varphi}_{\text{CNP}}(v,H,d)={\mathbin{\bigg{missing}}\|}_{i=1}^{d+1}M_{i% ,\_}\text{ with }M=\text{lex-sort}\left(\begin{bmatrix}\frac{1}{\|A\|}A^{0}_{v% ,\_}H\\ \vdots\\ \frac{1}{\|A\|^{d}}A^{d}_{v,\_}H\end{bmatrix}\right)over¯ start_ARG italic_φ end_ARG start_POSTSUBSCRIPT CNP end_POSTSUBSCRIPT ( italic_v , italic_H , italic_d ) = roman_missing ∥ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d + 1 end_POSTSUPERSCRIPT italic_M start_POSTSUBSCRIPT italic_i , _ end_POSTSUBSCRIPT with italic_M = lex-sort ( [ start_ARG start_ROW start_CELL divide start_ARG 1 end_ARG start_ARG ∥ italic_A ∥ end_ARG italic_A start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_v , _ end_POSTSUBSCRIPT italic_H end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL divide start_ARG 1 end_ARG start_ARG ∥ italic_A ∥ start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG italic_A start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_v , _ end_POSTSUBSCRIPT italic_H end_CELL end_ROW end_ARG ] )

and consider what happens when you permute the rows of $H$ (so that all nodes have the same color in both graphs) and, after the transformation, permute them back:

$$P^{\top} A^{\prime} P H = P^{\top} P A P^{\top} P H = AH$$

This also extends to matrix powers. Hence, the node $v$ has the same color as the node $v^{\prime}$ it is mapped to by the isomorphism in the other graph, so the Gaussian Mixture will have the exact same components (in a different order). Also, since $H$ is sampled uniformly i.i.d., $H$ and $PH$ have the same probability of being sampled. Thus, the two distributions are in fact the same. This proves that the CNP is a pseudometric on the space of graphs.
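The permutation argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a one-hot color matrix `H`, a random undirected graph, and omits the normalization by $\|A\|^{k}$ (which is unaffected by the permutation for any permutation-invariant norm). The helper `propagated` and all sizes are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, d = 6, 3, 4  # nodes, colors, maximum power (illustrative values)

# Random symmetric 0/1 adjacency matrix without self-loops.
upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = upper + upper.T

# Random permutation matrix P and the isomorphic copy A' = P A P^T.
P = np.eye(n, dtype=int)[rng.permutation(n)]
A_prime = P @ A @ P.T

# Assumed one-hot color matrix H: each node gets one of c colors at random.
H = np.eye(c, dtype=int)[rng.integers(0, c, size=n)]


def propagated(adj, colors, k):
    """Return adj^k @ colors, the unnormalized k-step color propagation."""
    return np.linalg.matrix_power(adj, k) @ colors


# Check P^T (A')^k P H == A^k H for every power k = 0, ..., d.
checks = [
    np.array_equal(P.T @ propagated(A_prime, P @ H, k), propagated(A, H, k))
    for k in range(d + 1)
]
print(checks)
```

All arithmetic is on integers, so the comparison is exact rather than up to floating-point tolerance.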

See 2 Proof. Consider the eigenvalue decomposition of the non-negative, symmetric, real matrix $\Sigma = V \Lambda V^{\top}$. In the trace term of the closed form solution of the Wasserstein distance, we have:

$$\begin{aligned}
\operatorname{Tr}(\Sigma_{i}) &= \operatorname{Tr}(D_{i}^{-\frac{1}{2}} \Sigma D_{i}^{-\frac{1}{2}}) = \operatorname{Tr}(D_{i}^{-\frac{1}{2}} V \Lambda V^{\top} D_{i}^{-\frac{1}{2}}) \\
&= \operatorname{Tr}(V^{\top} D_{i}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} V \Lambda) = \operatorname{Tr}(D_{i}^{-1} \Lambda)
\end{aligned}$$

By the same reasoning, $\operatorname{Tr}(\Sigma_{j}) = \operatorname{Tr}(D_{j}^{-1} \Lambda)$. Regarding the last term, we can use that $V^{\top} D V = D$ for any diagonal matrix $D$ and the fact that diagonal matrices commute:

$$\begin{aligned}
&(D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}})^{2} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} D_{i}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda V^{\top} D_{i}^{-\frac{1}{2}} \\
&= \Sigma_{i}
\end{aligned}$$

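Hence $\Sigma_{i}^{\frac{1}{2}} = D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}}$. Similarly, for the product inside the last term: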

$$\begin{aligned}
&\Sigma_{i}^{\frac{1}{2}} \Sigma_{j} \Sigma_{i}^{\frac{1}{2}} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} D_{j}^{-\frac{1}{2}} V \Lambda V^{\top} D_{j}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} D_{i}^{-\frac{1}{2}} D_{j}^{-\frac{1}{2}} \Lambda D_{j}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} \Lambda^{\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}} \\
&= D_{i}^{-\frac{1}{2}} V \Lambda^{2} D_{j}^{-1} V^{\top} D_{i}^{-\frac{1}{2}}
\end{aligned}$$

We can now fairly easily see that:

$$(\Sigma_{i}^{\frac{1}{2}} \Sigma_{j} \Sigma_{i}^{\frac{1}{2}})^{\frac{1}{2}} = D_{i}^{-\frac{1}{2}} V \Lambda D_{j}^{-\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}}$$

Since the trace is invariant under cyclic permutations, we can write:

$$\begin{aligned}
\operatorname{Tr}((\Sigma_{i}^{\frac{1}{2}} \Sigma_{j} \Sigma_{i}^{\frac{1}{2}})^{\frac{1}{2}})
&= \operatorname{Tr}(D_{i}^{-\frac{1}{2}} V \Lambda D_{j}^{-\frac{1}{2}} D_{i}^{\frac{1}{2}} V^{\top} D_{i}^{-\frac{1}{2}}) \\
&= \operatorname{Tr}(V^{\top} D_{i}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} V \Lambda D_{j}^{-\frac{1}{2}} D_{i}^{\frac{1}{2}}) \\
&= \operatorname{Tr}(D_{i}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}} \Lambda D_{j}^{-\frac{1}{2}} D_{i}^{\frac{1}{2}}) \\
&= \operatorname{Tr}(\Lambda D_{j}^{-\frac{1}{2}} D_{i}^{-\frac{1}{2}}) \\
&= \sum_{x=1}^{n} \frac{\lambda_{x}}{\sqrt{d^{(i)}_{x} d^{(j)}_{x}}}
\end{aligned}$$

Plugging this in yields the claim.
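As a numerical sanity check, the trace identities can be compared against a direct computation. The sketch below is illustrative only and is restricted to a special case chosen here as an assumption: $D_{i}$ and $D_{j}$ are scalar multiples of the identity, so that the step $V^{\top} D V = D$ used in the proof holds exactly. The helper `psd_sqrt`, the dimension and the scalar values are hypothetical and not taken from the paper.

```python
import numpy as np


def psd_sqrt(M):
    """Principal square root of a symmetric PSD matrix via eigendecomposition."""
    M = (M + M.T) / 2.0  # symmetrize to suppress round-off asymmetry
    w, U = np.linalg.eigh(M)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T


rng = np.random.default_rng(1)
n = 5  # dimension (illustrative)

# Random symmetric PSD matrix Sigma; lam holds its eigenvalues (diagonal of Lambda).
B = rng.normal(size=(n, n))
Sigma = B @ B.T
lam = np.linalg.eigvalsh(Sigma)

# Assumption: D_i = d_i * I and D_j = d_j * I, so V^T D V = D holds exactly.
d_i, d_j = 2.0, 3.5
Sigma_i = Sigma / d_i  # D_i^{-1/2} Sigma D_i^{-1/2}
Sigma_j = Sigma / d_j  # D_j^{-1/2} Sigma D_j^{-1/2}

# Direct evaluation of Tr((Sigma_i^{1/2} Sigma_j Sigma_i^{1/2})^{1/2}).
root_i = psd_sqrt(Sigma_i)
cross_direct = np.trace(psd_sqrt(root_i @ Sigma_j @ root_i))

# Closed forms from the proof: sum_x lam_x / sqrt(d_i d_j) and Tr(D_i^{-1} Lambda).
cross_closed = np.sum(lam / np.sqrt(d_i * d_j))
trace_i_closed = np.sum(lam / d_i)

print(np.isclose(cross_direct, cross_closed))
print(np.isclose(np.trace(Sigma_i), trace_i_closed))
```

The closed form only needs the eigenvalues of $\Sigma$ and the diagonal entries of $D_{i}, D_{j}$, which is where the computational saving over repeated matrix square roots comes from.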