1 Introduction

Let \(({\Omega }^{1}{,\mathcal {F}^1},{\varvec{p}}^{1})\) and \(({\Omega }^{2}{,\mathcal {F}^2},{\varvec{p}}^{2})\) be two probability spaces. Given a cost function \({d}^{}: {\Omega }^{1} \times {\Omega }^{2} \rightarrow \mathbb {R}^{\ge 0}\), and \(\ell \ge 1\), the (Kantorovich) Optimal Transport (OT) problem (Kantorovich, 1942) seeks:

$$\begin{aligned} \inf _{{\varvec{p}}^{}} \bigg (\;{\int }_{{\Omega }^{1} \times {\Omega }^{2}} {d}^{\ell } \textrm{d}{\varvec{p}}^{}\bigg )^{\frac{1}{\ell }} \text {, subject to: } {\int }_{{\Omega }^{1}} \textrm{d}{\varvec{p}}^{} = {\varvec{p}}^{2} \text { and } {\int }_{{\Omega }^{2}} \textrm{d}{\varvec{p}}^{} = {\varvec{p}}^{1}, \end{aligned}$$
(1)

where the \(\inf\) is over measures \({\varvec{p}}^{}\) on \({\Omega }^{1} \times {\Omega }^{2}\). Problem (1) is typically studied under the assumptions that \({\Omega }^{1}\) and \({\Omega }^{2}\) are in a Polish space on which d is a metric, in which case the minimum of (1) is the Wasserstein distance (WD). The WD is popular in many applications including shape interpolation (Solomon, 2015), generative modeling (Arjovsky et al., 2017; Fan et al., 2017), domain adaptation (Damodaran et al., 2018), and dictionary learning (Schmitz et al., 2018). The WD is a metric on the space of probability measures (Ambrosio & Gigli, 2013), and this property is useful in many ML tasks, e.g., clustering (Xing et al., 2003; Hartigan, 1975), nearest-neighbor search (Clarkson, 1999, 2006; Beygelzimer et al., 2006), and outlier detection (Angiulli & Pizzuti, 2002). Indeed, some of these tasks are tractable, or allow theoretical guarantees, when done on a metric space. E.g., finding the nearest neighbor (Beygelzimer et al., 2006; Clarkson, 1999, 2006) or the diameter (Indyk, 1999) of a dataset requires a polylogarithmic computational effort under metric assumptions; approximation algorithms for clustering rely on metric assumptions, whose absence worsens known bounds (Ackermann et al., 2010); also, Mémoli (2011) uses the metric properties of the WD to study object matching via metric invariants.

Recently, a generalization of OT to multiple marginal measures has gained attention. Given probability spaces \(({\Omega }^{i}{,\mathcal {F}^i},{\varvec{p}}^{i})\), \(i = 1,\ldots ,n\), a function \(d:{\Omega }^{1}\times \cdots \times {\Omega }^{n}\mapsto \mathbb {R}^{\ge 0}\), a number \(\ell \ge 1\), and \({\Omega }^{ \backslash i}\triangleq {\Omega }^{1}\times \cdots \times {\Omega }^{i-1}\times {\Omega }^{i+1}\times \cdots \times {\Omega }^{n}\), Multi-Marginal Optimal Transport (MMOT) seeks:

$$\begin{aligned} \inf _{{\varvec{p}}^{}} \left( {\int }_{{\Omega }^{1} \times \cdots \times {\Omega }^{n}} {d}^{\ell } \textrm{d}{\varvec{p}}^{} \right) ^{\frac{1}{\ell }} \text {, subject to: } {\int }_{{\Omega }^{\backslash i}} \textrm{d}{\varvec{p}}^{} = {\varvec{p}}^{i}, i = 1,\ldots ,n, \end{aligned}$$
(2)

where the infimum is taken over measures \({\varvec{p}}^{}\) on \({\Omega }^{1} \times \cdots \times {\Omega }^{n}\). The term MMOT was coined in Pass (2012), and MMOT was surveyed by the same author in Pass (2015).

Unfortunately, there is a lack of discussion about the (generalized) metric properties of MMOT. Much of the discussion on MMOT has focused on the existence of a minimizer, the uniqueness and structure of both Monge and Kantorovich solutions, applications, practical algorithms, and the choice of the cost function (Gerolin et al., 2019; Moameni & Pass, 2017; Pass, 2014; Peyré & Cuturi, 2019). Since the metric property of the WD is useful in many applications, understanding when the (potential) minimum of (2), \({\mathcal {W}}({\varvec{p}}^{1},\ldots ,{\varvec{p}}^{n})\), a multi-way distance, has metric-like properties is critical. For example, just as metric properties can improve distance-based clustering, generalized metrics can improve clustering based on multi-way distances. A precise definition of generalized metrics, and of the generalized triangle inequality that they must satisfy, is presented in Sect. 2. Later, in Sect. 5, we illustrate the improvement obtained by using metrics on a task of clustering chemical compounds. Importantly, several algorithms in Xing et al. (2003), Hartigan (1975), Clarkson (2006), Clarkson (1999), Beygelzimer et al. (2006), Angiulli and Pizzuti (2002), Indyk (1999), Ackermann et al. (2010), and more, which take distances (including the WD) as input, have guarantees if the distances are metrics. These algorithms extend to taking multi-way distances as input, and hence can use MMOT, and they retain guarantees under generalized metrics similar to those under classic metrics. We now exemplify these extensions, and their potential applications:

Example 1

Given a set S with \(n\) distinct distributions we can find its 3-diameter \(\Delta \triangleq \max _{{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3} \in S} {\mathcal {W}}({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3})\) with \(\binom{n}{3}\) evaluations of \({\mathcal {W}}\). What if \({\mathcal {W}}\) satisfies the generalized triangle inequality \({\mathcal {W}}^{1,2,3}\le {\mathcal {W}}^{4,2,3} + {\mathcal {W}}^{1,4,3} + {\mathcal {W}}^{1,2,4}\), where \({\mathcal {W}}^{i,j,k} \triangleq {\mathcal {W}}({\varvec{p}}^{i},{\varvec{p}}^{j},{\varvec{p}}^{k})\)? Let \(\Delta = {\mathcal {W}}({{{\varvec{p}}}^{*}}^{1},{{{\varvec{p}}}^{*}}^{2},{{{\varvec{p}}}^{*}}^{3})\) for some \({{{\varvec{p}}}^{*}}^{1},{{{\varvec{p}}}^{*}}^{2},{{{\varvec{p}}}^{*}}^{3}\) and let us extend the notation \({\mathcal {W}}^{i,j,k}\) such that e.g. \({\mathcal {W}}^{i,*j,k} \triangleq {\mathcal {W}}({\varvec{p}}^{i},{{{\varvec{p}}}^{*}}^{j},{\varvec{p}}^{k})\). We can now show that for at least \(n/3\) distribution triplets \({\mathcal {W}}\ge \Delta /3\). Indeed, for all \({\varvec{p}}^{4} \in S\), we cannot simultaneously have \({\mathcal {W}}^{4,*2,*3},{\mathcal {W}}^{*1,*2,4},{\mathcal {W}}^{*1,4,*3}< \Delta /3\). Therefore, if we evaluate \({\mathcal {W}}\) on random distribution triplets, the probability of finding a triplet for which \({\mathcal {W}}\ge \Delta /3\) is \(\Omega (1/n^2)\). Hence, we are guaranteed to find an approximation \(\Delta '\) of \(\Delta\) that satisfies \(\Delta \ge \Delta ' \ge \Delta /3\) (also called a (1/3)-approximation) with only \(\mathcal {O}(n^2)\) evaluations of \({\mathcal {W}}\) on average, an improvement over \(\binom{n}{3}\). Diameter estimation relates to outlier detection (Angiulli & Pizzuti, 2002), which is critical e.g. in cybersecurity (Singh & Upadhyaya, 2012).
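The argument above translates directly into a sampling procedure. The following is a minimal Python sketch; the multi-distance callable W and the list of distributions dists are placeholders for whatever implementation of \({\mathcal {W}}\) is available, and are not part of our code base.

```python
import random

def approx_3_diameter(dists, W, num_samples):
    """Estimate the 3-diameter of `dists` by evaluating the multi-distance
    `W` on randomly sampled distinct triplets.  If W satisfies the
    generalized triangle inequality, O(n^2) samples suffice on average to
    return a value Delta' with Delta >= Delta' >= Delta / 3."""
    best = 0.0
    n = len(dists)
    for _ in range(num_samples):
        i, j, k = random.sample(range(n), 3)   # a random distinct triplet
        best = max(best, W(dists[i], dists[j], dists[k]))
    return best  # never exceeds the true 3-diameter Delta

# Usage sketch: with n distributions, num_samples = c * n**2 for a small
# constant c gives a (1/3)-approximation with high probability.
```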

Example 2

Let S be as above. We can compute the average \(A \triangleq \sum _{{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3} \in S} {{\mathcal {W}}^{}}({{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}})/\binom{n}{3}\) with \(\binom{n}{3}\) evaluations of \({{\mathcal {W}}^{}}\). We can estimate A by averaging \({\mathcal {W}}\) over a set with \(\mathcal {O}(n^2)\) distinct triplets randomly sampled from S, improving over \(\binom{n}{3}\).

If \({{\mathcal {W}}^{}}\) is a generalized metric, an argument as in Example 1 shows that with high probability we do not miss triplets with large \({{\mathcal {W}}^{}}\), which is the critical step to prove that we approximate A well. Average estimation is critical e.g. in differential privacy (Dwork & Lei, 2009).

Example 3

Let S be as above. Consider building a hypergraph with nodes S and hyperedges defined as follows. For each distinct triplet \(({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3})\) for which \({{\mathcal {W}}^{}}({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}) < thr\), a constant threshold, include it as a hyperedge. Hypergraphs are increasingly important in modern ML, especially for clustering using multiwise relationships among objects (Ghoshdastidar & Dukkipati, 2015; Purkait et al., 2016). Let \({{\mathcal {W}}^{}}\) satisfy the triangle inequality in Example 1 and be invariant under permutations of its arguments. Let us again define \({{\mathcal {W}}^{}}({\varvec{p}}^{i},{\varvec{p}}^{j},{\varvec{p}}^{k}) = {{\mathcal {W}}^{i,j,k}}\). One can prove that, for any \({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\), \({{\mathcal {W}}^{1,2,3}} \ge \max \{ {{\mathcal {W}}^{1,2,4}}-{{\mathcal {W}}^{1,3,4}}-{{\mathcal {W}}^{2,3,4}} , {{\mathcal {W}}^{2,3,4}}-{{\mathcal {W}}^{1,3,4}}-{{\mathcal {W}}^{1,2,4}}, {{\mathcal {W}}^{1,3,4}}- {{\mathcal {W}}^{1,2,4}}-{{\mathcal {W}}^{2,3,4}}\}\). We can use this inequality to quickly discard triplets with large \({{\mathcal {W}}^{1,2,3}}\) without evaluating them: plug into it already computed values for \({{\mathcal {W}}^{1,2,4}}\), \({{\mathcal {W}}^{1,3,4}}\), and \({{\mathcal {W}}^{2,3,4}}\), do this for multiple \({\varvec{p}}^{4}\) to try to maximize the r.h.s., and check if the r.h.s. is larger than thr, as in the sketch below.
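The pruning rule can be sketched as follows; the values W_124, W_134, W_234 and the threshold thr are placeholders for already computed multi-distances, not functions from our code base.

```python
def lower_bound_W123(W_124, W_134, W_234):
    """Lower bound on W^{1,2,3} implied by the generalized triangle
    inequality and permutation invariance (Example 3)."""
    return max(W_124 - W_134 - W_234,
               W_234 - W_134 - W_124,
               W_134 - W_124 - W_234)

def can_skip_triplet(cached_values, thr):
    """`cached_values` holds tuples (W_124, W_134, W_234), one per choice of
    p^4; the triplet (1, 2, 3) can be skipped if some choice certifies that
    W^{1,2,3} must exceed the threshold `thr`."""
    return any(lower_bound_W123(*c) > thr for c in cached_values)
```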

In this paper, we show that an important family of pairwise MMOTs defines a generalized metric. The idea and advantages of using pairwise-type MMOTs are not new. Previous work Altschuler and Boix-Adsera (2020), Altschuler and Boix-Adsera (2021), Benamou et al. (2015, 2016), Fan et al. (2022), Haasler et al. (2021), Haasler et al. (2021), however, uses them to make a point about concepts other than their metric properties. Most often, the focus is on the polynomial computability of MMOT under extra assumptions. To the best of our knowledge, we are the first to study the generalized metric properties of families of pairwise MMOT. There are many applications of MMOT, and of pairwise-type MMOT more specifically. In addition to the applications referenced in the aforementioned papers, the survey (Pass, 2015) lists applications in economics, physics, and financial mathematics. Specific examples of applications that use pairwise MMOT and that may benefit from the generalized metric properties we prove are locality sensitive hashing, the labeling problem in classification, image registration, multi-agent matching (Li & Anantharam, 2019), and optimal coupling of multiple random variables (Angel & Spinka, 2019).

The rest of this paper is organized as follows. We first explain, via two negative results (Sects. 3.1 and 3.2), why proving such generalized metric properties is difficult. We present our main results on the generalized metric properties of this family of pairwise MMOTs in Sect. 3.3. We present a proof for a simple case to illustrate the key ideas of the main proof in Sect. 4. Finally, with various numerical examples, we show that an MMOT from our family, which defines an n-metric, improves the task of clustering graphs.

2 Definitions and setup

Lists We write \(s_1,\ldots ,s_k\) as \(s_{1:k}\), \({\Omega }^{1},\ldots ,{\Omega }^{k}\) as \({\Omega }^{1:k}\), and \(A_{s_1,\ldots ,s_k}\) as \(A_{s_{1:k}}\). Note that \(A_{s_{1}:s_{k}}\) differs from \(A_{s_{1:k}}\): assuming \(s_k > s_1\), we have \(A_{s_{1}:s_{k}} \equiv A_{s_{1}},A_{s_{1}+1},A_{s_{1}+2},\ldots ,A_{s_{k}}\). The expression 1 : i has no meaning by itself; in particular, it does not denote \(1,\ldots ,i\). For \(i\in \mathbb {N}\), we let \([i] \triangleq \{1,\ldots ,i\}\).

Generalized inner-product Given two equi-multidimensional arrays A and B, and \(\ell \in \mathbb {N}\), we define

$$\begin{aligned} \left\langle A,B \right\rangle _{\ell } \triangleq \sum _{s_{1:k}} (A_{s_{1:k}})^\ell B_{s_{1:k}} = \sum ^{{\vert \Omega ^{1}\vert }}_{s_1=1}\dots \sum ^{{\vert \Omega ^{k}\vert }}_{s_k=1}(A_{s_1,\dots ,s_k})^\ell B_{s_1,\dots ,s_k}, \end{aligned}$$

where \((\cdot )^\ell\) is element-wise \(\ell\)th power. Later on, A will be an array of distances and B will be an array of probabilities.
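For concreteness, a minimal NumPy sketch of this inner product follows; the example arrays are hypothetical.

```python
import numpy as np

def gen_inner(A, B, ell):
    """Generalized inner product <A, B>_ell of two equi-multidimensional
    arrays: sum over all indices of (A ** ell) * B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    assert A.shape == B.shape, "arrays must be equi-multidimensional"
    return float(np.sum((A ** ell) * B))

# Example: A holds pairwise distances, B a joint probability mass function.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.5, 0.0], [0.0, 0.5]])
print(gen_inner(A, B, 2))  # 0.0: all mass sits where the distance is 0
```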

Probability spaces To facilitate exposition, we state our main contributions for probability spaces with a finite sample space contained in \({\Omega }\), an event \(\sigma\)-algebra equal to the power set of the sample space, and a probability measure described by a probability mass function. We refer to probability mass functions using bold letters, e.g. \({\varvec{p}}\), \({\varvec{q}}\), \({\varvec{r}}\), etc.

When talking about \(n\) probability spaces, the ith space has sample space \({\Omega }^{i} = \{{\Omega }^{i}_{1:{\vert \Omega ^{i}\vert }}\} \subseteq {\Omega }\), an event space \(2^{{\Omega }^{i}}\), and a probability mass function \({\varvec{p}}^{i}\), or \({\varvec{q}}^{i}\), or \({\varvec{r}}^{i}\), etc. Variable \({\vert \Omega ^{i}\vert }\) is the number of atoms in \({\Omega }^{i}\). Symbol \({\varvec{p}}^{i}_{s}\) denotes the probability of the atomic event \(\{{\Omega }^{i}_{s}\}\). Without loss of generality (w.l.o.g.) we assume \({\varvec{p}}^{i}_{s} > 0, \forall i\in [n], \forall s\in [{\vert \Omega ^{i}\vert }]\). Our notation assumes that atoms can be indexed, but our results extend beyond this assumption. W.l.o.g., we assume that \({\Omega }^{i}_{s} = {\Omega }^{i}_{t}\) if and only if \(s = t\).

Symbol \({\varvec{p}}^{i_{1:k}}\) denotes a mass function for the probability space with sample space \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}\) and event space \(2^{{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}}\). In particular, \({\varvec{p}}^{i_{1:k}}_{s_{1:k}}\) (i.e. \({\varvec{p}}^{i_{1},\ldots ,i_k}_{s_1,\ldots ,s_{k}}\)) is the probability of the atomic event \(\{({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_k}_{s_k})\}\). We use \({\varvec{p}}^{i_{1:k} \mid j_{1:r}}\) to denote a probability mass function for the probability space with sample space \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}\), and event space \(2^{{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}}\), such that \({\varvec{p}}^{i_{1:k} \mid j_{1:r}}_{s_{1:k} \mid t_{1:r}} \triangleq {{\varvec{p}}^{i_1,\ldots ,i_k , j_1,\ldots ,j_r}_{s_1,\ldots ,s_k , t_1,\ldots ,t_r} }/{{\varvec{p}}^{j_1,\ldots ,j_r}_{ t_1,\ldots ,t_r} }\), i.e. a conditional probability.

Definition 1

(Gluing map) Consider a mass function \({\varvec{q}}^{k}\) over \({\Omega }^{k}\) and \(n-1\) conditional mass functions \(\{{\varvec{q}}^{i \mid k}\}_{i \in [n]\backslash \{k\}}\) over \(\{{\Omega }^{i}\}_{i \in [n]\backslash \{k\}}\). The map \({\mathcal {G}}\), called a gluing map, defines the mass function \({\varvec{p}}\) over \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\) as

$$\begin{aligned} {\varvec{p}}&={{\mathcal {G}}\left( {\varvec{q}}^{k},\{{\varvec{q}}^{i \mid k}\}_{i \in [n]\backslash \{k\}} \right) } . \end{aligned}$$
(3)

To be more specific, \({\varvec{p}}^{}_{s_1,\ldots ,s_n} = {\varvec{q}}^{k}_{s_k} \prod _{i \in [n]\backslash \{k\}} {\varvec{q}}^{i\mid k}_{s_i }.\)
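A minimal finite-space sketch of the gluing map follows (indices are 0-based; the conditional mass functions are stored as matrices whose columns sum to one; all names are illustrative).

```python
import numpy as np

def gluing_map(q_k, conditionals, k):
    """Gluing map of Definition 1 (finite-space sketch).

    q_k:          1-D array, mass function over Omega^k.
    conditionals: dict mapping i (i != k) to a matrix C_i with
                  C_i[s_i, s_k] = q^{i|k}_{s_i | s_k}.
    Returns the joint mass p with
    p[s_1, ..., s_n] = q_k[s_k] * prod_{i != k} C_i[s_i, s_k].
    """
    n = len(conditionals) + 1
    sizes = [len(q_k) if i == k else conditionals[i].shape[0] for i in range(n)]
    p = np.zeros(sizes)
    for idx in np.ndindex(*sizes):
        val = q_k[idx[k]]
        for i in range(n):
            if i != k:
                val *= conditionals[i][idx[i], idx[k]]
        p[idx] = val
    return p
```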

Distances and metrics We use “distance” to refer to an object that, depending on extra assumptions, might, or might not, have the properties of a (generalized) metric. For a metric we use the standard definition, and for generalized metrics we use the definitions in Kiss et al. (2018).

The definition in Kiss et al. (2018) is the same as ours but is expressed in a slightly different form. In particular, the C(n) in our Definition 3 corresponds to \(1/K_n\) in Kiss et al. (2018). Also, Kiss et al. (2018) express their generalized triangle inequality by replacing the rth variable inside a multi-distance with a new variable, while in our definition we remove the rth variable and append a new variable at the end of the argument's list. Given that both definitions also require permutation invariance of the arguments to the multi-distance, the two definitions are equivalent.

Finally, our notions of metric and generalized metric are more general than usual in the sense that they support the use of different distance functions depending on the spaces from where we are drawing elements. This grants an extra layer of generality to our results.

Definition 2

(Metric) Let \(\{{d}^{i,j}\}_{i,j}\) be a set of distances of the form \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) and \({d}^{i,j}({\Omega }^{i}_{s},{\Omega }^{j}_{t}) \triangleq {{d}^{i,j}_{s,t}}\). We say that \({d}^{}\) is a metric when, for any \(i,j,k\), and any \(s\in [{\vert \Omega ^{{i}}\vert }], r\in [{\vert \Omega ^{{j}}\vert }], t\in [{\vert \Omega ^{{k}}\vert }]\), we have: (1) \({d}^{i,j}_{s,r} \ge 0\); (2) \({d}^{i,j}_{s,r} = {d}^{j,i}_{r,s}\); (3) \({d}^{i,j}_{s,r} = 0 \text { iff } {\Omega }^{i}_{s} = {\Omega }^{j}_{r}\); (4) \({d}^{i,j}_{s,r} \le {d}^{i,k}_{s,t} + {d}^{k,j}_{t,r}\).

Definition 3

(Generalized metric) Let \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) be a set of distances of the form \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n} \mapsto \mathbb {R}\) and \({d}^{i_{1:n}}({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_n}_{s_n}) \triangleq {{d}^{i_{1:n}}_{s_{1:n}}}\). We say that \({d}\) is an \((n,C(n))\)-metric when, for any \(i_{1:n+1}\) and \(s_{1:n+1}\) with \(s_r \in [{\vert \Omega ^{i_r}\vert }]\) for all \(r\in [n+1]\), we have: (1) \({{d}^{i_{1:n}}_{s_{1:n}}} \ge 0\); (2) \({{d}^{i_{1:n}}_{s_{1:n}}} = {{d}^{\sigma (i_{1:n})}_{\sigma (s_{1:n})}}\) for any permutation \(\sigma\); (3) \({{d}^{i_{1:n}}_{s_{1:n}}} = 0 \text { iff } {\Omega }^{i_r}_{s_r} = {\Omega }^{i_t}_{s_t},\; \forall r,t\in [n]\); (4) \(C(n){{d}^{i_{1:n}}_{s_{1:n}}} \le \sum ^{n}_{r=1} {{d}^{i_1,\ldots ,i_{r-1},i_{r+1},\ldots ,i_{n+1}}_{s_1,\ldots ,s_{r-1},s_{r+1},\ldots ,s_{n+1}}}\).

Definition 4

(Generalized metric on distributions) Let \({{\mathcal {W}}^{}}\) be a map from \(n\) probability spaces to \(\mathbb {R}\) such that \({{\mathcal {W}}^{i_{1:n}}} \triangleq {\mathcal {W}}({\varvec{p}}^{i_{1:n}})\) is the image of the probability spaces with indices \(i_{1:n}\). For any \(n+1\) probability spaces with sample spaces \({\Omega }^{1:n+1}\) and masses \({\varvec{p}}^{1:n+1}\), and any permutation \(\sigma\), \({{\mathcal {W}}^{}}\) is an \((n,C(n))\)-metric if: (1) \({\mathcal {W}}^{1,\ldots ,n} \ge 0\); (2) \({\mathcal {W}}^{1,\ldots ,n} = 0 \text { iff } {\varvec{p}}^{i} = {\varvec{p}}^{j}\) and \({\Omega }^{i}={\Omega }^{j}, \ \forall i,j \in [n]\); (3) \({\mathcal {W}}^{1,\ldots ,n} = {\mathcal {W}}^{\sigma (1,\ldots ,n)}\); (4) \(C(n) {\mathcal {W}}^{1,\ldots ,n} \le \sum \limits ^{n}_{r = 1} {\mathcal {W}}^{1,\ldots ,r-1,r+1,\ldots ,n+1}\).

Remark 1

Equalities \({\varvec{p}}^{i} = {\varvec{p}}^{j}\) and \({\Omega }^{i}={\Omega }^{j}\), mean that \({\vert \Omega ^{i}\vert } = {\vert \Omega ^{j}\vert }\), and that there exists a bijection \(b^{i,j}(\cdot )\) from \([{\vert \Omega ^{i}\vert }]\) to \([{\vert \Omega ^{j}\vert }]\) such that \({\varvec{p}}^{i}_{s} = {\varvec{p}}^{j}_{b^{i,j}(s)}\) and \({\Omega }^{i}_{s}={\Omega }^{j}_{b^{i,j}(s)}\), \(\forall\) \(s \in [{\vert \Omega ^{i}\vert }]\).

Remark 2

The inequality in property (4) of Definition 4 is the generalized triangle inequality. Figure 1 depicts a geometric analogy for the generalized triangle inequality. Just like the measure (length) of one of the sides of a triangle is always smaller than the sum of the measures of the two other sides, the measure (area/volume/etc.) of one of the facets of a simplex is always smaller than the sum of the measures of the other facets. The constant C(n) is included in our definition so that we can later prove that an inequality stronger than the one with \(C(n) =1\) holds.

Fig. 1 Geometric analog of the generalized triangle inequality: the total area of any three faces of a tetrahedron is greater than that of the fourth face

We abbreviate \((n,1)\)-metric by \(n\)-metric. In our setup, the \(\inf\) in (2) is always attained (recall that the sample spaces are finite) and, for \(\ell = 1\), amounts to solving an LP (see Remark 3). We refer to the minimizing distributions by \({\varvec{p}}^{*}, {\varvec{q}}^{*}, {\varvec{r}}^{*}\), etc. We define the following map from \(n\) probability spaces to \(\mathbb {R}\). The definition below amounts to (2) when the \({\varvec{p}}\)'s are empirical measures.

Definition 5

(MMOT distance for finite spaces) Let \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) be a set of distances of the form \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\mapsto \mathbb {R}\) and \({d}^{i_{1:n}}({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_n}_{s_n}) \triangleq {{d}^{i_{1:n}}_{s_{1:n}}}\). The MMOT distance associated with \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) for \(n\) probability spaces with masses \({\varvec{p}}^{i_{1:n}}\) over \({\Omega }^{i_{1:n}}\) is

$$\begin{aligned} {\mathcal {W}}({\varvec{p}}^{i_{1:n}}) \triangleq {\mathcal {W}}^{i_{1:n}} = \min _{{\varvec{r}}^{}:{\varvec{r}}^{i_s} = {\varvec{p}}^{i_s} \forall s \in [n]} \left\langle {d}^{i_{1:n}},{\varvec{r}} \right\rangle _{\ell }^{\frac{1}{\ell }}, \end{aligned}$$
(4)

where \({\varvec{r}}^{}\) is a mass function over \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\), and \({\varvec{r}}^{i_s}\) is the marginal probability of \({\varvec{r}}\) on \({\Omega }^{i_s}\).

Remark 3

Solving (4) amounts to solving a linear program when \(\ell = 1\), and to solving a convex optimization problem when \(\ell >1\) and d is non-negative. Solving a linear program can be done in polynomial time in the number of variables (Karmarkar, 1984). However, for general MMOT, the number of variables is exponential in the number of marginals n and their support sizes k, which makes the overall run time \(\text {poly}(k^n)\). Even approximately solving MMOT is NP-hard in some general instances. However, under certain structural assumptions on the input costs/distances, one can get tractable algorithms (Altschuler & Boix-Adsera, 2021; Lin et al., 2022).
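As a concrete illustration of Remark 3, the sketch below solves (4) for \(\ell = 1\) with SciPy's LP solver. It builds one equality constraint per marginal atom and is only practical for small instances, since the number of variables grows as \(k^n\); it is not the implementation used in our experiments (which uses CVX).

```python
import numpy as np
from scipy.optimize import linprog

def mmot_lp(d, marginals):
    """Solve the finite-space MMOT problem (4) with ell = 1 as an LP.

    d:         n-dimensional cost array d[s_1, ..., s_n].
    marginals: list of n 1-D arrays p^1, ..., p^n, each summing to 1.
    Returns (optimal value, optimal coupling r reshaped like d).
    """
    shape = d.shape
    A_eq, b_eq = [], []
    for i, p_i in enumerate(marginals):
        for s in range(shape[i]):
            mask = np.zeros(shape)
            idx = [slice(None)] * len(shape)
            idx[i] = s                      # select entries with i-th index = s
            mask[tuple(idx)] = 1.0
            A_eq.append(mask.ravel())
            b_eq.append(p_i[s])
    res = linprog(d.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * d.size, method="highs")
    return res.fun, res.x.reshape(shape)
```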

3 Main results

To prove that MMOT leads to an \(n\)-metric, it is natural to extend the ideas in the classical proof that the WD is a metric. The hardest property to prove is the triangle inequality. Its proof for the WD follows from a gluing lemma and the assumption that \({d}\) itself is a metric (Definition 2). Our hope is that we can prove the triangle inequality for MMOT if we (a) prove a generalized gluing lemma and (b) assume that \({d}\) is a generalized metric. Unfortunately, as we explain in Sects. 3.1 and 3.2, (a) is not possible, and (b) is not sufficient. Proving our results therefore requires developing completely new proofs.

3.1 The gluing lemma does not generalize to higher dimensions

The gluing lemma used to prove that WD is a metric is as follows. For its proof see Ambrosio and Gigli (2013), Lemma 2.1.

Lemma 1

(Gluing lemma) Let \({\varvec{p}}^{1,3}\) and \({\varvec{p}}^{2,3}\) be arbitrary mass functions for \({\Omega }^{1}\times {\Omega }^{3}\) and \({\Omega }^{2}\times {\Omega }^{3}\), respectively, with the same marginal, \({\varvec{p}}^{3}\), over \({\Omega }^{3}\). There exists a mass function \({\varvec{r}}^{1,2,3}\) for \({\Omega }^{1}\times {\Omega }^{2}\times {\Omega }^{3}\) whose marginals over \({\Omega }^{1}\times {\Omega }^{3}\) and \({\Omega }^{2}\times {\Omega }^{3}\) equal \({\varvec{p}}^{1,3}\) and \({\varvec{p}}^{2,3}\) respectively.

The way Lemma 1 is used to prove WD’s triangle inequality is as follows. Assume \({d}\) is a metric (Definition 2). Let \(\ell = 1\) for simplicity. Let \({{{\varvec{p}}}^{*}}^{1,2}\), \({{{\varvec{p}}}^{*}}^{1,3}\), and \({{{\varvec{p}}}^{*}}^{2,3}\) be optimal transports such that \({\mathcal {W}}^{1,2} = \left\langle {{{\varvec{p}}}^{*}}^{1,2},{d}^{1,2} \right\rangle\), \({\mathcal {W}}^{1,3} =\left\langle {{{\varvec{p}}}^{*}}^{1,3},{d}^{1,3} \right\rangle\), and \({\mathcal {W}}^{2,3} =\left\langle {{{\varvec{p}}}^{*}}^{2,3},{d}^{2,3} \right\rangle\). Define \({\varvec{r}}^{1,2,3}\) as in Lemma 1, and let \({\varvec{r}}^{1,3}\), \({\varvec{r}}^{2,3}\), and \({\varvec{r}}^{1,2}\) be its bivariate marginals. We then have

$$\begin{aligned}&\left\langle {{{\varvec{p}}}^{*}}^{1,2},{d}^{1,2} \right\rangle {\mathop {\le }\limits ^{\text {sub-optimal } {\varvec{r}}}} \left\langle {\varvec{r}}^{1,2},{d}^{1,2} \right\rangle =\sum _{s,t} {\varvec{r}}^{1,2}_{s,t}{{d}^{1,2}_{s,t}} = \sum _{s,t,l} {\varvec{r}}^{1,2,3}_{s,t,l}{{d}^{1,2}_{s,t}} \\&{\mathop {\le }\limits ^{\text {{ d} is metric}}} \sum _{s,t,l} {\varvec{r}}^{1,2,3}_{s,t,l}({{d}^{1,3}_{s,l}} + {{d}^{2,3}_{t,l}}) = \left\langle {\varvec{r}}^{1,3},{d}^{1,3} \right\rangle + \left\langle {\varvec{r}}^{2,3},{d}^{2,3} \right\rangle \\&{\mathop {=}\limits ^{\text {Lemma 1}}}\left\langle {{{\varvec{p}}}^{*}}^{1,3},{d}^{1,3} \right\rangle + \left\langle {{{\varvec{p}}}^{*}}^{2,3},{d}^{2,3} \right\rangle . \end{aligned}$$

Our first roadblock is that Lemma 1 does not generalize to higher dimensions. For simplicity, we now omit the sample spaces on which mass functions are defined. When the mass functions in a set have equal marginals over the same sample sub-spaces, we say they are compatible.

Theorem 1

(No gluing) There exist mass functions \({\varvec{p}}^{1,2,4}\), \({\varvec{p}}^{1,3,4}\), and \({\varvec{p}}^{2,3,4}\) with compatible marginals such that there is no mass function \({\varvec{r}}^{1,2,3,4}\) compatible with them.

Proof

If this were not the case then, by computing marginals of the mass functions in the theorem's statement, it would be true that, given arbitrary mass functions \({\varvec{p}}^{1,2}\), \({\varvec{p}}^{1,3}\), and \({\varvec{p}}^{2,3}\) with compatible univariate marginals, we should be able to find \({\varvec{r}}^{1,2,3}\) whose bivariate marginals equal these three mass functions. But this is not the case. A counterexample, taken from Remark 3 in Haasler et al. (2021), is the following. Let

$$\begin{aligned} {\varvec{p}}^{1,2} = {\varvec{p}}^{1,3} = \frac{1}{2}\begin{bmatrix} 1 &{} 0\\ 0 &{} 1 \end{bmatrix}\quad \text {and}\quad {\varvec{p}}^{2,3}= \frac{1}{2}\begin{bmatrix} 0 &{} 1\\ 1 &{} 0 \end{bmatrix}. \end{aligned}$$

These bivariate mass functions have compatible univariate marginals, namely, \({\varvec{p}}^{1} = {\varvec{p}}^{2} = {\varvec{p}}^{3} = [1, 1]/2\). Yet, the following system of equations over \(\{{\varvec{r}}^{1,2,3}_{i,j,k}\}_{i,j,k \in [2]}\) is easily checked to be infeasible

$$\begin{aligned}&\left( {\varvec{r}}^{1,2,3}_{i,j,k} \ge 0 \; \forall i,j,k\right) \wedge \Big (\sum _{i}{\varvec{r}}^{1,2,3}_{i,j,k} = {\varvec{p}}^{2,3}_{j,k} \; \forall j,k \Big ) \wedge \Big ( \sum _{j}{\varvec{r}}^{1,2,3}_{i,j,k} = {\varvec{p}}^{1,3}_{i,k} \;\forall i,k \Big )\\&\wedge \Big ( \sum _{k}{\varvec{r}}^{1,2,3}_{i,j,k} = {\varvec{p}}^{1,2}_{i,j} \;\forall i,j \Big ). \end{aligned}$$

\(\square\)
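The infeasibility claimed in the proof can also be confirmed numerically; a quick sanity check (assuming SciPy is available) is sketched below.

```python
import numpy as np
from scipy.optimize import linprog

# The counterexample of Theorem 1: look for r^{1,2,3} >= 0 whose three
# bivariate marginals equal p12, p13, p23.  The LP turns out to be infeasible.
p12 = p13 = np.array([[0.5, 0.0], [0.0, 0.5]])
p23 = np.array([[0.0, 0.5], [0.5, 0.0]])

A_eq, b_eq = [], []
for axis, target in [(2, p12), (1, p13), (0, p23)]:   # axis that is summed out
    for a in range(2):
        for b in range(2):
            mask = np.zeros((2, 2, 2))
            idx = [a, b]
            idx.insert(axis, slice(None))
            mask[tuple(idx)] = 1.0
            A_eq.append(mask.ravel())
            b_eq.append(target[a, b])

res = linprog(np.zeros(8), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * 8, method="highs")
print(res.success)  # False: no compatible joint mass function exists
```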

Remark 4

It might be possible to obtain a generalized gluing lemma if, in addition to requiring consistency of the univariate marginal distributions, we also require consistency of the \((n-1)\)-variate marginal distributions. We leave studying properties of MMOTs defined with additional k-variate marginal consistency constraints as future work.

3.2 Cost d being an \(n\)-metric is not a sufficient condition for MMOT to be an \(n\)-metric

Theorem 1 tells us that, even if we assume that \({d}\) is an \(n\)-metric, we cannot adapt the classical proof showing that the WD is a metric so that MMOT leads to an \(n\)-metric. The question remains, however, whether such a proof exists at all under the sole assumption that \({d}\) is an \(n\)-metric. Theorem 2 settles this question in the negative.

Theorem 2

Let \({\mathcal {W}}\) be the MMOT distance of Definition 5 with \(\ell =1\). There exist a sample space \({\Omega }^{}\), mass functions \({\varvec{p}}^{1}\), \({\varvec{p}}^{2}\), \({\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\) over \({\Omega }^{}\), and a distance \({d}:{\Omega }^{}\times {\Omega }^{}\times {\Omega }^{} \mapsto \mathbb {R}\) such that \({d}\) is an \(n\)-metric (\(n= 3\)), but

$$\begin{aligned} {\mathcal {W}}^{1,2,3} > {\mathcal {W}}^{1,2,4} + {\mathcal {W}}^{1,3,4} + {\mathcal {W}}^{2,3,4}. \end{aligned}$$
(5)

Proof

Let \({\Omega }\) be the six points in Fig. 2, where we assume that \(0 < \epsilon \ll 1\), and hence that there are no three co-linear points, and no two equal points. Let \({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\) be as in Fig. 2; each is represented by a unique color and is uniformly distributed over the points of the same color. Given any \(x,y,z \in {\Omega }\), let \({d}(x,y,z) = 0\) if all three points are equal, let \({d}(x,y,z) = \gamma\) if exactly two points are equal, and let \({d}(x,y,z)\) be the area of the corresponding triangle otherwise, where \(\gamma\) lower bounds the area of the triangle formed by any three non-co-linear points, e.g. \(\gamma = \epsilon / 4\). A few geometric considerations (see Appendix 1) show that \({d}\) is an \(n\)-metric (\(n=3, C(n) = 1\)) and that (5) holds as \(\frac{1}{2} > \frac{1}{8} + \left( \frac{1}{8} + \frac{\epsilon }{4} \right) + \left( \frac{1}{8} + \frac{\epsilon }{4}\right)\). \(\square\)

Fig. 2 Sample space \({\Omega }\), mass functions \(\{{\varvec{p}}^{i}\}^4_{i=1}\), and cost function \({d}\) that lead to the violation (5). The red triangle is \({\varvec{p}}^1\), the blue star is \({\varvec{p}}^2\), the green squares are \({\varvec{p}}^3\), and the yellow circles are \({\varvec{p}}^4\)

Remark 5

Theorem 2 can be generalized to spaces of dimension \(> 2\), to \(n> 3\), and to \(\ell > 1\).

3.3 A new family of pairwise MMOTs that are generalized metrics

We will prove that the properties of a generalized metric (Definition 4) hold for the novel family of MMOT distances defined as follows.

Definition 6

(Pairwise MMOT distance) Let \(\{{d}^{i,j}\}_{i,j}\) be a set of distances of the form \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) and \({d}^{i,j}({\Omega }^{i}_{s},{\Omega }^{j}_{t}) \triangleq {{d}^{i,j}_{s,t}}\). The Pairwise MMOT distance associated with \(\{{d}^{i,j}\}_{i,j}\) for \(n\) probability spaces with masses \({\varvec{p}}^{i_{1:n}}\) over \({\Omega }^{i_{1:n}}\) is \({\mathcal {W}}({\varvec{p}}^{i_{1:n}}) \triangleq {\mathcal {W}}^{i_{1:n}}\) with

$$\begin{aligned} {\mathcal {W}}^{i_{1:n}} = \min _{\begin{array}{c} {\varvec{r}}^{}:{\varvec{r}}^{i_s} = {\varvec{p}}^{i_s}\\ \forall s \in [n] \end{array}} \sum _{ 1 \le s < t \le n} \left\langle {d}^{i_s,i_t},{\varvec{r}}^{i_s,i_t} \right\rangle _{\ell }^{\frac{1}{\ell }}, \end{aligned}$$
(6)

where \({\varvec{r}}\) is a mass over \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\), with marginals \({\varvec{r}}^{i_s}\) over \({\Omega }^{i_s}\) and \({\varvec{r}}^{i_s,i_t}\) over \({\Omega }^{i_s}\times {\Omega }^{i_t}\).

Remark 6

Note that each choice of \(\{{d}^{i,j}\}_{i,j}\) results in a different MMOT. Our results in this section hold for the whole family of such MMOTs.

Remark 7

Swapping \(\min\) and \(\sum\) gives a new definition \({\mathcal {W}}^{i_{1:n}}_{\text {pairs}} = \sum _{1 \le s<t\le n} {\mathcal {W}}^{i_s,i_t}\), where \({\mathcal {W}}^{i_s,i_t}\) is the WD between \({\Omega }^{i_s}\) and \({\Omega }^{i_t}\). This is trivially an \(n\)-metric (cf. (Kiss et al., 2018)) but is different from eq. (6). In particular, it does not provide a joint optimal transport, which is important to many applications such as color transfer (Strössner & Kressner, 2022), tomographic reconstruction (Abraham et al., 2017), and robust localization and sensor fusion (Elvander et al., 2020).

If \(n= 2\), Definition 6 reduces to the Wasserstein distance. Our definition is a special case of the Kantorovich formulation for the general MMOT problem discussed in Pass (2015). Furthermore, if \(\ell = 1\), we can get pairwise MMOT (Definition 6) from general MMOT (Definition 5), by defining \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n} \mapsto \mathbb {R}\) such that

$$\begin{aligned} {d}^{i_{1:n}}(w^{1:n}) = \left( \sum _{1\le s<t\le n}({d}^{i_s,i_t}(w^s,w^t))^{\ell }\right) ^{\frac{1}{\ell }}, \end{aligned}$$
(7)

for some set of distances \(\{{d}^{i,j}\}_{i,j}\). However, if \(\ell > 1\), choosing \({d}^{i_{1:n}}\) as in eq. (7) in Definition 5 does not lead to Definition 6, but rather to a lower bound of it (i.e. Definition 6 upper bounds the resulting quantity). Indeed, with this choice we get

$$\begin{aligned}&\min _{{ \varvec{r}}:{ \varvec{r}}^{i_s} = { \varvec{p}}^{i_s},\; \forall s \in [n]} \langle d^{i_{1:n}} , { \varvec{r}}\rangle ^{1/\ell }_{\ell } = \end{aligned}$$
(8)
$$\begin{aligned}&\min _{{ \varvec{r}}:{ \varvec{r}}^{i_s}={ \varvec{p}}^{i_s},\; \forall s \in [n]} \left( \sum _{w^{1:n}} \sum _{1 \le s < t \le n } (d^{i_s,i_t}(w^s , w^t))^{\ell } { \varvec{r}}(w^{1:n})\right) ^{1/\ell }= \end{aligned}$$
(9)
$$\begin{aligned}&\min _{{ \varvec{r}}:{ \varvec{r}}^{i_s}={ \varvec{p}}^{i_s},\; \forall s \in [n]} \left( \sum _{1 \le s < t \le n } \sum _{w^{1:n}} (d^{i_s,i_t}(w^s , w^t))^{\ell } { \varvec{r}}(w^{1:n})\right) ^{1/\ell }= \end{aligned}$$
(10)
$$\begin{aligned}&\min _{{ \varvec{r}}:{ \varvec{r}}^{i_s}={ \varvec{p}}^{i_s} \forall s \in [n]} \left( \sum _{1 \le s < t \le n } \sum _{w^{s},w^{t}} (d^{i_s,i_t}(w^s , w^t))^{\ell } { \varvec{r}}^{i_s,i_t}(w^{s},w^{t})\right) ^{1/ \ell }= \end{aligned}$$
(11)
$$\begin{aligned}&\min _{{ \varvec{r}}:{ \varvec{r}}^{i_s}={ \varvec{p}}^{i_s} \forall s \in [n]} \left( \sum _{1 \le s < t \le n } \langle d^{i_s,i_t} , { \varvec{r}}^{i_s,i_t}\rangle _{\ell } \right) ^{1/\ell } \end{aligned}$$
(12)
$$\begin{aligned}&\le \min _{{ \varvec{r}}:{ \varvec{r}}^{i_s}={ \varvec{p}}^{i_s} \forall s \in [n]} \sum _{1 \le s < t \le n } \langle d^{i_s,i_t} , { \varvec{r}}^{i_s,i_t}\rangle ^{1/\ell }_{\ell }, \end{aligned}$$
(13)

where the last inequality follows from Hölder’s inequality.
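For \(\ell = 1\), where (as noted before eq. (7)) the pairwise MMOT coincides with Definition 5 applied to the aggregated cost, the distance in Definition 6 can be computed by a single linear program. A minimal SciPy sketch (small instances only, \(\ell = 1\)) follows; the cost matrices and marginals are assumed given, and this is not the CVX-based implementation used in our experiments.

```python
import numpy as np
from scipy.optimize import linprog

def pairwise_mmot_lp(d_pairs, marginals):
    """Pairwise MMOT of Definition 6 with ell = 1, solved as an LP.

    d_pairs:   dict mapping (s, t), s < t, to the cost matrix d^{i_s, i_t}.
    marginals: list of n 1-D arrays p^{i_1}, ..., p^{i_n}.
    """
    sizes = [len(p) for p in marginals]
    n = len(sizes)
    # Aggregated cost of eq. (7) with ell = 1: sum of the pairwise costs.
    D = np.zeros(sizes)
    for (s, t), d_st in d_pairs.items():
        shape = [1] * n
        shape[s], shape[t] = sizes[s], sizes[t]
        D = D + d_st.reshape(shape)           # broadcast over the other axes
    # One equality constraint per (space, atom) to enforce the marginals.
    A_eq, b_eq = [], []
    for i, p_i in enumerate(marginals):
        for s in range(sizes[i]):
            mask = np.zeros(sizes)
            idx = [slice(None)] * n
            idx[i] = s
            mask[tuple(idx)] = 1.0
            A_eq.append(mask.ravel())
            b_eq.append(p_i[s])
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * D.size, method="highs")
    return res.fun, res.x.reshape(sizes)
```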

It is easy to prove that if \(\{{d}^{i,j}\}_{i,j}\) is a metric (Definition 2), then the cost \({d}^{i_{1:n}}\) in (7) is an \(n\)-metric (Definition 3). However, because of Theorem 2, we know that this is not sufficient to guarantee that the pairwise MMOT distance is an \(n\)-metric, which only makes the proof of the next theorem all the more interesting.

Theorem 3

If \({d}\) is a metric (Definition 2), then the pairwise MMOT distance (Definition 6) associated with \({d}\) is an \((n,C(n))\)-metric, with \(C(n) \ge 1\).

Proof

The proof is presented in Appendix 1. \(\square\)

We currently do not know the most general conditions under which the MMOT distance of Definition 5 is an \(n\)-metric. However, working with Definition 6 allows us to sharply bound the best possible C(n), which would likely not be possible in a general setting. As Theorem 4 shows, the best C(n) is \(C(n) = \Theta (n)\).

Theorem 4

In Theorem 3, the constant \(C(n)\) can be made larger than \((n-1)/5\) for \(n> 7\), and there exist sample spaces \({\Omega }^{1:n}\), mass functions \({\varvec{p}}^{1:n}\), and a metric d over \({\Omega }^{1:n}\) such that \(C(n) \le n-1\).

Proof

The proof is presented in Appendix 4. \(\square\)

Remark 8

Note that if \({\Omega }^{i} = {\Omega },\ \forall \ i\) and \({d}: {\Omega }\times \dots \times {\Omega }\mapsto \mathbb {R}\) is such that \({d}(w^{1:n}) = \min _{w \in {\Omega }}\sum _{s \in [n]} {d}^{1,2}(w^s,w)\), where \({d}^{1,2}\) is a metric, then \({d}\) is an \(n\)-metric (Kiss et al., 2018). One can then prove, see Carlier and Ekeland (2010), that the MMOT distance of Definition 5 with this cost is equivalent to \({{\mathcal {W}}^{}}({\varvec{p}}^{{1:n}}) =\min _{{\varvec{p}}} \sum _{s\in [n]}{{\mathcal {W}}^{}}({\varvec{p}}^{s},{\varvec{p}})\), which is also called the Wasserstein barycenter distance (WBD) (Agueh, 2011). The latter definition makes \({{\mathcal {W}}^{}}({\varvec{p}}^{{1:n}})\) a Fermat n-distance as defined in Kiss et al. (2018, Proposition 3.1), from which it follows immediately via this same proposition that it is an n-metric with \(C(n) = \Theta (n)\). The pairwise MMOT is not a Fermat distance, and Theorems 3 and 4 do not follow from Kiss et al. (2018). Hence, a novel proof strategy is required.

In the next section, we give a self-contained proof that the generalized triangle inequality holds with \(C(n) = 1\) for \(n= 3\) when \({d}\) is a metric. This proof contains the key ideas behind the proof of the triangle inequality in Theorems 3 and 4. Proving the generalized triangle inequality for general \(n\), and a large \(C(n)\), is the hardest part of these theorems, compared to proving the other properties required of generalized metrics.

4 Proof of the generalized triangle inequality for \(n= 3\), \(\ell =1\), and \(C(n) = 1\)

We will prove that for any mass functions \({\varvec{p}}^{1},\dots ,{\varvec{p}}^{4}\) over \({\Omega }^{1},\dots ,{\Omega }^{4}\), respectively, if \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) is a metric for any \(i,j\in \{1,\dots ,4\}, i \ne j\), then \({\mathcal {W}}^{1,2,3} \le {\mathcal {W}}^{1,2,4} + {\mathcal {W}}^{1,3,4} + {\mathcal {W}}^{2,3,4}\).

We write this inequality more succinctly as

$$\begin{aligned} {\mathcal {W}}^{1,2,3} \le {\mathcal {W}}^{\backslash 3} + {\mathcal {W}}^{\backslash 2} + {\mathcal {W}}^{\backslash 1}, \end{aligned}$$
(14)

where \({\mathcal {W}}^{\backslash r}\) denotes the pairwise MMOT distance among the four mass functions with \({\varvec{p}}^{r}\) omitted, e.g. \({\mathcal {W}}^{\backslash 3} = {\mathcal {W}}^{1,2,4}\). We begin by expanding all the terms in (14),

$$\begin{aligned}&{\mathcal {W}}^{1,2,3} = \left\langle {d}^{1,2},{{{\varvec{p}}}^{*}}^{1,2} \right\rangle + \left\langle {d}^{1,3},{{{\varvec{p}}}^{*}}^{1,3} \right\rangle +\left\langle {d}^{2,3},{{{\varvec{p}}}^{*}}^{2,3} \right\rangle , \end{aligned}$$

and,

$$\begin{aligned} {\mathcal {W}}^{\backslash 3} + {\mathcal {W}}^{\backslash 2} + {\mathcal {W}}^{\backslash 1}&= \left\langle {d}^{1,2},{{{\varvec{p}}}^{*}}^{(3)^{1,2}} \right\rangle + \left\langle {d}^{1,4},{{{\varvec{p}}}^{*}}^{(3)^{1,4}} \right\rangle +\left\langle {d}^{2,4},{{{\varvec{p}}}^{*}}^{(3)^{2,4}} \right\rangle \\&\quad +\left\langle {d}^{1,3},{{{\varvec{p}}}^{*}}^{(2)^{1,3}} \right\rangle + \left\langle {d}^{1,4},{{{\varvec{p}}}^{*}}^{(2)^{1,4}} \right\rangle +\left\langle {d}^{3,4},{{{\varvec{p}}}^{*}}^{(2)^{3,4}} \right\rangle \\&\quad + \left\langle {d}^{2,3},{{{\varvec{p}}}^{*}}^{(1)^{2,3}} \right\rangle + \left\langle {d}^{2,4},{{{\varvec{p}}}^{*}}^{(1)^{2,4}} \right\rangle +\left\langle {d}^{3,4},{{{\varvec{p}}}^{*}}^{(1)^{3,4}} \right\rangle , \end{aligned}$$

where \(\{{{{\varvec{p}}}^{*}}^{i,j}\}\) are the bivariate marginals of the optimal joint distribution \({{{\varvec{p}}}^{*}}^{1,2,3}\) for \({\mathcal {W}}^{1,2,3}\), and \(\{{{{\varvec{p}}}^{*}}^{(r)^{i,j}}\}\) are the bivariate marginals of the optimal joint distribution for \({\mathcal {W}}^{\backslash r}\).

Now we define the following probability mass function on \({\Omega }^{1}\times \dots \times {\Omega }^{4}\), namely \({\varvec{p}}^{1,2,3,4}\), such that

$$\begin{aligned} {\varvec{p}}^{1,2,3,4}_{s,t,l,u} = {\varvec{p}}^{4}_{u} \frac{{{{\varvec{p}}}^{*}}^{(3)^{1,4}}_{s,u}}{{\varvec{p}}^{4}_{u}}\frac{{{{\varvec{p}}}^{*}}^{(2)^{3,4}}_{l,u}}{{\varvec{p}}^{4}_{u}}\frac{{{{\varvec{p}}}^{*}}^{(1)^{2,4}}_{t,u}}{{\varvec{p}}^{4}_{u}}. \end{aligned}$$
(15)

Notice that this definition glues the three optimal couplings through \({\Omega }^{4}\): its bivariate marginals involving \({\Omega }^{4}\) match the corresponding optimal bivariate marginals of the terms on the r.h.s. of the generalized triangle inequality (see the computation after (16)). Recall that w.l.o.g. we assume that no element in \({\Omega }^{i}\) has zero mass, so the denominators are not zero. Since the bivariate probability mass functions \({\varvec{p}}^{1,2},{\varvec{p}}^{1,3}\), and \({\varvec{p}}^{2,3}\) of \({\varvec{p}}^{1,2,3,4}\) are feasible but possibly sub-optimal choices of minimizers in Eq. (6) in Definition 6, we have that

$$\begin{aligned} {\mathcal {W}}^{1,2,3}\le \left\langle {d}^{1,2},{\varvec{p}}^{1,2} \right\rangle + \left\langle {d}^{1,3},{\varvec{p}}^{1,3} \right\rangle +\left\langle {d}^{2,3},{\varvec{p}}^{2,3} \right\rangle . \end{aligned}$$
(16)
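For completeness, one can check directly that the construction (15) recovers the optimal couplings involving \({\Omega }^{4}\). For instance, summing (15) over the indices of \({\Omega }^{2}\) and \({\Omega }^{3}\) gives

$$\begin{aligned} {\varvec{p}}^{1,4}_{s,u} = \sum _{t,l} {\varvec{p}}^{1,2,3,4}_{s,t,l,u} = {{{\varvec{p}}}^{*}}^{(3)^{1,4}}_{s,u}\; \frac{\sum _{l}{{{\varvec{p}}}^{*}}^{(2)^{3,4}}_{l,u}}{{\varvec{p}}^{4}_{u}}\; \frac{\sum _{t}{{{\varvec{p}}}^{*}}^{(1)^{2,4}}_{t,u}}{{\varvec{p}}^{4}_{u}} = {{{\varvec{p}}}^{*}}^{(3)^{1,4}}_{s,u}, \end{aligned}$$

since each optimal coupling has marginal \({\varvec{p}}^{4}\) over \({\Omega }^{4}\); the computations for \({\varvec{p}}^{2,4}\) and \({\varvec{p}}^{3,4}\) are analogous. This is the property used below to replace marginals of \({\varvec{p}}^{1,2,3,4}\) by optimal couplings.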

It is convenient to introduce the more compact notation \(w_{i,j} = \left\langle {d}^{i,j},{\varvec{p}}^{i,j} \right\rangle\) and \(w^*_{i,j,(r)} = \left\langle {d}^{i,j},{{{\varvec{p}}}^{*}}^{(r)^{i,j}} \right\rangle\). Notice that, for any \(i,j,k\) and \(r\), we have \(w_{i,j}\le w_{i,k} + w_{j,k}\) and \(w^*_{i,j,(r)}\le w^*_{i,k,(r)} + w^*_{j,k,(r)}\). This follows directly from the assumption that \(\{{d}^{i,j}\}\) are metrics. Let us prove that \(w_{1,2}\le w_{1,4} + w_{2,4}\):

$$\begin{aligned}&w_{1,2} = \sum _{s,t} {{d}^{1,2}_{s,t}}{\varvec{p}}^{1,2}_{s,t}=\sum _{s,t,l} {{d}^{1,2}_{s,t}}{\varvec{p}}^{1,2,4}_{s,t,l}\le \sum _{s,t,l} ({{d}^{1,4}_{s,l}}+{{d}^{2,4}_{t,l}}){\varvec{p}}^{1,2,4}_{s,t,l}=w_{1,4}+w_{2,4}. \end{aligned}$$
(17)

Similarly, \(w_{1,3}\le w_{1,4} + w_{3,4}\), and \(w_{2,3}\le w_{2,4} + w_{3,4}\). Combining these inequalities and (16) we can write \({\mathcal {W}}^{1,2,3}\le w_{1,2}+ w_{1,3}+w_{2,3} \le (w_{1,4} + w_{2,4}) + (w_{1,4} + w_{3,4})+ (w_{2,4} + w_{3,4})\). Note that by construction, the bivariate marginals of \({\varvec{p}}^{1,2,3,4}\) in (15) satisfy \({\varvec{p}}^{1,4}={{{\varvec{p}}}^{*}}^{(3)^{1,4}}\), \({\varvec{p}}^{3,4}={{{\varvec{p}}}^{*}}^{(2)^{3,4}}\), and \({\varvec{p}}^{2,4}={{{\varvec{p}}}^{*}}^{(1)^{2,4}}\). Hence,

$$\begin{aligned}&{\mathcal {W}}^{1,2,3} \le (w^*_{1,4,(3)} + w^*_{2,4,(1)}) + (w^*_{1,4,(3)} + w^*_{3,4,(2)})+ (w^*_{2,4,(1)} + w^*_{3,4,(2)}). \end{aligned}$$
(18)

Using the new notation, we can re-write the r.h.s. of (14) as

$$\begin{aligned}&{\mathcal {W}}^{\backslash 3} + {\mathcal {W}}^{\backslash 2} + {\mathcal {W}}^{\backslash 1} = (w^*_{1,2,(3)}+w^*_{1,4,(3)}+w^*_{2,4,(3)}) \\&\quad + (w^*_{1,3,(2)}+w^*_{1,4,(2)}+w^*_{3,4,(2)}) + (w^*_{2,3,(1)}+w^*_{2,4,(1)}+w^*_{3,4,(1)}). \end{aligned}$$
(19)

To finish the proof we show that the r.h.s. of (18) can be upper bounded by the r.h.s. of (19). We use the triangle inequality for \(w^*_{i,j,(k)}\) and apply it to the 1st, 4th, and 5th terms on the r.h.s. of (18), as indicated by the parentheses:

$$\begin{aligned}&(w^*_{1,4,(3)} + w^*_{2,4,(1)}) + (w^*_{1,4,(3)} + w^*_{3,4,(2)})+ (w^*_{2,4,(1)} + w^*_{3,4,(2)})\\&\le ( (w^*_{1,2,(3)} + w^*_{2,4,(3)})+ w^*_{2,4,(1)}) + (w^*_{1,4,(3)}+ (w^*_{1,3,(2)} + w^*_{1,4,(2)}))\\&\quad + ((w^*_{2,3,(1)} + w^*_{3,4,(1)})+ w^*_{3,4,(2)}), \end{aligned}$$

and observe that the terms in the r.h.s. of this last inequality are accounted for on the r.h.s. of (19).

Note that this last step, figuring out to which terms we should apply the triangle inequality for \(w^*\) so that we can "cover" the r.h.s. of (18) with the r.h.s. of (19), is critical. Also, the fact that we want to prove that the MMOT triangle inequality holds with \(C(n) = \Theta (n)\) makes this last step even harder. To address this, in the proofs of Theorems 3 and 4 (presented in the Appendix) we develop a general procedure and special hash functions to expand (using the triangle inequality) and to match terms.

5 Numerical experiments

We show how using an MMOT that defines an \(n\)-metric (\(n> 2\)) improves two tasks of clustering graphs, compared to using a non-\(n\)-metric MMOT or an optimal transport (OT) that defines a 2-metric. One clustering task (Sect. 5.2) is on synthetic graphs. The other (Sect. 5.3) is a real task of clustering chemical molecules.

5.1 Multi-distance based clustering

We use the same clustering strategy for both tasks. We cluster graphs by (i) computing their spectra, (ii) treating each spectrum as a probability distribution, (iii) using the WD and three different MMOTs to compute distances among these distributions, and (iv) feeding these distances to different distance-based clustering algorithms to recover the true cluster memberships. We give details of each step in our procedure next.


To produce the spectra, we transform the set of graphs \(\{G^i\}_{i}\) into vectors \(\{v^i\}_{i}\) to be clustered. Each \(v^i\) is the (complex-valued) spectrum of a matrix \(M^i\) representing non-backtracking walks on \(G^i\), which approximates the length spectrum \(\mu ^i\) of \(G^i\) (Torres et al., 2019). Object \(\mu ^i\) uniquely identifies (the 2-core (Batagelj & Zaveršnik, 2011) of) \(G^i\) (up to an isomorphism) (Constantine & Lafont, 2019), but is too abstract to be used directly. Hence, we use its approximation \(v^i\). The lengths of \(v^i\) and \(v^j\) can differ even for equal-sized \(G^i\) and \(G^j\), depending on how we approximate \(\mu ^i\). This motivates the use of distance-based clustering and OT (multi) distances, since OT allows comparing objects of different lengths. Note that, unlike the length spectrum, the classical spectrum of a graph (the eigenvalues of e.g. an adjacency matrix, Laplacian matrix, or random-walk matrix) has the advantage of having the same length for graphs with the same number of nodes. However, it does not uniquely identify a graph. For example, a star graph with 5 nodes and the graph that is the union of a square with an isolated node are co-spectral but are not isomorphic.
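A minimal sketch of the non-backtracking construction used in step (i) is below; it only illustrates the idea, and the exact approximation of the length spectrum that we use follows Torres et al. (2019).

```python
import numpy as np

def non_backtracking_spectrum(adj):
    """Eigenvalues of the non-backtracking matrix of an undirected graph
    given by its 0/1 adjacency matrix `adj` (a sketch).  The matrix is
    indexed by directed edges, and B[(u,v),(w,x)] = 1 iff v == w and x != u.
    """
    n = len(adj)
    edges = [(u, v) for u in range(n) for v in range(n) if adj[u, v] and u != v]
    m = len(edges)
    B = np.zeros((m, m))
    for a, (u, v) in enumerate(edges):
        for b, (w, x) in enumerate(edges):
            if v == w and x != u:        # the walk continues without backtracking
                B[a, b] = 1.0
    return np.linalg.eigvals(B)          # complex-valued spectrum
```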

To produce the (hyper) graph of graphs' spectra we need (hyper) edges and edge-weights. Each \(v^i\) from the previous step is interpreted as a uniform distribution \({\varvec{p}}^{i}\) over \({\Omega }^{i} = \{v^i_k\}_{k}\), the set of entries of \(v^i\). To produce the graph of graphs' spectra, we compute a sampled version \(\hat{T}^{\text {A}}\) of the matrix \(T^{\text {A}} = \{{\mathcal {W}}^{i,j}\}_{i,j}\), where \({\mathcal {W}}^{i,j}\) is the WD between \({\varvec{p}}^{i}\) and \({\varvec{p}}^{j}\) using a \({d}^{i,j}\) defined by \({{d}^{i,j}_{s,t}} = |v^i_s - v^j_t |\). We produce three different hypergraphs of graphs' spectra using different MMOTs. We compute a sampled version \(\hat{T}^{\text {B}}\) of the tensor \(T^{\text {B}} = \{{\mathcal {W}}^{i,j,k}\}_{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is our pairwise MMOT defined as in Definition 6 with \({d}^{i,j}\) defined as for \(T^{\text {A}}\). We also compute a sampled version \(\hat{T}^{\text {C}}\) of the tensor \(T^{\text {C}} = \{{\mathcal {W}}^{i,j,k}\}_{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is the barycenter MMOT defined as in Remark 8 with the \({d}^{i,j}\) as before. Finally, we compute a sampled version \(\hat{T}^{\text {D}}\) of the tensor \(T^{\text {D}}\) with \(T^{\text {D}}_{i,j,k} ={\mathcal {W}}^{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is the non-\(n\)-metric defined in Theorem 2, but now considering points in the complex plane. The sampled tensors \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\), and \(\hat{T}^{\text {D}}\) are built by randomly selecting z triples (i, j, k), \(z=100\) in the synthetic experiments and \(z=600\) in the real experiments, and setting \(\hat{T}^{\text {B}}_{i,j,k} = {T}^{\text {B}}_{i,j,k}\), \(\hat{T}^{\text {C}}_{i,j,k} = {T}^{\text {C}}_{i,j,k}\), and \(\hat{T}^{\text {D}}_{i,j,k} = {T}^{\text {D}}_{i,j,k}\). The non-sampled triples are given a very large value. The sampled matrix \(\hat{T}^{\text {A}}\) is built by sampling \((3/2)\times z\) pairs (i, j), setting \(\hat{T}^{\text {A}}_{i,j} = {T}^{\text {A}}_{i,j}\), and setting a large value for non-sampled pairs. All (multi) distances \({\mathcal {W}}^{i,j}\) and \({\mathcal {W}}^{i,j,k}\) amount to solving a linear program, for which we use CVX (Grant & Boyd, 2008, 2014).

To obtain clusters, we feed the weighted (hyper) graphs specified by \(\hat{T}^{\text {A}}\), \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\), and \(\hat{T}^{\text {D}}\) to different (hyper) graph-cut algorithms. We feed the matrix of distances \(\hat{T}^{\text {A}}\) to a spectral clustering algorithm (Shi & Malik, 2000) based on normalized random-walk Laplacians to produce one clustering solution, which we name \(\mathcal {C}^{\text {A}}\). We feed the multi-distances \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\), and \(\hat{T}^{\text {D}}\) to the two hypergraph-based clustering methods NH-Cut (Ghoshdastidar & Dukkipati, 2017a) and TTM (Ghoshdastidar & Dukkipati, 2015, 2017b). These produce clustering solutions which we name \(\mathcal {C}^{\text {B1}}\) and \(\mathcal {C}^{\text {B2}}\), \(\mathcal {C}^{\text {C1}}\) and \(\mathcal {C}^{\text {C2}}\), and \(\mathcal {C}^{\text {D1}}\) and \(\mathcal {C}^{\text {D2}}\), respectively. Both NH-Cut and TTM find clusters by (approximately) computing minimum cuts of a hypergraph where each hyperedge's weight is the MMOT distance among three graphs. Both methods require a threshold that is used to prune the hypergraph: hyperedges whose weight (multi-distance) is larger than the threshold are removed. This threshold is tuned to minimize the error of each clustering solution. We code all clustering algorithms to output N clusters, the correct number of clusters.

To compute the quality of each clustering solution, we compute the fraction of misclassified graphs. In particular, if for each \(x \in \{\) A, B1, B2, C1, C2, D1, D2 \(\}\) we define \(\mathcal {C}^x(i) = k\) to mean that clustering solution \(\mathcal {C}^x\) assigns graph \(G^i\) to cluster k, then the error in solution \(\mathcal {C}^x\) is

$$\begin{aligned} \min _{\sigma } \frac{1}{m}\sum ^{m}_{i=1} \mathbb {I}\big (\mathcal {C}^{\text {ground truth}}(i) \ne \sigma (\mathcal {C}^{x}(i))\big ), \end{aligned}$$
(20)

where \(m\) is the number of graphs, \(\mathbb {I}\) is the indicator function, and the \(\min\) is over all permutations \(\sigma\) of the cluster labels \(\{1,\dots ,N\}\), with \(N\) the number of clusters; the minimization is needed because the specific cluster IDs output have no real meaning. This value is computed 100 times (with all random numbers in our code drawn independently across experiments) so that we can report a histogram of errors, or error bars, for each method.
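A minimal sketch of the error computation in (20) (assuming cluster labels in \(\{1,\dots ,N\}\); the function and argument names are illustrative) is below.

```python
import itertools
import numpy as np

def clustering_error(ground_truth, predicted, num_clusters):
    """Fraction of misclassified graphs, minimized over all relabelings of
    the predicted cluster IDs, as in eq. (20)."""
    labels = range(1, num_clusters + 1)
    best = 1.0
    for perm in itertools.permutations(labels):
        relabel = dict(zip(labels, perm))      # one candidate relabeling sigma
        err = np.mean([gt != relabel[pr]
                       for gt, pr in zip(ground_truth, predicted)])
        best = min(best, err)
    return best
```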

The complete code for our experiments can be found here https://github.com/bentoayr/MMOT.

5.2 Synthetic graphs dataset

We generate seven synthetic clusters of ten graphs each by including in each cluster multiple random perturbations – with independent edge addition/removal with \(p=0.05\) – of a complete graph, a complete bipartite graph, a cyclic chain, a k-dimensional cube, a K-hop lattice, a periodic 2D grid, or an Erdős-Rényi graph. A random class prediction has a 0.857 error rate. To recover the class of each graph, we follow the procedure described in Sect. 5.1. In Fig. 3 we show the distribution of clustering errors using different (multi)-distances and clustering algorithms over 100 independent experiments. The mean error rates are written in the legend box of each figure.

Fig. 3 Comparing the effect that different distances and metrics have on clustering synthetic graphs. The histogram with a dashed outline is for the non-n-metric, the one with a solid outline for the barycenter MMOT, and the one with a dotted outline for the pairwise MMOT

Figure 3 (left, center) shows that, as expected, both TTM and NH-Cut work better when hyperedges are computed using an \(n\)-metric, and that the pairwise MMOT works better than the WBD.

Figure 3 (right) shows that clustering using only pairwise relationships among graphs leads to worse accuracy than using triple-wise relationships as in Fig. 3 (left, center). The results in Fig. 3 (right) are comparable to the (similarly poor) results one obtains when using the MMOT explained in Remark 7, which does not give joint transports and is trivially an \(n\)-metric. Indeed, both TTM and NH-Cut are based on spectral clustering of the hypergraph weight tensor: they first reduce it to a matrix by "summing out" one of the dimensions and then use regular spectral clustering techniques. At the same time, if a tensor \(T_{i,j,k}\) satisfies \(T_{i,j,k} = {\mathcal {W}}_{i,j} + {\mathcal {W}}_{j,k} + {\mathcal {W}}_{i,k}\) for some distances \({\mathcal {W}}_{i,j}\), the matrix obtained via such a reduction has spectral properties close to those of \({\mathcal {W}}_{i,j}\).
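To make the last claim concrete, note that if \(T_{i,j,k} = {\mathcal {W}}_{i,j} + {\mathcal {W}}_{j,k} + {\mathcal {W}}_{i,k}\) and the reduction sums out the third index over the \(m\) graphs being clustered (here \(m\) simply denotes the number of graphs), then

$$\begin{aligned} \sum _{k} T_{i,j,k} = m\,{\mathcal {W}}_{i,j} + \sum _{k}{\mathcal {W}}_{i,k} + \sum _{k}{\mathcal {W}}_{j,k}, \end{aligned}$$

i.e. the reduced matrix equals \(m\,{\mathcal {W}}\) plus two rank-one corrections, which is why its spectral properties stay close to those of \({\mathcal {W}}\) alone.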

5.2.1 Injection of triangle inequality violations

To double check that this difference in performance is due to the \(n\)-metric properties of the pairwise MMOT and the WBD, we perturb \({{\mathcal {W}}^{}}\) to introduce triangle inequality violations and measure the effect on clustering accuracy.

To introduce triangle inequality violations, we perturb the tensor \({{\mathcal {W}}^{i,j,k}}\) as follows. For each set of four different graphs (ijkl), we find which among \({{\mathcal {W}}^{i,j,k}},{{\mathcal {W}}^{i,j,l}},{{\mathcal {W}}^{i,l,k}},{{\mathcal {W}}^{l,j,k}}\) we can change the least to produce a triangle inequality violation among these values. Let us assume that this value is \({{\mathcal {W}}^{i,j,k}}\), and that to violate \({{\mathcal {W}}^{i,j,k}} \le {{\mathcal {W}}^{i,j,l}} + {{\mathcal {W}}^{i,l,k}} + {{\mathcal {W}}^{l,j,k}}\), the quantity \({{\mathcal {W}}^{i,j,k}}\) needs to increase at least by \(\delta\), where \(\delta = {{\mathcal {W}}^{i,j,l}} + {{\mathcal {W}}^{i,l,k}} + {{\mathcal {W}}^{l,j,k}} - {{\mathcal {W}}^{i,j,k}}\). We then increase \({{\mathcal {W}}^{i,j,k}}\) by \(1.3\times \delta\). We repeat this procedure such that, in total, \(20\%\) of the entries of the tensor \({{\mathcal {W}}^{}}\) get changed.
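A sketch of this perturbation is below; for readability it updates all index orderings of the symmetric tensor directly in Python, whereas the actual experiment is implemented in run_me_for_synthetic_experiments.m with fraction_viol = 0.2 and strength_of_viol = 0.3 (see Sect. 5.2.2).

```python
from itertools import combinations, permutations
import random

def inject_violations(W, frac=0.2, strength=0.3, seed=0):
    """Perturb a symmetric 3-way distance tensor W (sketch of Sect. 5.2.1):
    for random quadruples (i, j, k, l), raise the entry that needs the
    smallest increase delta to violate the generalized triangle inequality,
    increasing it by (1 + strength) * delta, until roughly `frac` of the
    triplets have been changed."""
    rng = random.Random(seed)
    W = W.copy()
    m = W.shape[0]
    target = int(frac * m * (m - 1) * (m - 2) / 6)   # triplets to perturb
    quads = list(combinations(range(m), 4))
    rng.shuffle(quads)
    changed = 0
    for i, j, k, l in quads:
        if changed >= target:
            break
        trips = [(i, j, k), (i, j, l), (i, l, k), (l, j, k)]
        # deltas[t]: increase needed for trips[t] to violate the inequality
        deltas = [sum(W[trips[s]] for s in range(4) if s != t) - W[trips[t]]
                  for t in range(4)]
        t = min(range(4), key=lambda a: deltas[a])
        bump = (1 + strength) * max(deltas[t], 0.0)
        for perm in permutations(trips[t]):          # keep the tensor symmetric
            W[perm] += bump
        changed += 1
    return W
```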

Table 1 shows the effect of adding violations on the mean error rate for different MMOT distances. These violations clearly affect the pairwise MMOT and the barycenter MMOT (both \(n\)-metrics), but do not have an impact on the non-\(n\)-metric distance.

Table 1 Mean error rates for clustering the synthetic graph dataset using metric and non-metric MMOT distances, with and without injected triangle inequality violations

5.2.2 Reproducibility

Our code is fully written in Matlab 2020a. It requires installing CVX, available at http://cvxr.com/cvx/download/.

To produce Fig. 3, open Matlab and run the file run_me_for_synthetic_experiments.m. To produce the numbers in the second row of Table 1, run the same file but with fraction_viol = 0.2; strength_of_viol = 0.3;. Note that the numbers in the first row of Table 1 are the mean values in Fig. 3. The call to run_me_for_synthetic_experiments.m takes 26 hours to complete using 12 cores, each core from an Intel(R) Xeon(R) Processor E5-2660 v4 (35M Cache, 2.00 GHz). Except for the Weiszfeld.m file, all of the code was written by us and is distributed under an MIT License. This license is described in the README.txt file at the root of our repository. The license for Weiszfeld.m is in the header of the file itself.

5.3 Molecular graphs dataset

This experiment is motivated by the important task in chemistry of clustering chemical compounds, represented as graphs, by their structure (McGregor & Pallai, 1997; Seeland et al., 2014; Wilkens et al., 2005). We use the molecular dataset in the supplementary material of Sutherland et al. (2003), which can be downloaded at https://pubs.acs.org/doi/abs/10.1021/ci034143r#_i21. It contains the adjacency matrices of graphs corresponding to five types of compounds: cyclooxygenase-2 inhibitors (467 graphs), benzodiazepine receptor ligands (405 graphs), estrogen receptor ligands (1009 graphs), dihydrofolate reductase inhibitors (756 graphs), and monoamine oxidase inhibitors (1641 graphs).

To build our clusters, we randomly select ten graphs of each type and prune them so that no node has degree smaller than 1. We note that, unlike the synthetic data, the molecular graphs have weighted adjacency matrices, whose entries can be 0, 1, or 2. A random class prediction has a 0.8 error rate. To recover the class of each graph, we follow the procedure described in Sect. 5.1. We repeat this experiment 100 times, independently, to collect error statistics.

In Fig. 4 we show the distribution of clustering errors using different (multi)-distances and clustering algorithms.

Fig. 4 Comparing the effect that different distances and metrics have on clustering molecular graphs. The histogram with a dashed outline is for the non-n-metric, the one with a solid outline for the barycenter MMOT, and the one with a dotted outline for the pairwise MMOT

Figure 4 (left, center) shows that both TTM and NH-Cut work better when hyperedges are computed using \(n\)-metrics, and Fig. 4 (right) shows that clustering using pairwise relationships performs worse than using triple-wise relations. For the same reasons as explained in Sect. 5.2, the results in Fig. 4 (right) are comparable to the (similarly poor) results one would obtain using the MMOT explained in Remark 7.

There is a starker difference between \(n\)-metrics and non-\(n\)-metrics than that seen in Fig. 4, which we now discuss.

The number of possible 3-sized hyperedges is cubic in the number of graphs being clustered. Thus, in our experiments we randomly sample z triples (i, j, k) and create a hyperedge with weight \({{\mathcal {W}}^{i,j,k}}\) only for these.

Figure 5 shows the effect of z (x-axis) on performance. Comparing more graphs, i.e. increasing z, should improve clustering. However, for a non-\(n\)-metric, as z grows, triangle inequality violations can appear that introduce confusion: a graph can be "close" to two clusters that are far apart, confusing TTM and NH-Cut. This offsets the benefits of a high z and results in the flat curves in Fig. 5.

Fig. 5 Hypergraphs are built by comparing molecules' shapes using MMOT and creating hyperedges for similarly-shaped molecule groups. With a non-generalized metric, gains from using a richer hypergraph (more hyperedges) are lost due to anomalies allowed by the lack of metricity (flat curves in green and red); not so with a generalized metric (negative-slope curves). The generalized metric studied (pairwise) leads to better clustering than previously studied metrics (barycenter). Error bars are the standard deviation of the average over 100 independent experiments

5.3.1 Reproducibility

To produce Fig. 4, open Matlab and run the file run_me_for_molecular_experiments.m. The call to run_me_for_molecular_experiments.m takes 31 hours to complete using 12 cores, each core from an Intel(R) Xeon(R) Processor E5-2660 v4 (35M Cache, 2.00 GHz). Except for the Weiszfeld.m file and the molecular dataset, all of the code was written by us and is distributed under an MIT License. This license is described in the README.txt file at the root of our repository. The license for Weiszfeld.m is in the header of the file itself. The license for the dataset is described in the README.txt file inside the dataset folder.

We contacted the authors (via email) about the use of their dataset, and they informed us that there are no licenses attached to it as long as we cite it. This data contains no personally identifiable information nor offensive content.

6 Discussion and future work

In this paper, we have proved that, for a general MMOT, the cost function being a generalized metric is not sufficient to guarantee that the MMOT defines a generalized metric. Nevertheless, we have shown that a new family of multi-distances that generalize optimal transport to multiple distributions, the family of pairwise multi-marginal optimal transports (pairwise MMOT), leads to a multi-distance that satisfies generalized metric properties. This opens the door to using the pairwise MMOT in combination with several algorithms whose good performance depends on metric properties. In addition, we have established coefficients of order \(\Theta (n)\) for the generalized triangle inequality associated with the pairwise MMOT, and proved that these coefficients cannot be improved by more than a constant factor.

Our results are for the pairwise MMOT family. In future work, we seek to find new sufficient conditions under which other variants of MMOT lead to generalized metrics, and, for certain families of MMOT, find necessary conditions for these same properties to hold.

Finally, in future work we will also study how the structure of optimal coupling among distributions induced by our MMOTs, i.e. the support of \({{\varvec{p}}}^{*}\), compares with those of other MMOTs, such as the Barycenter MMOT.