Abstract
The optimal transport (OT) problem is rapidly finding its way into machine learning. Its metric properties favor its use: many problems admit solutions with guarantees only for objects embedded in metric spaces, and the use of non-metrics can complicate solving them. Multi-marginal OT (MMOT) generalizes OT to the simultaneous transport of multiple distributions. It captures important relations that are missed if the transport involves only two distributions. Research on MMOT, however, has focused on its existence, uniqueness, practical algorithms, and the choice of cost functions. There is a lack of discussion of the metric properties of MMOT, which limits its theoretical and practical use. Here, we prove new generalized metric properties for a family of pairwise MMOTs. We first explain the difficulty of proving this via two negative results. Afterward, we prove the MMOTs’ metric properties. Finally, we show that the generalized triangle inequality of this family of MMOTs cannot be improved. We illustrate the superiority of our MMOTs over other generalized metrics, and over non-metrics, in both synthetic and real tasks.
1 Introduction
Let \(({\Omega }^{1}{,\mathcal {F}^1},{\varvec{p}}^{1})\) and \(({\Omega }^{2}{,\mathcal {F}^2},{\varvec{p}}^{2})\) be two probability spaces. Given a cost function \({d}^{}: {\Omega }^{1} \times {\Omega }^{2} \rightarrow \mathbb {R}^{\ge 0}\), and \(\ell \ge 1\), the (Kantorovich) Optimal Transport (OT) problem (Kantorovich, 1942) seeks:
$${\mathcal {W}}({\varvec{p}}^{1},{\varvec{p}}^{2}) \triangleq \inf _{{\varvec{p}}} \Big ( \int _{{\Omega }^{1} \times {\Omega }^{2}} {d}(\omega ^{1},\omega ^{2})^{\ell } \,\mathrm {d}{\varvec{p}}(\omega ^{1},\omega ^{2}) \Big )^{1/\ell },\qquad (1)$$
where the \(\inf\) is over measures \({\varvec{p}}^{}\) on \({\Omega }^{1} \times {\Omega }^{2}\) whose marginals over \({\Omega }^{1}\) and \({\Omega }^{2}\) equal \({\varvec{p}}^{1}\) and \({\varvec{p}}^{2}\), respectively. Problem (1) is typically studied under the assumption that \({\Omega }^{1}\) and \({\Omega }^{2}\) lie in a Polish space on which d is a metric, in which case the minimum of (1) is the Wasserstein distance (WD). The WD is popular in many applications including shape interpolation (Solomon, 2015), generative modeling (Arjovsky et al., 2017; Fan et al., 2017), domain adaptation (Damodaran et al., 2018), and dictionary learning (Schmitz et al., 2018). The WD is a metric on the space of probability measures (Ambrosio & Gigli, 2013), and this property is useful in many ML tasks, e.g., clustering (Xing et al., 2003; Hartigan, 1975), nearest-neighbor search (Clarkson, 1999, 2006; Beygelzimer et al., 2006), and outlier detection (Angiulli & Pizzuti, 2002). Indeed, some of these tasks are tractable, or allow theoretical guarantees, when done on a metric space. E.g., finding the nearest neighbor (Beygelzimer et al., 2006; Clarkson, 1999, 2006) or the diameter (Indyk, 1999) of a dataset requires only polylogarithmic computational effort under metric assumptions; approximation algorithms for clustering rely on metric assumptions, whose absence worsens known bounds (Ackermann et al., 2010); also, Mémoli (2011) uses the metric properties of the WD to study object matching via metric invariants.
Recently, a generalization of OT to multiple marginal measures has gained attention. Given probability spaces \(({\Omega }^{i}{,\mathcal {F}^i},{\varvec{p}}^{i})\), \(i = 1,\ldots ,n\), a function \(d:{\Omega }^{1}\times \cdots \times {\Omega }^{n}\mapsto \mathbb {R}^{\ge 0}\), a number \(\ell \ge 1\), and \({\Omega }^{ \backslash i}\triangleq {\Omega }^{1}\times \cdots \times {\Omega }^{i-1}\times {\Omega }^{i+1}\times \cdots \times {\Omega }^{n}\), Multi-Marginal Optimal Transport (MMOT) seeks:
$${\mathcal {W}}({\varvec{p}}^{1},\ldots ,{\varvec{p}}^{n}) \triangleq \inf _{{\varvec{p}}} \Big ( \int _{{\Omega }^{1}\times \cdots \times {\Omega }^{n}} {d}(\omega ^{1},\ldots ,\omega ^{n})^{\ell }\,\mathrm {d}{\varvec{p}}(\omega ^{1},\ldots ,\omega ^{n}) \Big )^{1/\ell },\qquad (2)$$
where the infimum is taken over measures \({\varvec{p}}^{}\) on \({\Omega }^{1} \times \cdots \times {\Omega }^{n}\) whose marginal over each \({\Omega }^{i}\) (obtained by integrating out \({\Omega }^{ \backslash i}\)) equals \({\varvec{p}}^{i}\). The term MMOT was coined in Pass (2012), and the topic was surveyed by the same author in Pass (2015).
Unfortunately, there is a lack of discussion about the (generalized) metric properties of MMOT. Much of the discussion on MMOT has focused on the existence of a minimizer, the uniqueness and structure of both Monge and Kantorovich solutions, applications, practical algorithms, and the choice of the cost function (Gerolin et al., 2019; Moameni & Pass, 2017; Pass, 2014; Peyré & Cuturi, 2019). Since the metric property of the WD is useful in many applications, understanding when the (potential) minimum of (2), \({\mathcal {W}}({\varvec{p}}^{1},\ldots ,{\varvec{p}}^{n})\), a multi-way distance, has metric-like properties is critical. For example, just as metric properties can improve distance-based clustering, generalized metrics can improve clustering based on multi-way distances. A precise definition of generalized metrics and of the generalized triangle inequality that these must satisfy is presented in Sect. 2. Later, in Sect. 5, we illustrate the improvement in using metrics on a task of clustering chemical compounds. Importantly, several algorithms in Xing et al. (2003), Hartigan (1975), Clarkson (2006), Clarkson (1999), Beygelzimer et al. (2006), Angiulli and Pizzuti (2002), Indyk (1999), Ackermann et al. (2010), and more, which use distances including the WD as input, have guarantees if the distances are metrics. These algorithms extend naturally to consuming multi-distances, and hence can use MMOT, and they retain guarantees under generalized metrics similar to those under classic metrics. We now exemplify these extensions, and their potential applications:
Example 1
Given a set S with \(n\) distinct distributions we can find its 3-diameter \(\Delta \triangleq \max _{{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3} \in S} {\mathcal {W}}({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3})\) with \(n\atopwithdelims ()3\) evaluations of \({\mathcal {W}}\). What if \({\mathcal {W}}\) satisfies the generalized triangle inequality \({\mathcal {W}}^{1,2,3}\le {\mathcal {W}}^{4,2,3} + {\mathcal {W}}^{1,4,3} + {\mathcal {W}}^{1,2,4}\), where \({\mathcal {W}}^{i,j,k} \triangleq {\mathcal {W}}({\varvec{p}}^{i},{\varvec{p}}^{j},{\varvec{p}}^{k})\)? Let \(\Delta = {\mathcal {W}}({{{\varvec{p}}}^{*}}^{1},{{{\varvec{p}}}^{*}}^{2},{{{\varvec{p}}}^{*}}^{3})\) for some \({{{\varvec{p}}}^{*}}^{1},{{{\varvec{p}}}^{*}}^{2},{{{\varvec{p}}}^{*}}^{3}\) and let us extend the notation \({\mathcal {W}}^{i,j,k}\) such that e.g. \({\mathcal {W}}^{i,*j,k} \triangleq {\mathcal {W}}({\varvec{p}}^{i},{{{\varvec{p}}}^{*}}^{j},{\varvec{p}}^{k})\). We can now show that \({\mathcal {W}}\ge \Delta /3\) for at least \(n/3\) distribution triplets. Indeed, for any \({\varvec{p}}^{4} \in S\), we cannot simultaneously have \({\mathcal {W}}^{4,*2,*3},{\mathcal {W}}^{*1,*2,4},{\mathcal {W}}^{*1,4,*3}< \Delta /3\), since by the generalized triangle inequality these three values sum to at least \(\Delta\). Therefore, if we evaluate \({\mathcal {W}}\) on random distribution triplets, the probability of finding a triplet for which \({\mathcal {W}}\ge \Delta /3\) is \(\Omega (1/n^2)\) (see Footnote 1). Hence, we are guaranteed to find an approximation \(\Delta '\) of \(\Delta\) that satisfies \(\Delta \ge \Delta ' \ge \Delta /3\) (also called a (1/3)-approximation) with only \(\mathcal {O}(n^2)\) evaluations of \({\mathcal {W}}\) on average, an improvement over \(n\atopwithdelims ()3\). Diameter estimation relates to outlier detection (Angiulli & Pizzuti, 2002), which is critical e.g. in cybersecurity (Singh & Upadhyaya, 2012).
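The sketch below illustrates this randomized (1/3)-approximation of the 3-diameter; here mmot3 and dists are placeholders for any 3-way distance satisfying the above inequality and any collection of distributions, and are not part of our released code.

```python
# A minimal sketch of Example 1: sample O(n^2) random triplets and return the
# largest 3-way distance seen; if `mmot3` satisfies the generalized triangle
# inequality, the result is >= (3-diameter)/3 with high probability.
import random

def approx_3diameter(dists, mmot3, num_samples):
    """dists: list of distributions; mmot3(p, q, r) -> float."""
    best = 0.0
    for _ in range(num_samples):
        p, q, r = random.sample(dists, 3)  # a uniformly random distinct triplet
        best = max(best, mmot3(p, q, r))
    return best

# e.g., num_samples = C * len(dists) ** 2 for a constant C
```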
Example 2
Let S be as above. We can compute the average \(A \triangleq \sum _{{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3} \in S} {{\mathcal {W}}^{}}({{\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}})/{ n \atopwithdelims ()3}\) with \(n \atopwithdelims ()3\) evaluations of \({{\mathcal {W}}^{}}\). We can estimate A by averaging \({\mathcal {W}}\) over a set with \(\mathcal {O}(n^2)\) distinct triplets randomly sampled from S, improving over \(n \atopwithdelims ()3\).
If \({{\mathcal {W}}^{}}\) is a generalized metric, an argument as in Example 1 shows that with high probability we do not miss triplets with large \({{\mathcal {W}}^{}}\), which is the critical step in proving that we approximate A well. Average estimation is critical e.g. in differential privacy (Dwork & Lei, 2009).
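A sketch of this estimator follows; as before, mmot3 is a placeholder, and we assume num_samples does not exceed the number of distinct triplets.

```python
# A minimal sketch of Example 2: estimate the average 3-way distance from
# O(n^2) distinct triplets sampled uniformly at random.
import random

def approx_average(dists, mmot3, num_samples):
    n = len(dists)
    seen = set()
    while len(seen) < num_samples:
        seen.add(tuple(sorted(random.sample(range(n), 3))))  # distinct triplet
    return sum(mmot3(dists[i], dists[j], dists[k])
               for i, j, k in seen) / num_samples
```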
Example 3
Let S be as above. Consider building a hypergraph with nodes S and hyperedges defined as follows. For each distinct triplet \(({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3})\) for which \({{\mathcal {W}}^{}}({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}) < thr\), a constant threshold, include it as a hyperedge. Hypergraphs are increasingly important in modern ML, especially for clustering using multiwise relationships among objects (Ghoshdastidar & Dukkipati, 2015; Purkait et al., 2016). Let \({{\mathcal {W}}^{}}\) satisfy the triangle inequality in Example 1 and be invariant under permutations of its arguments. Let us again define \({{\mathcal {W}}^{}}({\varvec{p}}^{i},{\varvec{p}}^{j},{\varvec{p}}^{k}) = {{\mathcal {W}}^{i,j,k}}\). One can prove that, for any \({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\), \({{\mathcal {W}}^{1,2,3}} \ge \max \{ {{\mathcal {W}}^{1,2,4}}-{{\mathcal {W}}^{1,3,4}}-{{\mathcal {W}}^{2,3,4}} , {{\mathcal {W}}^{2,3,4}}-{{\mathcal {W}}^{1,3,4}}-{{\mathcal {W}}^{1,2,4}}, {{\mathcal {W}}^{1,3,4}}- {{\mathcal {W}}^{1,2,4}}-{{\mathcal {W}}^{2,3,4}}\}\). We can use this inequality to quickly ignore triplets with large \({{\mathcal {W}}^{1,2,3}}\) without evaluating them: plug into it already-computed values for \({{\mathcal {W}}^{1,2,4}}\), \({{\mathcal {W}}^{1,3,4}}\), \({{\mathcal {W}}^{2,3,4}}\), do this for multiple \({\varvec{p}}^{4}\) to try to maximize the r.h.s., and check if the r.h.s. is larger than thr.
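The pruning rule can be sketched as follows; the cache layout and function names are illustrative assumptions, not part of our released code.

```python
# A minimal sketch of Example 3: certify that W(i, j, k) > thr from cached
# values via the reverse generalized triangle inequality, skipping the
# (expensive) evaluation of W(i, j, k).
def can_skip(cache, i, j, k, candidates, thr):
    """cache maps sorted index triples to already-computed W values."""
    key = lambda a, b, c: tuple(sorted((a, b, c)))
    for l in candidates:  # multiple fourth distributions p^4
        needed = [key(i, j, l), key(i, k, l), key(j, k, l)]
        if all(t in cache for t in needed):
            w_ijl, w_ikl, w_jkl = (cache[t] for t in needed)
            bound = max(w_ijl - w_ikl - w_jkl,
                        w_jkl - w_ikl - w_ijl,
                        w_ikl - w_ijl - w_jkl)
            if bound > thr:  # lower bound already exceeds the threshold
                return True
    return False
```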
In this paper, we show that an important family of pairwise MMOTs defines a generalized metric. The idea and advantages of using pairwise-type MMOTs is not new. Previous work Altschuler and Boix-Adsera (2020), Altschuler and Boix-Adsera (2021), Benamou et al. (2015, 2016), Fan et al. (2022), Haasler et al. (2021), Haasler et al. (2021), however, uses them to make a point about concepts other than their metric properties. Most often, the focus is on the polynomial computability of MMOT under extra assumptions. To the best of our knowledge, we are the first to study the generalized metric properties of families of pairwise MMOT. There are many applications of MMOT, and of pairwise-type MMOT more specifically. In addition to the applications referenced in the aforementioned papers, the survey (Pass, 2015) lists applications in economics, physics, and financial mathematics. Specific examples of applications that use pairwise MMOT and that may benefit from the generalized metric properties we prove are locality sensitive hashing, labeling problem in classification, image registration, multi-agent matching (Li & Anantharam, 2019), and optimal coupling of multiple random variables (Angel & Spinka, 2019).
The rest of this paper is organized as follows. We first explain the difficulty of proving these metric properties via two negative results (Sects. 3.1 and 3.2). We present our main results on the generalized metric properties of this family of pairwise MMOTs in Sect. 3.3. We present a proof for a simple case to illustrate the key ideas of the main proof in Sect. 4. Finally, with various numerical examples, we show that an MMOT from our family, which defines an n-metric, improves the task of clustering graphs (Sect. 5).
2 Definitions and setup
Lists We write \(s_1,\ldots ,s_k\) as \(s_{1:k}\), \({\Omega }^{1},\ldots ,{\Omega }^{k}\) as \({\Omega }^{1:k}\), and \(A_{s_1,\ldots ,s_k}\) as \(A_{s_{1:k}}\). Note that \(A_{s_{1}:s_{k}}\) differs from \(A_{s_{1:k}}\). Assuming \(s_k > s_1\), we have \(A_{s_{1}:s_{k}} \equiv A_{s_{1}},A_{s_{1}+1},A_{s_{1}+2},\ldots ,A_{s_{k}}\). By itself, 1 : i has no meaning, and it does not mean \(1,\ldots ,i\). For \(i\in \mathbb {N}\), we let \([i] \triangleq \{1,\ldots ,i\}\).
Generalized inner-product Given two equi-multidimensional arrays A and B, and \(\ell \in \mathbb {N}\), we define
$$\left\langle A , B \right\rangle \triangleq \sum _{s_{1:k}} A_{s_{1:k}} B_{s_{1:k}},\qquad (3)$$
where the sum runs over all index lists \(s_{1:k}\), and we often apply it to \((A)^{\ell }\), where \((\cdot )^\ell\) is the element-wise \(\ell\)th power. Later on, A will be an array of distances and B will be an array of probabilities.
Probability spaces To facilitate exposition, we state our main contributions for probability spaces with finite sample spaces contained in \({\Omega }\), an event \(\sigma\)-algebra equal to the power set of the sample space, and a probability measure described by a probability mass function. We refer to probability mass functions using bold letters, e.g. \({\varvec{p}}\), \({\varvec{q}}\), \({\varvec{r}}\), etc.
When talking about \(n\) probability spaces, the ith space has sample space \({\Omega }^{i} = \{{\Omega }^{i}_{1:{\vert \Omega ^{i}\vert }}\} \subseteq {\Omega }\), an event space \(2^{{\Omega }^{i}}\), and a probability mass function \({\varvec{p}}^{i}\), or \({\varvec{q}}^{i}\), or \({\varvec{r}}^{i}\), etc. Variable \({\vert \Omega ^{i}\vert }\) is the number of atoms in \({\Omega }^{i}\). Symbol \({\varvec{p}}^{i}_{s}\) denotes the probability of the atomic event \(\{{\Omega }^{i}_{s}\}\). Without loss of generality (w.l.o.g.) we assume \({\varvec{p}}^{i}_{s} > 0, \forall i\in [n], \forall s\in [{\vert \Omega ^{i}\vert }]\). Our notation assumes that atoms can be indexed, but our results extend beyond this assumption. W.l.o.g., we assume that \({\Omega }^{i}_{s} = {\Omega }^{i}_{t}\) if and only if \(s = t\).
Symbol \({\varvec{p}}^{i_{1:k}}\) denotes a mass function for the probability space with sample space \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}\) and event space \(2^{{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}}\). In particular, \({\varvec{p}}^{i_{1:k}}_{s_{1:k}}\) (i.e. \({\varvec{p}}^{i_{1},\ldots ,i_k}_{s_1,\ldots ,s_{k}}\)) is the probability of the atomic event \(\{({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_k}_{s_k})\}\). We use \({\varvec{p}}^{i_{1:k} \mid j_{1:r}}\) to denote a probability mass function for the probability space with sample space \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}\), and event space \(2^{{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_k}}\), such that \({\varvec{p}}^{i_{1:k} \mid j_{1:r}}_{s_{1:k} \mid t_{1:r}} \triangleq {{\varvec{p}}^{i_1,\ldots ,i_k , j_1,\ldots ,j_r}_{s_1,\ldots ,s_k , t_1,\ldots ,t_r} }/{{\varvec{p}}^{j_1,\ldots ,j_r}_{ t_1,\ldots ,t_r} }\), i.e. a conditional probability.
Definition 1
(Gluing map) Consider a mass function \({\varvec{q}}^{k}\) over \({\Omega }^{k}\) and \(n-1\) conditional mass functions \(\{{\varvec{q}}^{i \mid k}\}_{i \in [n]\backslash \{k\}}\) over \(\{{\Omega }^{i}\}_{i \in [n]\backslash \{k\}}\). The map \({\mathcal {G}}\), that is called a gluing map, defines the mass function \({\varvec{p}}\) over \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\) as
$${\varvec{p}} \triangleq {\mathcal {G}}\big ({\varvec{q}}^{k}, \{{\varvec{q}}^{i \mid k}\}_{i \in [n]\backslash \{k\}}\big ).$$
To be more specific, \({\varvec{p}}^{}_{s_1,\ldots ,s_n} = {\varvec{q}}^{k}_{s_k} \prod _{i \in [n]\backslash \{k\}} {\varvec{q}}^{i\mid k}_{s_i \mid s_k}.\)
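For finite spaces the gluing map is a simple product of arrays. The sketch below is our illustrative implementation, assuming the conditionals are stored column-wise:

```python
# A minimal sketch of Definition 1 for finite spaces: glue a mass q_k over
# Omega^k with conditionals q^{i|k} into a joint mass. Each conditional is an
# array of shape (k_i, m) whose column t is q^{i|k}(. | t).
import numpy as np

def glue(q_k, conditionals):
    p = q_k  # shape (m,); the Omega^k axis stays last throughout
    for c in conditionals:
        # prepend the new space's axis; both factors share the last axis
        p = c.reshape(c.shape[0], *([1] * (p.ndim - 1)), c.shape[1]) * p
    return p

q3 = np.array([0.5, 0.5])
q1_given_3 = np.array([[1.0, 0.0], [0.0, 1.0]])  # X1 = X3
q2_given_3 = np.array([[0.0, 1.0], [1.0, 0.0]])  # X2 != X3
p = glue(q3, [q1_given_3, q2_given_3])  # joint over (X2, X1, X3)
assert np.isclose(p.sum(), 1.0)
```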
Distances and metrics We use “distance” to refer to an object that, depending on extra assumptions, might, or might not, have the properties of a (generalized) metric. For a metric we use the standard definition, and for generalized metrics we use the definitions in Kiss et al. (2018).
The definition in Kiss et al. (2018) is the same as ours but is expressed in a slightly different form. In particular, the C(n) in our Definition 3 corresponds to \(1/K_n\) in Kiss et al. (2018). Also, Kiss et al. (2018) express their generalized triangle inequality by replacing the rth variable inside a multi-distance with a new variable, while in our definition we remove the rth variable and append a new variable at the end of the argument list. Given that both definitions also require permutation invariance of the arguments of the multi-distance, the two definitions are equivalent.
Finally, our notions of metric and generalized metric are more general than usual in the sense that they support the use of different distance functions depending on the spaces from which we draw elements. This grants an extra layer of generality to our results.
Definition 2
(Metric) Let \(\{{d}^{i,j}\}_{i,j}\) be a set of distances of the form \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) and \({d}^{i,j}({\Omega }^{i}_{s},{\Omega }^{j}_{t}) \triangleq {{d}^{i,j}_{s,t}}\). We say that \({d}^{}\) is a metric when, for any i, j, k, and any \(s\in [{\vert \Omega ^{{i}}\vert }], r\in [{\vert \Omega ^{{j}}\vert }], t\in [{\vert \Omega ^{{k}}\vert }]\), we have: (1) \({d}^{i,j}_{s,r} \ge 0\); (2) \({d}^{i,j}_{s,r} = {d}^{j,i}_{r,s}\); (3) \({d}^{i,j}_{s,r} = 0 \text { iff } {\Omega }^{i}_{s} = {\Omega }^{j}_{r}\); (4) \({d}^{i,j}_{s,r} \le {d}^{i,k}_{s,t} + {d}^{k,j}_{t,r}\).
Definition 3
(Generalized metric) Let \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) be a set of distances of the form \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n} \mapsto \mathbb {R}\) and \({d}^{i_{1:n}}({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_n}_{s_n}) \triangleq {{d}^{i_{1:n}}_{s_{1:n}}}\). We say that \({d}\) is an \((n,C(n))\)-metric when, for any \(i_{1:n+1}\) and \(s_{1:n+1}\) with \(s_r \in [{\vert \Omega ^{i_r}\vert }]\ \forall r\in [n+1]\), we have: (1) \({{d}^{i_{1:n}}_{s_{1:n}}} \ge 0\); (2) \({{d}^{i_{1:n}}_{s_{1:n}}} = {{d}^{\sigma (i_{1:n})}_{\sigma (s_{1:n})}}\) for any permutation \(\sigma\); (3) \({{d}^{i_{1:n}}_{s_{1:n}}} = 0 \text { iff } {\Omega }^{i_r}_{s_r} = {\Omega }^{i_t}_{s_t},\; \forall r,t\in [n]\); (4) \(C(n){{d}^{i_{1:n}}_{s_{1:n}}} \le \sum ^{n}_{r=1} {{d}^{i_1,\ldots ,i_{r-1},i_{r+1},\ldots ,i_{n+1}}_{s_1,\ldots ,s_{r-1},s_{r+1},\ldots ,s_{n+1}}}\).
Definition 4
(Generalized metric on distributions) Let \({{\mathcal {W}}^{}}\) be a map from \(n\) probability spaces to \(\mathbb {R}\) such that \({{\mathcal {W}}^{i_{1:n}}} \triangleq {\mathcal {W}}({\varvec{p}}^{i_{1:n}})\) is the image of the probability spaces with indices \(i_{1:n}\). For any \(n+1\) probability spaces with samples \({\Omega }^{1:n+1}\) and masses \({\varvec{p}}^{1:n+1}\), and any permutation \(\sigma\), \({{\mathcal {W}}^{}}\) is an \((n,C(n))\)-metric if: (1) \({\mathcal {W}}^{1,\ldots ,n} \ge 0\); (2) \({\mathcal {W}}^{1,\ldots ,n} = 0 \text { iff } {\varvec{p}}^{i} = {\varvec{p}}^{j}\), \({\Omega }^{i}={\Omega }^{j}, \ \forall i,j\); (3) \({\mathcal {W}}^{1,\ldots ,n} = {\mathcal {W}}^{\sigma (1,\ldots ,n)}\); (4) \(C(n) {\mathcal {W}}^{1,\ldots ,n} \le \sum \limits ^{n}_{r = 1} {\mathcal {W}}^{1,\ldots ,r-1,r+1,\ldots ,n+1}\).
Remark 1
Equalities \({\varvec{p}}^{i} = {\varvec{p}}^{j}\) and \({\Omega }^{i}={\Omega }^{j}\), mean that \({\vert \Omega ^{i}\vert } = {\vert \Omega ^{j}\vert }\), and that there exists a bijection \(b^{i,j}(\cdot )\) from \([{\vert \Omega ^{i}\vert }]\) to \([{\vert \Omega ^{j}\vert }]\) such that \({\varvec{p}}^{i}_{s} = {\varvec{p}}^{j}_{b^{i,j}(s)}\) and \({\Omega }^{i}_{s}={\Omega }^{j}_{b^{i,j}(s)}\), \(\forall\) \(s \in [{\vert \Omega ^{i}\vert }]\).
Remark 2
The inequality in property 4 of Definition 4 is the generalized triangle inequality. Figure 1 depicts a geometric analogy for the generalized triangle inequality. Just like the measure (length) of one of the sides of a triangle is always smaller than the sum of the measures of the two other sides, the measure (area/volume/etc.) of one of the facets of a simplex is always smaller than the sum of the measures of the other facets. The constant C(n) is included in our definition so that we can later prove that an inequality stronger than the one with \(C(n) =1\) holds.
We abbreviate \((n,1)\)-metric by \(n\)-metric. In our setup, the \(\inf\) in (2) is always attained (recall that the sample spaces are finite) and amounts to solving an LP. We refer to the minimizing distributions by \({\varvec{p}}^{*}, {\varvec{q}}^{*}, {\varvec{r}}^{*}\), etc. We define the following map from \(n\) probability spaces to \(\mathbb {R}\). The definition below amounts to (2) when the \({\varvec{p}}\)’s are empirical measures.
Definition 5
(MMOT distance for finite spaces) Let \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) be a set of distances of the form \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\mapsto \mathbb {R}\) and \({d}^{i_{1:n}}({\Omega }^{i_1}_{s_1},\ldots ,{\Omega }^{i_n}_{s_n}) \triangleq {{d}^{i_{1:n}}_{s_{1:n}}}\). The MMOT distance associated with \(\{{d}^{i_{1:n}}\}_{i_{1:n}}\) for \(n\) probability spaces with masses \({\varvec{p}}^{i_{1:n}}\) over \({\Omega }^{i_{1:n}}\) is
$${\mathcal {W}}({\varvec{p}}^{i_{1:n}}) \triangleq \Big ( \min _{{\varvec{r}}}\ \big \langle ({d}^{i_{1:n}})^{\ell } , {\varvec{r}} \big \rangle \Big )^{1/\ell } \ \ \text {subject to}\ \ {\varvec{r}}^{i_s} = {\varvec{p}}^{i_s},\ \forall s \in [n],\qquad (4)$$
where \({\varvec{r}}^{}\) is a mass function over \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\), and \({\varvec{r}}^{i_s}\) is the marginal probability of \({\varvec{r}}\) on \({\Omega }^{i_s}\).
Remark 3
Solving (4) amounts to solving a linear program when \(\ell = 1\), and to solving a convex optimization problem when \(\ell >1\) and d is non-negative. Solving a linear program can be done in polynomial time in the number of variables (Karmarkar, 1984). However, for general MMOT, the number of variables is exponential in the number of marginals n and their support sizes k, which makes the overall run time \(\text {poly}(k^n)\). Even approximately solving MMOT in some general instances is NP-hard. However, under certain structural assumptions on the input costs/distances, we can get tractable algorithms (Altschuler & Boix-Adsera, 2021; Lin et al., 2022).
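For tiny instances, (4) with \(\ell = 1\) can be solved by direct enumeration. The sketch below is ours, not an excerpt of our released Matlab code, and builds the LP for three marginals explicitly:

```python
# A minimal sketch of the MMOT LP in Definition 5 (ell = 1, n = 3) by direct
# enumeration; the coupling has k^n variables, so this is only viable for
# very small instances.
import numpy as np
from scipy.optimize import linprog

def mmot_lp(cost, marginals):
    """cost: array of shape (k1, ..., kn); marginals: list of n 1-D arrays."""
    shape = cost.shape
    A_eq, b_eq = [], []
    # one constraint per atom of each marginal: summing the coupling r over
    # the remaining axes must reproduce that marginal
    for axis, p in enumerate(marginals):
        for s in range(shape[axis]):
            mask = np.zeros(shape)
            idx = [slice(None)] * len(shape)
            idx[axis] = s
            mask[tuple(idx)] = 1.0
            A_eq.append(mask.ravel())
            b_eq.append(p[s])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(shape)

rng = np.random.default_rng(0)
value, coupling = mmot_lp(rng.random((2, 2, 2)), [np.ones(2) / 2] * 3)
```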
3 Main results
To prove that MMOT leads to an \(n\)-metric, it is natural to extend the ideas in the classical proof that WD is a metric. The hardest property to prove is the triangle inequality. Its proof for the WD follows from a gluing lemma and the assumption that \({d}\) itself is a metric (Definition 2). Our hope is that we can prove the triangle inequality for MMOT if we can prove (a) a generalized gluing lemma and assume that (b) \({d}\) is a generalized metric. Unfortunately, and as we explain in Sects. 3.1 and 3.2, (a) is not possible, and (b) is not sufficient. This requires developing completely new proofs.
3.1 The gluing lemma does not generalize to higher dimensions
The gluing lemma used to prove that WD is a metric is as follows. For its proof see Ambrosio and Gigli (2013), Lemma 2.1.
Lemma 1
(Gluing lemma) Let \({\varvec{p}}^{1,3}\) and \({\varvec{p}}^{2,3}\) be arbitrary mass functions for \({\Omega }^{1}\times {\Omega }^{3}\) and \({\Omega }^{2}\times {\Omega }^{3}\), respectively, with the same marginal, \({\varvec{p}}^{3}\), over \({\Omega }^{3}\). There exists a mass function \({\varvec{r}}^{1,2,3}\) for \({\Omega }^{1}\times {\Omega }^{2}\times {\Omega }^{3}\) whose marginals over \({\Omega }^{1}\times {\Omega }^{3}\) and \({\Omega }^{2}\times {\Omega }^{3}\) equal \({\varvec{p}}^{1,3}\) and \({\varvec{p}}^{2,3}\) respectively.
The way Lemma 1 is used to prove WD’s triangle inequality is as follows. Assume \({d}\) is a metric (Definition 2). Let \(\ell = 1\) for simplicity. Let \({{{\varvec{p}}}^{*}}^{1,2}\), \({{{\varvec{p}}}^{*}}^{1,3}\), and \({{{\varvec{p}}}^{*}}^{2,3}\) be optimal transports such that \({\mathcal {W}}^{1,2} = \left\langle {{{\varvec{p}}}^{*}}^{1,2},{d}^{1,2} \right\rangle\), \({\mathcal {W}}^{1,3} =\left\langle {{{\varvec{p}}}^{*}}^{1,3},{d}^{1,3} \right\rangle\), and \({\mathcal {W}}^{2,3} =\left\langle {{{\varvec{p}}}^{*}}^{2,3},{d}^{2,3} \right\rangle\). Define \({\varvec{r}}^{1,2,3}\) as in Lemma 1, and let \({\varvec{r}}^{1,3}\), \({\varvec{r}}^{2,3}\), and \({\varvec{r}}^{1,2}\) be its bivariate marginals. We then have
$${\mathcal {W}}^{1,2} \le \left\langle {\varvec{r}}^{1,2},{d}^{1,2} \right\rangle \le \left\langle {\varvec{r}}^{1,3},{d}^{1,3} \right\rangle + \left\langle {\varvec{r}}^{2,3},{d}^{2,3} \right\rangle = {\mathcal {W}}^{1,3} + {\mathcal {W}}^{2,3},$$
where the first inequality holds because \({\varvec{r}}^{1,2}\) is a feasible (possibly sub-optimal) transport plan, the second follows from applying \({d}^{1,2}_{s,t} \le {d}^{1,3}_{s,u} + {d}^{3,2}_{u,t}\) under \({\varvec{r}}^{1,2,3}\), and the equality holds because \({\varvec{r}}^{1,3} = {{{\varvec{p}}}^{*}}^{1,3}\) and \({\varvec{r}}^{2,3} = {{{\varvec{p}}}^{*}}^{2,3}\) by construction.
Our first roadblock is that Lemma 1 does not generalize to higher dimensions. For simplicity, we now omit the sample spaces on which mass functions are defined. When a set of mass functions has all its marginals over the same sample sub-spaces equal, we say the mass functions are compatible.
Theorem 1
(No gluing) There exist mass functions \({\varvec{p}}^{1,2,4}\), \({\varvec{p}}^{1,3,4}\), and \({\varvec{p}}^{2,3,4}\) with compatible marginals such that there is no mass function \({\varvec{r}}^{1,2,3,4}\) compatible with them.
Proof
If this were not the case, then, by computing marginals of the mass functions in the theorem’s statement, it would be true that, given arbitrary mass functions \({\varvec{p}}^{1,2}\), \({\varvec{p}}^{1,3}\), and \({\varvec{p}}^{2,3}\) with compatible univariate marginals, we should be able to find \({\varvec{r}}^{1,2,3}\) whose bivariate marginals equal these three mass functions. But this is not the case. A counterexample, taken from Remark 3 in Haasler et al. (2021), follows. Let
$${\varvec{p}}^{1,2} = {\varvec{p}}^{2,3} = \frac{1}{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad {\varvec{p}}^{1,3} = \frac{1}{2}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
These marginals have compatible univariate marginals, namely, \({\varvec{p}}^{1} = {\varvec{p}}^{2} = {\varvec{p}}^{3} = [1, 1]/2\). Yet, the corresponding system of equations over \(\{{\varvec{r}}^{1,2,3}_{i,j,k}\}_{i,j,k \in [2]}\) is easily checked to be infeasible: by non-negativity, the zeros of \({\varvec{p}}^{1,2}\) force \({\varvec{r}}^{1,2,3}_{i,j,k} = 0\) whenever \(i \ne j\), the zeros of \({\varvec{p}}^{2,3}\) force \({\varvec{r}}^{1,2,3}_{i,j,k} = 0\) whenever \(j \ne k\), and the zeros of \({\varvec{p}}^{1,3}\) force \({\varvec{r}}^{1,2,3}_{i,j,k} = 0\) whenever \(i = k\). Together, these constraints make every entry of \({\varvec{r}}^{1,2,3}\) zero, which contradicts \(\sum _{i,j,k} {\varvec{r}}^{1,2,3}_{i,j,k} = 1\).
\(\square\)
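The infeasibility above can also be verified numerically. The sketch below, ours and for illustration only, poses the gluing problem as a feasibility LP and lets the solver certify that it has no solution:

```python
# A quick numerical check of the no-gluing counterexample: ask for a mass r on
# {1,2}^3 whose three bivariate marginals match p12, p13, p23; HiGHS reports
# the program infeasible (status 2).
import itertools
import numpy as np
from scipy.optimize import linprog

p12 = p23 = np.array([[0.5, 0.0], [0.0, 0.5]])  # forces X1 = X2 and X2 = X3
p13 = np.array([[0.0, 0.5], [0.5, 0.0]])        # but forces X1 != X3

A_eq, b_eq = [], []
for axes, target in [((0, 1), p12), ((0, 2), p13), ((1, 2), p23)]:
    for s, t in itertools.product(range(2), repeat=2):
        mask = np.zeros((2, 2, 2))
        idx = [slice(None)] * 3
        idx[axes[0]], idx[axes[1]] = s, t
        mask[tuple(idx)] = 1.0
        A_eq.append(mask.ravel())
        b_eq.append(target[s, t])

res = linprog(np.zeros(8), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
assert res.status == 2  # infeasible
```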
Remark 4
It might be possible to obtain a generalized gluing lemma if, in addition to requiring consistency of the univariate marginal distributions, we also require consistency of the \((n-1)\)-variate marginal distributions. We leave studying properties of MMOTs defined with additional k-variate marginal consistency constraints as future work.
3.2 Cost d being an \(n\)-metric is not a sufficient condition for MMOT to be an \(n\)-metric
Theorem 1 tells us that, even if we assume that \({d}\) is an \(n\)-metric, we cannot adapt the classical proof showing WD is a metric to show that MMOT leads to an \(n\)-metric. The question remains, however, whether such a proof exists at all under the sole assumption that \({d}\) is an \(n\)-metric. Theorem 2 settles this question in the negative.
Theorem 2
Let \({\mathcal {W}}\) be the MMOT distance as in Definition 5 with \(\ell =1\). There exist a sample space \({\Omega }^{}\), mass functions \({\varvec{p}}^{1}\), \({\varvec{p}}^{2}\), \({\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\) over \({\Omega }^{}\), and a distance \({d}:{\Omega }^{}\times {\Omega }^{}\times {\Omega }^{} \mapsto \mathbb {R}\) such that \({d}\) is an \(n\)-metric (\(n= 3\)), but
$${\mathcal {W}}^{1,2,3} > {\mathcal {W}}^{4,2,3} + {\mathcal {W}}^{1,4,3} + {\mathcal {W}}^{1,2,4}.\qquad (5)$$
Proof
Let \({\Omega }\) be the six points in Fig. 2, where we assume that \(0 < \epsilon \ll 1\), and hence that there are no three collinear points, and no two equal points. Let \({\varvec{p}}^{1},{\varvec{p}}^{2},{\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\) be as in Fig. 2; each is represented by a unique color and is uniformly distributed over the points of the same color. Given any \(x,y,z \in {\Omega }\) let \({d}(x,y,z) = \gamma\) if exactly two points are equal, and let \({d}(x,y,z)\) be the area of the corresponding triangle otherwise, where \(\gamma\) lower bounds the area of the triangle formed by any three non-collinear points, e.g. \(\gamma = \epsilon / 4\). A few geometric considerations (see Appendix 1) show that \({d}\) is an \(n\)-metric (\(n=3, C(n) = 1\)) and that (5) holds as \(\frac{1}{2} > \frac{1}{8} + \left( \frac{1}{8} + \frac{\epsilon }{4} \right) + \left( \frac{1}{8} + \frac{\epsilon }{4}\right)\). \(\square\)
Remark 5
Theorem 2 can be generalized to spaces of dimension \(> 2\), to \(n> 3\), and to \(\ell > 1\).
3.3 A new family of pairwise MMOTs that are generalized metrics
We will prove that the properties of a generalized metric (Definition 4) hold for the novel family of MMOT distances defined as follows.
Definition 6
(Pairwise MMOT distance) Let \(\{{d}^{i,j}\}_{i,j}\) be a set of distances of the form \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) and \({d}^{i,j}({\Omega }^{i}_{s},{\Omega }^{j}_{t}) \triangleq {{d}^{i,j}_{s,t}}\). The Pairwise MMOT distance associated with \(\{{d}^{i,j}\}_{i,j}\) for \(n\) probability spaces with masses \({\varvec{p}}^{i_{1:n}}\) over \({\Omega }^{i_{1:n}}\) is \({\mathcal {W}}({\varvec{p}}^{i_{1:n}}) \triangleq {\mathcal {W}}^{i_{1:n}}\) with
$${\mathcal {W}}^{i_{1:n}} \triangleq \Big ( \min _{{\varvec{r}}} \sum _{1 \le s < t \le n} \big \langle ({d}^{i_s,i_t})^{\ell } , {\varvec{r}}^{i_s,i_t} \big \rangle \Big )^{1/\ell } \ \ \text {subject to}\ \ {\varvec{r}}^{i_s} = {\varvec{p}}^{i_s},\ \forall s \in [n],\qquad (6)$$
where \({\varvec{r}}\) is a mass over \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n}\), with marginals \({\varvec{r}}^{i_s}\) and \({\varvec{r}}^{i_s,i_t}\) over \({\Omega }^{i_s}\) and \({\Omega }^{i_s}\times {\Omega }^{i_t}\).
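For \(\ell = 1\), the pairwise MMOT reduces to the same LP as Definition 5 after summing the pairwise costs into one tensor; a sketch, reusing mmot_lp from the sketch in Remark 3, follows:

```python
# A minimal sketch of the pairwise MMOT of Definition 6 (ell = 1, n = 3):
# the objective <d12, r^{12}> + <d13, r^{13}> + <d23, r^{23}> equals <c, r>
# for the tensor cost c below, so mmot_lp (defined earlier) applies.
import numpy as np

def pairwise_mmot(d12, d13, d23, marginals):
    cost = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
    return mmot_lp(cost, marginals)
```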
Remark 6
Note that each choice of \(\{{d}^{i,j}\}_{i,j}\) results in a different MMOT. Our results in this section hold for the whole family of such MMOTs.
Remark 7
Swapping \(\min\) and \(\sum\) gives a new definition \({\mathcal {W}}^{i_{1:n}}_{\text {pairs}} = \sum _{1 \le s<t\le n} {\mathcal {W}}^{i_s,i_t}\), where \({\mathcal {W}}^{i_s,i_t}\) is the WD between \({\Omega }^{i_s}\) and \({\Omega }^{i_t}\). This is trivially an \(n\)-metric (cf. (Kiss et al., 2018)) but is different from eq. (6). In particular, it does not provide a joint optimal transport, which is important to many applications such as color transfer (Strössner & Kressner, 2022), tomographic reconstruction (Abraham et al., 2017), and robust localization and sensor fusion (Elvander et al., 2020).
If \(n= 2\), Definition 6 reduces to the Wasserstein distance. Our definition is a special case of the Kantorovich formulation for the general MMOT problem discussed in Pass (2015). Furthermore, if \(\ell = 1\), we can get pairwise MMOT (Definition 6) from general MMOT (Definition 5), by defining \({d}^{i_{1:n}}:{\Omega }^{i_1}\times \cdots \times {\Omega }^{i_n} \mapsto \mathbb {R}\) such that
$${d}^{i_{1:n}}(\omega ^{1},\ldots ,\omega ^{n}) \triangleq \sum _{1 \le s < t \le n} {d}^{i_s,i_t}(\omega ^{s},\omega ^{t}),\qquad (7)$$
for some set of distances \(\{{d}^{i,j}\}_{i,j}\). However, if \(\ell > 1\), choosing \({d}^{i_{1:n}}\) as in eq. (7) in Definition 5 does not lead to Definition 6, but rather to an upper bound of it. Indeed, with this choice we get
$$\Big ( \min _{{\varvec{r}}} \Big \langle \Big (\sum _{1 \le s < t \le n} {d}^{i_s,i_t}\Big )^{\ell } , {\varvec{r}} \Big \rangle \Big )^{1/\ell } \ \ge \ \Big ( \min _{{\varvec{r}}} \sum _{1 \le s < t \le n} \big \langle ({d}^{i_s,i_t})^{\ell } , {\varvec{r}}^{i_s,i_t} \big \rangle \Big )^{1/\ell } = {\mathcal {W}}^{i_{1:n}},$$
where the inequality follows from Hölder’s inequality.
It is easy to prove that if \(\{{d}^{i,j}\}_{i,j}\) is a metric (Definition 2), then the cost \({d}\) in eq. (7) is an \(n\)-metric (Definition 3). However, because of Theorem 2, we know that this is not sufficient to guarantee that the pairwise MMOT distance is an \(n\)-metric, which only makes the proof of the next theorem all the more interesting.
Theorem 3
If \({d}\) is a metric (Definition 2), then the pairwise MMOT distance (Definition 6) associated with \({d}\) is an \((n,C(n))\)-metric, with \(C(n) \ge 1\).
Proof
The proof is presented in Appendix 1. \(\square\)
We currently do not know the most general conditions under which Definition 5 is an \(n\)-metric. However, working with Definition 6 allows us to sharply bound the best possible C(n), which would likely not be possible in a general setting. As Theorem 4 shows, the best C(n) is \(C(n) = \Theta (n)\).
Theorem 4
In Theorem 3, the constant \(C(n)\) can be made larger than \((n-1)/5\) for \(n> 7\), and there exist sample spaces \({\Omega }^{1:n}\), mass functions \({\varvec{p}}^{1:n}\), and a metric d over \({\Omega }^{1:n}\) such that \(C(n) \le n-1\).
Proof
The proof is presented in Appendix 4. \(\square\)
Remark 8
Note that if \({\Omega }^{i} = {\Omega },\ \forall \ i\) and \({d}: {\Omega }\times \dots \times {\Omega }\mapsto \mathbb {R}\) is such that \({d}(w^{1:n}) = \min _{w \in {\Omega }}\sum _{s \in [n]} {d}^{1,2}(w^s,w)\), where \({d}^{1,2}\) is a metric, then \({d}\) is an \(n\)-metric (Kiss et al., 2018). One can then prove, see Carlier and Ekeland (2010), that Definition 5 with this cost is equivalent to \({{\mathcal {W}}^{}}({\varvec{p}}^{{1:n}}) =\min _{{\varvec{p}}} \sum _{s\in [n]}{{\mathcal {W}}^{}}({\varvec{p}}^{s},{\varvec{p}})\), which is also called the Wasserstein barycenter distance (WBD) (Agueh, 2011). The latter definition makes \({{\mathcal {W}}^{}}({\varvec{p}}^{{1:n}})\) a Fermat n-distance as defined in Kiss et al. (2018, Proposition 3.1), from which it follows immediately via this same proposition that it is an n-metric with \(C(n) = \Theta (n)\). The pairwise MMOT is not a Fermat distance, and Theorems 3 and 4 do not follow from Kiss et al. (2018). Hence, a novel proof strategy is required.
In the next section, we give a self-contained proof that the generalized triangle inequality holds with \(C(n) = 1\) for \(n= 3\) when \({d}\) is a metric. This proof contains the key ideas behind the proof of the triangle inequality in Theorems 3 and 4. Proving the generalized triangle inequality for a general \(n\), and a large \(C(n)\), is the hardest part of these theorems, compared to proving the other properties required of generalized metrics.
4 Proof of the generalized triangle inequality for \(n= 3\), \(\ell =1\), and \(C(n) = 1\)
We will prove that for any mass functions \({\varvec{p}}^{1},\dots ,{\varvec{p}}^{4}\) over \({\Omega }^{1},\dots ,{\Omega }^{4}\), respectively, if \({d}^{i,j}:{\Omega }^{i}\times {\Omega }^{j}\mapsto \mathbb {R}\) is a metric for any \(i,j\in \{1,\dots ,4\}, i \ne j\), then \({\mathcal {W}}^{1,2,3} \le {\mathcal {W}}^{1,2,4} + {\mathcal {W}}^{1,3,4} + {\mathcal {W}}^{2,3,4}\).
We write this inequality more succinctly as
$${\mathcal {W}}^{1,2,3} \le {\mathcal {W}}^{\backslash 1} + {\mathcal {W}}^{\backslash 2} + {\mathcal {W}}^{\backslash 3},\qquad (14)$$
using a symbol \({\mathcal {W}}^{\backslash r}\) whose meaning is obvious. We begin by expanding all the terms in (14),
$${\mathcal {W}}^{1,2,3} = \left\langle {d}^{1,2},{{{\varvec{p}}}^{*}}^{1,2} \right\rangle + \left\langle {d}^{1,3},{{{\varvec{p}}}^{*}}^{1,3} \right\rangle + \left\langle {d}^{2,3},{{{\varvec{p}}}^{*}}^{2,3} \right\rangle ,$$
and,
$${\mathcal {W}}^{\backslash r} = \sum _{s<t,\ s,t \ne r} \left\langle {d}^{s,t},{{{\varvec{p}}}^{*}}^{(r)^{s,t}} \right\rangle ,\quad r \in \{1,2,3\},$$
where \(\{{{{\varvec{p}}}^{*}}^{i,j}\}\) are the bivariate marginals of the optimal joint distribution \({{{\varvec{p}}}^{*}}^{1,2,3}\) for \({\mathcal {W}}^{1,2,3}\), and \(\{{{{\varvec{p}}}^{*}}^{(r)^{i,j}}\}\) are the bivariate marginals of the optimal joint distribution for \({\mathcal {W}}^{\backslash r}\).
Now we define the following probability mass function on \({\Omega }^{1}\times \dots \times {\Omega }^{4}\), namely \({\varvec{p}}^{1,2,3,4}\), such that
$${\varvec{p}}^{1,2,3,4}_{s_1,s_2,s_3,s_4} \triangleq \frac{ {{{\varvec{p}}}^{*}}^{(3)^{1,4}}_{s_1,s_4}\, {{{\varvec{p}}}^{*}}^{(1)^{2,4}}_{s_2,s_4}\, {{{\varvec{p}}}^{*}}^{(2)^{3,4}}_{s_3,s_4} }{ \big ({\varvec{p}}^{4}_{s_4}\big )^{2} }.\qquad (15)$$
Notice that this definition is such that its bivariate marginals involving \({\Omega }^{4}\) match the optimal bivariate marginals of the corresponding terms in the generalized triangle inequality. Recall that w.l.o.g. we assume that no element in \({\Omega }^{i}\) has zero mass, so the denominators are not zero. Since the bivariate probability mass functions \({\varvec{p}}^{1,2},{\varvec{p}}^{1,3}\), and \({\varvec{p}}^{2,3}\) of \({\varvec{p}}^{1,2,3,4}\) are feasible but sub-optimal choices of minimizers in Eq. (6) in Definition 6, we have that
$${\mathcal {W}}^{1,2,3} \le \left\langle {d}^{1,2},{\varvec{p}}^{1,2} \right\rangle + \left\langle {d}^{1,3},{\varvec{p}}^{1,3} \right\rangle + \left\langle {d}^{2,3},{\varvec{p}}^{2,3} \right\rangle .\qquad (16)$$
It is convenient to introduce the following more compact notation \(w_{i,j} = \left\langle {d}^{i,j},{\varvec{p}}^{i,j} \right\rangle\) and \(w^*_{i,j,(r)} = \left\langle {d}^{i,j},{{{\varvec{p}}}^{*}}^{(r)^{i,j}} \right\rangle\). Notice that, for any i, j, k and r, we have \(w_{i,j}\le w_{i,k} + w_{j,k}\) and \(w^*_{i,j,(r)}\le w^*_{i,k,(r)} + w^*_{j,k,(r)}\). This follows directly from the assumption that \(\{{d}^{i,j}\}\) are metrics. Let us prove that \(w_{1,2}\le w_{1,4} + w_{2,4}\):
$$w_{1,2} = \sum _{s_1,s_2,s_4} {d}^{1,2}_{s_1,s_2}\, {\varvec{p}}^{1,2,4}_{s_1,s_2,s_4} \le \sum _{s_1,s_2,s_4} \big ({d}^{1,4}_{s_1,s_4} + {d}^{4,2}_{s_4,s_2}\big )\, {\varvec{p}}^{1,2,4}_{s_1,s_2,s_4} = w_{1,4} + w_{2,4},\qquad (17)$$
where \({\varvec{p}}^{1,2,4}\) is the corresponding trivariate marginal of \({\varvec{p}}^{1,2,3,4}\).
Similarly, \(w_{1,3}\le w_{1,4} + w_{3,4}\), and \(w_{2,3}\le w_{2,4} + w_{3,4}\). Combining these inequalities and (16) we can write \({\mathcal {W}}^{1,2,3}\le w_{1,2}+ w_{1,3}+w_{2,3} \le (w_{1,4} + w_{2,4}) + (w_{1,4} + w_{3,4})+ (w_{2,4} + w_{3,4})\). Note that by construction, the bivariate marginals of \({\varvec{p}}^{1,2,3,4}\) in (15) satisfy \({\varvec{p}}^{1,4}={{{\varvec{p}}}^{*}}^{(3)^{1,4}}\), \({\varvec{p}}^{3,4}={{{\varvec{p}}}^{*}}^{(2)^{3,4}}\), and \({\varvec{p}}^{2,4}={{{\varvec{p}}}^{*}}^{(1)^{2,4}}\). Hence,
$${\mathcal {W}}^{1,2,3} \le \big (w^*_{1,4,(3)}\big ) + w^*_{1,4,(3)} + w^*_{2,4,(1)} + \big (w^*_{2,4,(1)}\big ) + \big (w^*_{3,4,(2)}\big ) + w^*_{3,4,(2)}.\qquad (18)$$
Using the new notation, we can re-write the r.h.s. of (14) as
$${\mathcal {W}}^{\backslash 1} + {\mathcal {W}}^{\backslash 2} + {\mathcal {W}}^{\backslash 3} = \big (w^*_{2,3,(1)} + w^*_{2,4,(1)} + w^*_{3,4,(1)}\big ) + \big (w^*_{1,3,(2)} + w^*_{1,4,(2)} + w^*_{3,4,(2)}\big ) + \big (w^*_{1,2,(3)} + w^*_{1,4,(3)} + w^*_{2,4,(3)}\big ).\qquad (19)$$
To finish the proof we show that the r.h.s. of (18) can be upper bounded by the r.h.s. of (19). We use the triangular inequality of \(w^*_{i,j,(k)}\) and apply it to the 1st, 4th, and 5th terms on the r.h.s. of (18) as specified by the parenthesis:
$${\mathcal {W}}^{1,2,3} \le \big (w^*_{1,2,(3)} + w^*_{2,4,(3)}\big ) + w^*_{1,4,(3)} + w^*_{2,4,(1)} + \big (w^*_{2,3,(1)} + w^*_{3,4,(1)}\big ) + \big (w^*_{1,3,(2)} + w^*_{1,4,(2)}\big ) + w^*_{3,4,(2)},$$
and observe that the terms in the r.h.s. of this last inequality are accounted for on the r.h.s. of (19).
Note that this last step, figuring out to which terms we should apply the triangular inequality property of \(w^*\) such that we can “cover” the r.h.s. of (18) with the r.h.s. of (19), is critical. Also, the fact that we want to prove the MMOT triangle inequality holds for \(C(n) = \Theta (n)\) makes the last step even harder. To address this, in the proofs of Theorems 3 and 4 (presented in the Appendix) we develop a general procedure and special hash functions to expand (using the triangle inequality) and to match terms.
5 Numerical experiments
We show how using a MMOT that defines an \(n\)-metric (\(n> 2\)) improves two tasks of clustering graphs compared to using a non-\(n\)-metric MMOT or an optimal transport (OT) that defines a 2-metric. One clustering task, Sect. 5.2, is on synthetic graphs. The other one, Sect. 5.3, is a real task of clustering chemical molecules.
5.1 Multi-distance based clustering
We use the same clustering strategy for both tasks. We cluster graphs by (i) computing their spectrum, (ii) treating each spectrum as a probability distribution, (iii) using the WD and three different MMOTs to compute distances among these distributions, and (iv) feeding these distances to different distance-based clustering algorithms to recover the true cluster memberships. We give details of each step in our procedure next.
To produce the spectra, we transform the set of graphs \(\{G^i\}_{i}\) into vectors \(\{v^i\}_{i}\) to be clustered. Each \(v^i\) is the (complex-valued) spectrum of a matrix \(M^i\) representing non-backtracking walks on \(G^i\), which approximates the length spectrum \(\mu ^i\) of \(G^i\) (Torres et al., 2019). Object \(\mu ^i\) uniquely identifies (the 2-core (Batagelj & Zaveršnik, 2011) of) \(G^i\) (up to an isomorphism) (Constantine & Lafont, 2019), but is too abstract to be used directly. Hence, we use its approximation \(v^i\). The lengths of \(v^i\) and \(v^j\) for equal-sized \(G^i\) and \(G^j\) can be different, depending on how we approximate \(\mu ^i\). This motivates the use of distance-based clustering and OT (multi) distances, since OT allows comparing objects of different lengths. Note that unlike the length spectrum, the classical spectrum of a graph (the eigenvalues of e.g. an adjacency matrix, Laplacian matrix, or random-walk matrix) has the advantage of having the same length for graphs with the same number of nodes. However, it does not uniquely identify a graph. For example, a star graph with 5 nodes and the graph that is the union of a square with an isolated node are co-spectral but are not isomorphic.
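As an illustration of this step, the sketch below computes the eigenvalues of the standard non-backtracking (directed-edge) matrix; Torres et al. (2019) use a particular matrix of this kind, and our sketch is only an illustrative stand-in for it:

```python
# A sketch of a non-backtracking spectrum: eigenvalues of the 2m x 2m
# directed-edge matrix B with B[(u,v),(x,y)] = 1 iff v == x and y != u.
import networkx as nx
import numpy as np

def nonbacktracking_spectrum(G):
    edges = list(G.edges()) + [(v, u) for u, v in G.edges()]
    index = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)))
    for (u, v) in edges:
        for w in G.neighbors(v):
            if w != u:  # forbid immediate backtracking
                B[index[(u, v)], index[(v, w)]] = 1.0
    return np.linalg.eigvals(B)  # complex-valued spectrum

spectrum = nonbacktracking_spectrum(nx.karate_club_graph())
```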
To produce the (hyper) graph of graphs’ spectra we need (hyper) edges and edge-weights. Each \(v^i\) from the previous step is interpreted as a uniform distribution \({\varvec{p}}^{i}\) over \({\Omega }^{i} = \{v^i_k\}_{k}\). To produce the graph of graphs’ spectra, we compute a sampled version \(\hat{T}^{\text {A}}\) of the matrix \(T^{\text {A}} = \{{\mathcal {W}}^{i,j}\}_{i,j}\), where \({\mathcal {W}}^{i,j}\) is the WD between \({\varvec{p}}^{i}\) and \({\varvec{p}}^{j}\) using a \({d}^{i,j}\) defined by \({{d}^{i,j}_{s,t}}\) \(= |v^i_s - v^j_t |\). We produce three different hyper-graphs of graphs’ spectra using different MMOTs. We compute a sampled version \(\hat{T}^{\text {B}}\) of the tensor \(T^{\text {B}} = \{{\mathcal {W}}^{i,j,k}\}_{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is our pairwise MMOT defined as in Definition 6 with \({d}^{i,j}\) defined as for \(T^{\text {A}}\). We also compute a sampled version \(\hat{T}^{\text {C}}\) of the tensor \(T^{\text {C}} = \{{\mathcal {W}}^{i,j,k}\}_{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is the barycenter MMOT defined as in Remark 8 with the \({d}^{i,j}\) as before. Finally, we compute a sampled version \(\hat{T}^{\text {D}}\) of the tensor \(T^{\text {D}}\) with \(T^{\text {D}}_{i,j,k} ={\mathcal {W}}^{i,j,k}\), where \({\mathcal {W}}^{i,j,k}\) is the non-n-metric defined in Theorem 2, but now considering points in the complex plane. The sampled tensors \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\), and \(\hat{T}^{\text {D}}\) are built by randomly selecting z triples (i, j, k), \(z=100\) in the synthetic experiments and \(z=600\) in the real experiments, and setting \(\hat{T}^{\text {B}}_{i,j,k} = {T}^{\text {B}}_{i,j,k}\), \(\hat{T}^{\text {C}}_{i,j,k} = {T}^{\text {C}}_{i,j,k}\), and \(\hat{T}^{\text {D}}_{i,j,k} = {T}^{\text {D}}_{i,j,k}\). The non-sampled triples are given a very large value. The sampled matrix \(\hat{T}^{\text {A}}\) is built by sampling \((3/2)\times z\) pairs (i, j) and setting \(\hat{T}^{\text {A}}_{i,j} = {T}^{\text {A}}_{i,j}\), and setting a large value for non-sampled pairs. All (multi) distances \({\mathcal {W}}^{i,j}\) and \({\mathcal {W}}^{i,j,k}\) amount to solving a linear program, for which we use CVX (Grant & Boyd, 2008, 2014).
To obtain clusters, we feed the weighted (hyper) graphs specified by \(\hat{T}^{\text {A}}\), \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\), and \(\hat{T}^{\text {D}}\), to different (hyper) graph-cut algorithms. We feed the matrix of distances \(\hat{T}^{\text {A}}\) to a spectral clustering algorithm (Shi & Malik, 2000) based on normalized random-walk Laplacians to produce one clustering solution, which we name \(\mathcal {C}^{\text {A}}\). We feed the multi-distances \(\hat{T}^{\text {B}}\), \(\hat{T}^{\text {C}}\) and \(\hat{T}^{\text {D}}\) to the two hypergraph-based clustering methods NH-Cut (Ghoshdastidar & Dukkipati, 2017a) and TTM (Ghoshdastidar & Dukkipati, 2015, 2017b). These produce different clustering solutions which we name \(\mathcal {C}^{\text {B1}}\) and \(\mathcal {C}^{\text {B2}}\), \(\mathcal {C}^{\text {C1}}\) and \(\mathcal {C}^{\text {C2}}\), and \(\mathcal {C}^{\text {D1}}\) and \(\mathcal {C}^{\text {D2}}\) respectively. Both NH-Cut and TTM find clusters by (approximately) computing minimum cuts of a hypergraph where each hyperedge’s weight is the MMOT distance among three graphs. Both NH-Cut and TTM require a threshold that is used to prune the hypergraph. Edges whose weight (multi-distance) is larger than a given threshold are removed. This threshold is tuned to minimize each clustering solution error. We code all clustering algorithms to output N clusters, the correct number of clusters.
To compute the quality of each clustering solution, we compute the fraction of misclassified graphs. In particular, if for each \(x \in \{\) A, B1, B2, C1, C2, D1, D2 \(\}\), we define \(\mathcal {C}^x(i) = k\) to mean that clustering solution \(\mathcal {C}^x\) assigns graph \(G^i\) to cluster k, then the error in solution \(\mathcal {C}^x\) is
$$\text {error}(\mathcal {C}^x) = \min _{\sigma } \frac{1}{\vert \{G^i\}_i \vert } \sum _{i} \mathbb {1}\big [\sigma (\mathcal {C}^x(i)) \ne c(i)\big ],$$
where \(c(i)\) is the true cluster of \(G^i\), N is the number of clusters, and the \(\min\) over all permutations \(\sigma\) of the elements \(\{1,\dots ,N\}\) is needed because the specific cluster IDs output have no real meaning. This value is computed 100 times (with all random numbers in our code being drawn independently among experiments) so that we can report a histogram of errors, or error bars, for each method.
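The minimization over label permutations can be computed without enumerating all N! permutations; a sketch, ours and using maximum-agreement matching, follows:

```python
# A sketch of the clustering error above: minimizing the error over label
# permutations is a maximum-agreement assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(pred, truth, N):
    agreement = np.zeros((N, N))   # agreement[a, b]: number of graphs in
    for p, t in zip(pred, truth):  # predicted cluster a with true label b
        agreement[p, t] += 1
    row, col = linear_sum_assignment(-agreement)  # maximize total agreement
    return 1.0 - agreement[row, col].sum() / len(pred)
```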
The complete code for our experiments can be found here https://github.com/bentoayr/MMOT.
5.2 Synthetic graphs dataset
We generate seven synthetic clusters of ten graphs each by including in each cluster multiple random perturbations – with independent edge addition/removal with \(p=0.05\) – of a complete graph, a complete bipartite graph, a cyclic chain, a k-dimensional cube, a K-hop lattice, a periodic 2D grid, or an Erdős-Rényi graph. A random class prediction has a 0.857 error rate. To recover the class of each graph, we follow the procedure described in Sect. 5.1. In Fig. 3 we show the distribution of clustering errors using different (multi-)distances and clustering algorithms over 100 independent experiments. The mean error rates are written in the legend box of each figure.
Figure 3-(left, center) show that, as expected, both TTM and NH-Cut work better when hyperedges are computed using an \(n\)-metric, and that pairwise MMOT works better than WBD.
Figure 3-(right) shows that clustering using only pairwise relationships among graphs leads to worse accuracy than using triple-wise relationships as in Fig. 3-(left, center). The results in Fig. 3-(right) are similarly bad to the results one obtains when using the MMOT explained in Remark 7, which does not give joint transports and is trivially an \(n\)-metric. Indeed, both TTM and NH-Cut are based on spectral clustering of the hypergraph weight tensor by first reducing it to a matrix via “summing out” one of the dimensions and then using regular spectral clustering techniques. At the same time, if a tensor \(T_{i,j,k}\) satisfies \(T_{i,j,k} = {\mathcal {W}}_{i,j} + {\mathcal {W}}_{j,k} + {\mathcal {W}}_{i,k}\) for some distances \({\mathcal {W}}_{i,j}\), the matrix obtained via such a reduction has spectral properties close to those of \({\mathcal {W}}_{i,j}\).
5.2.1 Injection of triangle inequality violations
To double-check that this difference in performance is due to the \(n\)-metric properties of pairwise MMOT and WBD, we perturb \({{\mathcal {W}}^{}}\) to introduce triangle inequality violations and measure the effect on clustering accuracy.
To introduce triangle inequality violations, we perturb the tensor \({{\mathcal {W}}^{i,j,k}}\) as follows. For each set of four different graphs (i, j, k, l), we find which among \({{\mathcal {W}}^{i,j,k}},{{\mathcal {W}}^{i,j,l}},{{\mathcal {W}}^{i,l,k}},{{\mathcal {W}}^{l,j,k}}\) we can change the least to produce a triangle inequality violation among these values. Let us assume that this value is \({{\mathcal {W}}^{i,j,k}}\), and that to violate \({{\mathcal {W}}^{i,j,k}} \le {{\mathcal {W}}^{i,j,l}} + {{\mathcal {W}}^{i,l,k}} + {{\mathcal {W}}^{l,j,k}}\), the quantity \({{\mathcal {W}}^{i,j,k}}\) needs to increase at least by \(\delta\), where \(\delta = {{\mathcal {W}}^{i,j,l}} + {{\mathcal {W}}^{i,l,k}} + {{\mathcal {W}}^{l,j,k}} - {{\mathcal {W}}^{i,j,k}}\). We then increase \({{\mathcal {W}}^{i,j,k}}\) by \(1.3\times \delta\). We repeat this procedure such that, in total, \(20\%\) of the entries of the tensor \({{\mathcal {W}}^{}}\) get changed.
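A sketch of this perturbation procedure follows; the dictionary layout and helper name are illustrative assumptions, not part of our released Matlab code:

```python
# A sketch of injecting triangle inequality violations into a 3-way distance
# tensor W, stored as a dict keyed by sorted triples (i, j, k) with i < j < k.
import itertools
import random

def inject_violations(W, n, fraction=0.2, strength=1.3, seed=0):
    W = dict(W)
    rng = random.Random(seed)
    quads = list(itertools.combinations(range(n), 4))
    rng.shuffle(quads)
    budget = int(fraction * n * (n - 1) * (n - 2) / 6)  # triples to perturb
    changed = set()
    for i, j, k, l in quads:
        if len(changed) >= budget:
            break
        triples = [(i, j, k), (i, j, l), (i, k, l), (j, k, l)]
        # delta: smallest increase that makes each triple violate its
        # generalized triangle inequality against the other three
        deltas = [sum(W[t] for t in triples if t != s) - W[s] for s in triples]
        m = min(range(4), key=lambda a: deltas[a])
        if triples[m] not in changed and deltas[m] > 0:
            W[triples[m]] += strength * deltas[m]  # 1.3 x delta in our setup
            changed.add(triples[m])
    return W
```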
Table 1 shows the effect of adding violations on the mean error rate for different MMOT distances. These violations clearly affect pairwise MMOT and barycenter MMOT (both \(n\)-metrics), but do not have an impact on non-\(n\)-metric distances.
5.2.2 Reproducibility
Our code is fully written in Matlab 2020a. It requires installing CVX, available in http://cvxr.com/cvx/download/.
To produce Fig. 3 open Matlab and run the file run_me_for_synthetic_experiments.m. To produce the numbers in the second row of Table 1, run the same file but with fraction_viol = 0.2; strength_of_viol = 0.3;. Note that the numbers in the first row of Table 1 are the mean values in Fig. 3. The call to run_me_for_synthetic_experiments.m takes 26 hours to complete using 12 cores, each core from an Intel(R) Xeon(R) Processor E5-2660 v4 (35M Cache, 2.00 GHz). Except for the Weiszfeld.m file, all of the code was written by us and is distributed under an MIT License. This license is described in the README.txt file at the root of our repository. The license for Weiszfeld.m is in the header of the file itself.
5.3 Molecular graphs dataset
This experiment is motivated by the important task in chemistry of clustering chemical compounds, represented as graphs, by their structure (McGregor & Pallai, 1997; Seeland et al., 2014; Wilkens et al., 2005). We use the molecular dataset in the supplementary material of Sutherland et al. (2003), which can be downloaded at https://pubs.acs.org/doi/abs/10.1021/ci034143r#_i21. It contains the adjacency matrices of graphs corresponding to five types of compounds: cyclooxygenase-2 inhibitors (467 graphs), benzodiazepine receptor ligands (405 graphs), estrogen receptor ligands (1009 graphs), dihydrofolate reductase inhibitors (756 graphs), and monoamine oxidase inhibitors (1641 graphs).
To build our clusters, we randomly select ten graphs of each type, and prune them so that they have no node with a degree smaller than 1. We note that, unlike for the synthetic data, the molecular graphs have weighted adjacency matrices, whose entries can be 0, 1, or 2. A random class prediction has a 0.8 error rate. To recover the class of each graph, we follow the procedure described in Sect. 5.1. We repeat this experiment 100 times, independently, to collect error statistics.
In Fig. 4 we show the distribution of clustering errors using different (multi)-distances and clustering algorithms.
Figure 4-(left, center) show that both TTM and NH-Cut work better when hyperedges are computed using \(n\)-metrics, and Fig. 4-(right) shows that clustering using pairwise relationships performs worse than using triple-wise relations. For the same reasons as explained in Sect. 5.2, the results in Fig. 4-(right) are similarly bad to the results one would obtain if we used the MMOT explained in Remark 7.
There is a starker difference between \(n\)-metrics and non-\(n\)-metrics than that seen in Fig. 4, which we now discuss.
The number of possible 3-sized hyperedges is cubic in the number of graphs being clustered. Thus, in our experiments we randomly sample z triples (i, j, k) and only for these do we create a hyperedge with weight \({{\mathcal {W}}^{i,j,k}}\).
Figure 5 shows the effect of z (x-axis) on performance. Comparing more graphs, i.e. increasing z, should improve clustering. However, for a non-\(n\)-metric, as z grows, triangle inequality violations can appear that introduce confusion: a graph can be “close” to two clusters that are far apart, confusing TTM and NH-Cut. This offsets the benefits of a high z and results in the flat curves in Fig. 5.
5.3.1 Reproducibility
To produce Fig. 4 open Matlab and run the file run_me_for_molecular_experiments.m. The call to run_me_for_molecular_experiments.m takes 31 hours to complete using 12 cores, each core from an Intel(R) Xeon(R) Processor E5-2660 v4 (35M Cache, 2.00 GHz). Except for the Weiszfeld.m file and the molecular dataset, all of the code was written by us and is distributed under an MIT License. This license is described in the README.txt file at the root of our repository. The license for Weiszfeld.m is in the header of the file itself. The license for the dataset is described in the README.txt file inside the dataset folder.
We contacted the authors (via email) about the use of their dataset and they informed us that there are no licenses attached to it, as long as we attribute it with a citation. This data contains no personally identifiable information nor offensive content.
6 Discussion and future work
In this paper, we have proved that for a general MMOT, the cost function being a generalized metric is not sufficient to guarantee that the MMOT defines a generalized metric. Nevertheless, we have shown that a new family of multi-distances that generalize optimal transport to multiple distributions, the family of pairwise multi-marginal optimal transports (pairwise MMOT), leads to a multi-distance that satisfies generalized metric properties. This opens the door to using pairwise MMOT in combination with several algorithms whose good performance depends on metric properties. In addition, we have established coefficients for the generalized triangle inequality associated with the pairwise MMOT, and proved that these coefficients cannot be improved, up to a linear factor.
Our results are for the pairwise MMOT family. In future work, we seek to find new sufficient conditions under which other variants of MMOT lead to generalized metrics, and, for certain families of MMOT, find necessary conditions for these same properties to hold.
Finally, in future work we will also study how the structure of optimal coupling among distributions induced by our MMOTs, i.e. the support of \({{\varvec{p}}}^{*}\), compares with those of other MMOTs, such as the Barycenter MMOT.
Data availability
All used data is available via the link below. https://github.com/bentoayr/MMOT.
Code availability
All used code is available via the link below. https://github.com/bentoayr/MMOT.
Notes
The asymptotic notation used in this paper is as follows: \(f(n)=\mathcal {O}(g(n))\) denotes an asymptotic upper bound on f(n), i.e., there exist \(M, n_0>0\) such that \(0 \le f(n) \le M g(n)\) for all \(n\ge n_0\); \(f(n)=\Omega (g(n))\) denotes an asymptotic lower bound on f(n), i.e., there exist \(m, n_0>0\) such that \(0 \le m g(n)\le f(n)\) for all \(n\ge n_0\); \(f(n)=\Theta (g(n))\) denotes an asymptotically tight bound on f(n), i.e., there exist \(m, M, n_0>0\), such that \(0 \le m g(n) \le f(n)\le M g(n)\) for all \(n\ge n_0\). This notation is standard and can be found, e.g., in Cormen et al. (2009, Chapter 3).
References
Abraham, I., Abraham, R., Bergounioux, M., & Carlier, G. (2017). Tomographic reconstruction from a few views: A multi-marginal optimal transport approach. Applied Mathematics & Optimization, 75(1), 55–73.
Ackermann, M. R., Blömer, J., & Sohler, C. (2010). Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms (TALG), 6(4), 1–26.
Agueh, M., et al. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2), 904–924.
Altschuler, J. M., & Boix-Adsera, E. (2020). Polynomial-time algorithms for multimarginal optimal transport problems with structure. arXiv e-prints
Altschuler, J. M., & Boix-Adsera, E. (2021). Hardness results for multimarginal optimal transport problems. Discrete Optimization, 42, 100669.
Ambrosio, L., & Gigli, N. (2013). A user’s guide to optimal transport (pp. 1–155). Springer
Angel, O., & Spinka, Y. (2019) Pairwise optimal coupling of multiple random variables. Preprint arXiv:1903.00632
Angiulli, F., & Pizzuti, C. (2002). Fast outlier detection in high dimensional spaces. In European Conf. on principles of data mining and knowledge discovery (pp. 15–27). Springer.
Arjovsky, M., et al. (2017) Wasserstein generative adversarial networks. In ICML.
Batagelj, V., & Zaveršnik, M. (2011). Fast algorithms for determining (generalized) core groups in social networks. Advance in Data Analysis and Classification, 5(2), 129–145.
Benamou, J.-D., Carlier, G., & Nenna, L. (2016) A Numerical Method to Solve Multi-marginal Optimal Transport Problems with Coulomb Cost, pp. 577–601. Springer.
Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), 1111–1138.
Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. In Proceedings of ICML (pp. 97–104).
Carlier, G., & Ekeland, I. (2010). Matching for teams. Economic theory, 42(2), 397–418.
Clarkson, K. L. (2006) Nearest-neighbor searching and metric space dimensions. Nearest-neighbor methods for learning and vision: Theory and practice, pp. 15–59.
Clarkson, K. L. (1999). Nearest neighbor queries in metric spaces. Discrete & Computational Geometry, 22(1), 63–93.
Constantine, D., & Lafont, J.-F. (2019). Marked length rigidity for one-dimensional spaces. Journal of Topology and Analysis, 11(03), 585–621.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms (3rd ed.). MIT Press.
Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., & Courty, N. (2018) Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European conference on computer vision (ECCV) (pp. 447–463).
Dwork, C., & Lei, J. (2009). Differential privacy and robust statistics. In Proceedings of ACM symposium on theory of computing (pp. 371–380).
Elvander, F., Haasler, I., Jakobsson, A., & Karlsson, J. (2020). Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion. Signal Processing, 171, 107474.
Fan, J., Haasler, I., Karlsson, J., & Chen, Y. (2022). On the complexity of the optimal transport problem with graph-structured cost. In International conference on artificial intelligence and statistics (pp. 9147–9165). PMLR.
Fan, H., Su, H., Guibas, L.J. (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 605–613).
Gerolin, A., Kausamo, A., & Rajala, T. (2019). Duality theory for multi-marginal optimal transport with repulsive costs in metric spaces. ESAIM: Control, Optimisation and Calculus of Variations, 25, 62.
Ghoshdastidar, D., & Dukkipati, A. (2015). A provable generalized tensor spectral method for uniform hypergraph partitioning. In ICML (pp. 400–409).
Ghoshdastidar, D., Dukkipati, A., et al. (2017). Consistency of spectral hypergraph partitioning under planted partition model. The Annals of Statistics, 45(1), 289–315.
Ghoshdastidar, D., & Dukkipati, A. (2017). Uniform hypergraph partitioning: Provable tensor methods and sampling techniques. The Journal of Machine Learning Research, 18(1), 1638–1678.
Grant, M., & Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, & H. Kimura (Eds.), Recent Advances in Learning and Control: Lecture Notes in Control and Information Sciences (pp. 95–110). Springer. http://stanford.edu/~boyd/graph_dcp.html
Grant, M., & Boyd, S. (2014). CVX: Matlab Software for Disciplined Convex Programming, version 2.1. http://cvxr.com/cvx
Haasler, I., Singh, R., Zhang, Q., Karlsson, J., & Chen, Y. (2021). Multi-marginal optimal transport and probabilistic graphical models. IEEE Transactions on Information Theory
Haasler, I., Ringh, A., Chen, Y., & Karlsson, J. (2021). Multimarginal optimal transport with a tree-structured cost and the schrodinger bridge problem. SIAM Journal on Control and Optimization, 59(4), 2428–2453.
Hartigan, J. A. (1975). Clustering algorithms. Wiley.
Indyk, P. (1999). Sublinear time algorithms for metric space problems. In Proceedings of ACM symposium on theory of computing (pp. 428–434).
Kantorovich, L. V. (1942). On the translocation of masses. Doklady Akademii Nauk, 37, 199–201.
Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4(4), 373–395.
Kiss, G., Marichal, J.-L., & Teheux, B. (2018). A generalization of the concept of distance based on the simplex inequality. Beiträge zur Algebra und Geometrie/Contributions to Algebra and Geometry, 59(2), 247–266.
Li, C. T., & Anantharam, V. (2019). Pairwise multi-marginal optimal transport and embedding for earth mover’s distance. Preprint arXiv:1908.01388.
Lin, T., Ho, N., Cuturi, M., & Jordan, M. I. (2022). On the complexity of approximating multimarginal optimal transport. Journal of Machine Learning Research, 23(65), 1–43.
McGregor, M. J., & Pallai, P. V. (1997). Clustering of large databases of compounds: Using the MDL “keys” as structural descriptors. Journal of Chemical Information and Computer Sciences, 37(3), 443–448.
Mémoli, F. (2011). Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4), 417–487.
Moameni, A., & Pass, B. (2017). Solutions to multi-marginal optimal transport problems concentrated on several graphs. ESAIM: Control, Optimisation and Calculus of Variations, 23(2), 551–567.
Pass, B. (2012). On the local structure of optimal measures in the multi-marginal optimal transportation problem. Calculus of Variations and Partial Differential Equations, 43(3–4), 529–536.
Pass, B. (2014). Multi-marginal optimal transport and multi-agent matching problems: Uniqueness and structure of solutions. Discrete & Continuous Dynamical Systems, 34(4), 1623.
Pass, B. (2015). Multi-marginal optimal transport: Theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6), 1771–1790.
Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5–6), 355–607.
Purkait, P., Chin, T.-J., Sadri, A., & Suter, D. (2016). Clustering with hypergraphs: The case for large hyperedges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9), 1697–1711.
Schmitz, M. A., Heitz, M., Bonneel, N., Ngole, F., Coeurjolly, D., Cuturi, M., et al. (2018). Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1), 643–678.
Seeland, M., Johannes, A. K., & Kramer, S. (2014). Structural clustering of millions of molecular graphs. In Proceedings of ACM symposium on applied computing.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Singh, K., & Upadhyaya, S. (2012). Outlier detection: Applications and techniques. International Journal of Computer Science Issues (IJCSI), 9(1), 307.
Solomon, J., et al. (2015). Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG), 34(4), 66.
Strössner, C., & Kressner, D. (2022). Low-rank tensor approximations for solving multi-marginal optimal transport problems. Preprint arXiv:2202.07340.
Sutherland, J. J., O’Brien, L. A., & Weaver, D. F. (2003). Spline-fitting with a genetic algorithm: A method for developing classification structure–activity relationships. Journal of Chemical Information and Computer Sciences, 43(6), 1906–1915.
Torres, L., Suárez-Serrato, P., & Eliassi-Rad, T. (2019). Non-backtracking cycles: Length spectrum theory and graph mining applications. Applied Network Science, 4(1), 41.
Wilkens, S. J., Janes, J., & Su, A. I. (2005). HierS: Hierarchical scaffold clustering using topological chemical graphs. Journal of Medicinal Chemistry, 48(9), 3182–3193.
Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. Y. (2003). Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems (pp. 521–528).
Funding
The authors gratefully acknowledge the support of the National Science Foundation (IIS-1741129) and the National Institutes of Health (Grant 1U01AI124302).
Contributions
JB, AS, and LM contributed equally to writing the manuscript, checking the correctness of the proofs, and running the numerical experiments. LM made the first pass on the writing of the paper. JB set the goals of the project.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Editors: Krzysztof Dembczynski and Emilie Devijver.
Appendices
Appendix 1: Details for proof of theorem 2
Proof
Note that Definition 3 supports using a different function \({d}^{i,j,k}\) for different product sample spaces \({\Omega }^{i}\times {\Omega }^{j}\times {\Omega }^{k}\). In the case of Theorem 2, however, we only use \({\Omega }\times {\Omega }\times {\Omega }\), so, when checking the \(n\)-metric properties, we can drop the upper indices in \({d}\) in Definition 3.
For simplicity, we will abuse notation and use \({d}(x,y,z)\) and \({{d}^{}_{i,j,k}}\) interchangeably, where i, j, and k are the indices of x, y, and z in \({\Omega }^{}\).
Given \(x,y,z,w \in {\Omega }^{}\), it is immediate to see that (i) \({d}(x,y,z) \ge 0\), (ii) \({d}(x,y,z)\) is permutation invariant, and that (iii) \({d}(x,y,z) = 0\) if and only if \(x=y=z\) (remember that there are no three collinear points in \({\Omega }^{}\)). It is also not hard to see that \({d}(x,y,z) \le {d}(x,y,w) + {d}(x,w,z) + {d}(w,y,z)\). To be specific, if \({d}(x,y,z) = 0\), then the inequality is obvious. If \({d}(x,y,z) = \gamma\), then without loss of generality we can assume that \(x = y \ne z\). In this case, if furthermore \(w = x\), then \({d}(x,w,z) = \gamma\), and the inequality follows. If \(w = z\), then \({d}(x,y,w) = \gamma\), and the inequality follows. If w is different from x, y, z, then \(\gamma \le {d}(x,w,z)\), and the inequality follows. If \({d}(x,y,z) > \gamma\), it must be that x, y and z are all different, in which case we need to consider two cases. If w is equal to one among x, y, z, say \(w=x\) without loss of generality, then \({d}(x,y,z) = {d}(y,z,w)\), and the inequality follows. If w is different from x, y, z, then we have \({d}(x,y,z) = {d}(x,y,w) + {d}(x,w,z) + {d}(w,y,z)\) if w is contained in the triangle formed by x, y, and z, and otherwise, we have \({d}(x,y,z) < {d}(x,y,w) + {d}(x,w,z) + {d}(w,y,z)\). In other words, \({d}\) is an \(n\)-metric (\(n= 3\)).
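This case analysis is easy to check exhaustively on a small example. The sketch below is an illustration of ours, not part of the proof: it places \({\Omega }\) on the unit circle, so that no three points are collinear, and takes \(\gamma\) to be the minimum triangle area over \({\Omega }\), consistent with the argument above.

```python
import itertools
import math
import random

random.seed(0)
# Omega: points on the unit circle, so no three points are collinear.
Omega = [(math.cos(t), math.sin(t)) for t in
         sorted(random.uniform(0.0, 2.0 * math.pi) for _ in range(8))]

def area(x, y, z):
    # Triangle area via the cross product of two edge vectors.
    return 0.5 * abs((y[0] - x[0]) * (z[1] - x[1]) - (z[0] - x[0]) * (y[1] - x[1]))

# gamma: any value in (0, min triangle area] works for the case analysis.
gamma = min(area(*tri) for tri in itertools.combinations(Omega, 3))

def d(x, y, z):
    k = len({x, y, z})  # number of distinct points among x, y, z
    if k == 1:
        return 0.0      # all three coincide
    if k == 2:
        return gamma    # exactly two coincide
    return area(x, y, z)

# Exhaustive check of the generalized triangle inequality for n = 3.
for x, y, z, w in itertools.product(Omega, repeat=4):
    assert d(x, y, z) <= d(x, y, w) + d(x, w, z) + d(w, y, z) + 1e-12
```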
Given a mass function \({\varvec{p}}^{i,j,k}\), the value \(\left\langle {{d}^{i,j,k}_{}},{\varvec{p}}^{i,j,k} \right\rangle\) represents the average area of the triangle whose three vertices are sampled from \({\varvec{p}}^{i,j,k}\). Computing the MMOT distance \({{\mathcal {W}}^{i,j,k}}\) for the mass functions \({\varvec{p}}^{i}\), \({\varvec{p}}^{j}\), \({\varvec{p}}^{k}\), amounts to finding the mass function \({{{\varvec{p}}}^{*}}^{i,j,k}\) with univariate marginals \({\varvec{p}}^{i}\), \({\varvec{p}}^{j}\), \({\varvec{p}}^{k}\) that minimizes this average area.
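With finite supports, this minimization is a linear program. The following is a minimal sketch with \(\ell = 1\) (the helper names and the toy instance are ours; any LP solver could stand in for scipy):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def area(x, y, z):
    return 0.5 * abs((y[0] - x[0]) * (z[1] - x[1]) - (z[0] - x[0]) * (y[1] - x[1]))

def mmot3(supports, marginals):
    """Minimize <d, p> over joint masses p with the given univariate marginals."""
    sizes = [len(s) for s in supports]
    cells = list(itertools.product(*map(range, sizes)))
    cost = [area(supports[0][s], supports[1][t], supports[2][u]) for s, t, u in cells]
    A_eq, b_eq = [], []
    for i in range(3):                      # one equality constraint per marginal entry
        for s in range(sizes[i]):
            A_eq.append([1.0 if c[i] == s else 0.0 for c in cells])
            b_eq.append(marginals[i][s])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Toy instance: three two-point marginals in the plane.
supports = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)], [(0.5, 2.0), (1.5, 2.0)]]
marginals = [[0.5, 0.5]] * 3
print(mmot3(supports, marginals))           # the optimal average triangle area
```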
Now consider \({\varvec{p}}^{1}\), \({\varvec{p}}^{2}\), \({\varvec{p}}^{3}\), and \({\varvec{p}}^{4}\) as depicted in Fig. 2-(left). The mass functions \({\varvec{p}}^{1}\) and \({\varvec{p}}^{2}\) assign probability one to the blue point and to the red point, respectively, and zero probability to every other point in \({\Omega }^{}\). The mass functions \({\varvec{p}}^{3}\) and \({\varvec{p}}^{4}\) assign equal probability to each one of the green points and orange points, respectively, and 0 probability to other points in \({\Omega }\).
Now we compute the distances \({{\mathcal {W}}^{1,2,3}}\), \({{\mathcal {W}}^{1,2,4}}\), \({{\mathcal {W}}^{1,3,4}}\), and \({{\mathcal {W}}^{2,3,4}}\). The MMOT distance \({{\mathcal {W}}^{1,2,3}}\) is equal to the average of the areas of the two shaded triangles in Fig. 6-(left), which is \({{\mathcal {W}}^{1,2,3}} = 0.5\times (0.5) + 0.5\times (0.5) = 0.5\). The distance \({{\mathcal {W}}^{1,2,4}}\) is equal to the average of the areas of the two shaded triangles in Fig. 6-(right), which is \({{\mathcal {W}}^{1,2,4}} = 0.5\times (0.5 \epsilon ) + 0.5\times (0.25 - 0.5\epsilon ) = 0.125\).
The MMOT distances \({{\mathcal {W}}^{1,3,4}}\) and \({{\mathcal {W}}^{2,3,4}}\) are the same by symmetry. We focus on the computation of \({{\mathcal {W}}^{2,3,4}}\). Since both \({\varvec{p}}^{4}\) and \({\varvec{p}}^{3}\) are uniform over their respective supports, it must be the case that \({{{\varvec{p}}}^{*}}^{2,3,4}\) – the optimal joint distribution in the computation of \({{\mathcal {W}}^{2,3,4}}\) – has a bivariate marginal \({{{\varvec{p}}}^{*}}^{3,4}\) of the form
where \(\alpha \in \left[ 0, \frac{1}{2}\right]\). Therefore, the distance \({{\mathcal {W}}^{2,3,4}}\) is equal to the weighted average of the areas of the four shaded triangles in Fig. 7, where we split the four triangles into two different drawings for clarity’s sake.
In other words,
where we are using the fact that \(\min _{\alpha \in [0,\frac{1}{2}]}(\text {linear function of } \alpha )\) is attained at one of the extremes \(\alpha = 0\) or \(\alpha = \frac{1}{2}\). Finally, it is easy to observe that
\(\square\)
Appendix 2: useful lemmas
Lemma 2
Let \({\varvec{p}}\) be as in Definition 1 eq. (3) for some \({\varvec{q}}^{k}\) and \(\{{\varvec{q}}^{i\mid k}\}_{i\in [n]\backslash k}\). Let \({\varvec{p}}^{i}\) and \({\varvec{p}}^{i,k}, i \ne k,\) be the marginals of \({\varvec{p}}\) over \({\Omega }^{i}\) and \({\Omega }^{i} \times {\Omega }^{k}\), respectively. Let \({\varvec{q}}^{i,k} = {\varvec{q}}^{i\mid k}{\varvec{q}}^{k}\), \(i \ne k\), and let \({\varvec{q}}^{i}\) be its marginals over \({\Omega }^{i}\). We have that \({\varvec{p}}^{i} = {\varvec{q}}^{i} \; \forall i,\) and \({\varvec{p}}^{i,k} = {\varvec{q}}^{i,k} \; \forall i \ne k\).
Proof
Think of \({\varvec{p}}\) as describing \(n\) discrete random variables (r.v.’s). It follows from the factorization in (3) that, conditioned on the kth r.v., the other r.v.’s are independent. The result follows. \(\square\)
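The lemma is easy to verify numerically. In the sketch below (sizes, seed, and the conditioning index are arbitrary choices of ours), \({\varvec{p}}\) is built explicitly from the factorization in (3) and its marginals are compared against \({\varvec{q}}^{i,k} = {\varvec{q}}^{i\mid k}{\varvec{q}}^{k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes, k = [3, 4, 2, 3], 1     # four finite sample spaces; condition on the k-th (0-based)
n = len(sizes)

q_k = rng.dirichlet(np.ones(sizes[k]))                           # q^k
q_cond = {i: rng.dirichlet(np.ones(sizes[i]), size=sizes[k]).T   # q^{i|k}: column t is a
          for i in range(n) if i != k}                           # distribution given Omega^k_t

# p(x_1, ..., x_n) = q^k(x_k) * prod_{i != k} q^{i|k}(x_i | x_k), as in eq. (3).
p = np.zeros(sizes)
for idx in np.ndindex(*p.shape):
    p[idx] = q_k[idx[k]] * np.prod([q_cond[i][idx[i], idx[k]]
                                    for i in range(n) if i != k])

for i in range(n):
    if i == k:
        continue
    other = tuple(a for a in range(n) if a not in (i, k))
    p_ik = p.sum(axis=other)                  # marginal of p over Omega^i x Omega^k
    if i > k:
        p_ik = p_ik.T                         # reorder axes to (i, k)
    assert np.allclose(p_ik, q_cond[i] * q_k)              # p^{i,k} = q^{i|k} q^k
    assert np.allclose(p_ik.sum(axis=1), q_cond[i] @ q_k)  # p^i    = q^i
```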
Lemma 3
Let \({d}\) be a metric and \({\varvec{p}}\) a mass over \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\). Let \({\varvec{p}}^{i,j}\) be the marginal of \({\varvec{p}}\) over \({\Omega }^{i}\times {\Omega }^{j}\). Define \(w_{i,j} = \left\langle {d}^{i,j},{\varvec{p}}^{i,j} \right\rangle _{\ell }^{\frac{1}{\ell }}\). For any \(i,j,k\in [n]\) and \(\ell \in \mathbb {N}\) we have that \(w_{i,j} \le w_{i,k} + w_{k,j}\).
Proof
Let \({\varvec{p}}^{i,j,k}\) be the marginal of \({\varvec{p}}\) over \({\Omega }^{i} \times {\Omega }^{j}\times {\Omega }^{k}\). Write
Use Minkowski’s inequality in an \(L_{\ell }\) space with measure \({\varvec{p}}^{i,j,k}\) to bound this by
\(\square\)
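The inequality can also be probed numerically. The toy setup below is ours: supports on the real line with \(d(x,y) = \vert x-y\vert\), and \(\left\langle {d},{\varvec{p}} \right\rangle _{\ell }\) taken to mean \(\sum {d}^{\ell }\,{\varvec{p}}\), consistent with the exponent \(\frac{1}{\ell }\) in the statement:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes, ell = (3, 4, 5), 2
pts = [rng.normal(size=m) for m in sizes]   # supports embedded in R, d = |x - y|
p = rng.random(sizes)
p /= p.sum()                                # an arbitrary joint mass over the product space

def w(a, b):
    """w_{a,b} = <(d^{a,b})^ell, p^{a,b}>^{1/ell}, from the bivariate marginal of p."""
    m = p.sum(axis=tuple(c for c in range(3) if c not in (a, b)))
    if a > b:
        m = m.T                             # reorder axes to (a, b)
    dist = np.abs(pts[a][:, None] - pts[b][None, :])
    return float(((dist ** ell) * m).sum() ** (1.0 / ell))

# Minkowski's inequality in L_ell(p^{i,j,k}) yields the triangle inequality:
assert w(0, 1) <= w(0, 2) + w(2, 1) + 1e-12
```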
Appendix 3: proof of theorem 3
We will need the following hash function in this proof, as well as the useful lemmas in Appendix 2.
1.1 Special hash function
Definition 7
The map \({\mathcal {H}}^{n}\) transforms a tuple \((i,j), 1 \le i < j\le n\), into either 2, 3 or 4 triples according to
\({\mathcal {H}}^{n}(i,j) = {\mathcal {H}}^{n}_{1}(i,j) \oplus {\mathcal {H}}^{n}_{2}(i,j),\)
where two tuples (respectively triples) are assumed duplicates iff all of their components agree and
\({\mathcal {H}}^{n}_{1}(i,j) = \{(i,n+1,h(i))\}\) if \(h(i) = j\), and \({\mathcal {H}}^{n}_{1}(i,j) = \{(i,j,h(i)), (j,n+1,h(i))\}\) otherwise,
and
\({\mathcal {H}}^{n}_{2}(i,j) = \{(j,n+1,h(j))\}\) if \(h(j) = i\), and \({\mathcal {H}}^{n}_{2}(i,j) = \{(i,j,h(j)), (i,n+1,h(j))\}\) otherwise.
\(h(\cdot )\) is also a function of \(n\) but for simplicity we omit it in the notation. \(h(\cdot )\) is defined as
\(h(i) = 1 + \left( (i-2) \text { mod } n\right) , \qquad (22)\)
i.e., \(h(1) = n\) and \(h(i) = i-1\) for \(2 \le i \le n\).
In what follows, the symbol \(\oplus\) denotes a list join operation with no duplicate removal, e.g. \(\{x,y\}\oplus \{x,z\}=\{x,y,x,z\}\).
Lemma 4
Let \((a,b,c) \in {\mathcal {H}}^{n}(i,j)\) for \(1 \le i < j \le n\). Then, \(1 \le a < b \le n+1\), \(1 \le c \le n\), and \(c \notin \{a,b\}\). Furthermore, the set
\(\oplus _{1 \le i < j \le n}\, {\mathcal {H}}^{n}(i,j) \qquad (23)\)
has no duplicates.
Proof
The fact that \(1\le a < b \le n+1\) and that \(1 \le c \le n\) is immediate. To see that \(c\notin \{a,b\}\), we just need to notice that \(h(i) \notin \{i,n+1\}\) for \(i\in [n]\). The fact that \(h(i) \ne n+ 1\) follows from the range of \(h\) being \([n]\). If we had \(h(i) = i\), then we would have \((i-2) \text { mod } n= i-1\), which is not possible. To see that (23) does not have duplicates, we need to check that, starting from two different tuples, the different expressions that define the triples that go into (23) can never be equal.
Given \(1 \le i< j \le n, 1 \le i' < j' \le n, (i,j) \ne (i',j')\) we will show that
-
1.
\({\mathcal {H}}^{n}(i,j)\) does not have duplicates;
-
2.
\({\mathcal {H}}^{n}(i,j)\) and \({\mathcal {H}}^{n}(i',j')\) do not have overlaps, that is, \({\mathcal {H}}^{n}_{1}(i,j)\), \({\mathcal {H}}^{n}_{2}(i,j)\), \({\mathcal {H}}^{n}_{1}(i',j')\), and \({\mathcal {H}}^{n}_{2}(i',j')\) do not have overlaps with each other.
It is obvious that \({\mathcal {H}}^{n}_{1}(i,j)\) does not have duplicates, and neither does \({\mathcal {H}}^{n}_{2}(i,j)\), according to their definitions. Because \(i \ne j\), it is also trivial to show that \({\mathcal {H}}^{n}_{1}(i,j)\) and \({\mathcal {H}}^{n}_{2}(i,j)\) do not have overlaps, based on their definitions. Therefore, the first claim is indeed true. The burden now is to verify the second claim.
We show that the four sets have no overlaps with each other. We show this two sets at a time; there are in total 6 pairs to consider. As an immediate result of the discussion in the first claim, the following four combinations do not have overlaps: \({\mathcal {H}}^{n}_{1}(i,j)\) vs. \({\mathcal {H}}^{n}_{2}(i,j)\), \({\mathcal {H}}^{n}_{1}(i',j')\) vs. \({\mathcal {H}}^{n}_{2}(i',j')\), \({\mathcal {H}}^{n}_{1}(i,j)\) vs. \({\mathcal {H}}^{n}_{1}(i',j')\), \({\mathcal {H}}^{n}_{2}(i,j)\) vs. \({\mathcal {H}}^{n}_{2}(i',j')\). The two combinations left are \({\mathcal {H}}^{n}_{1}(i,j)\) vs \({\mathcal {H}}^{n}_{2}(i',j')\) and \({\mathcal {H}}^{n}_{1}(i',j')\) vs \({\mathcal {H}}^{n}_{2}(i,j)\). We notice that they are symmetric and, because the choice of the tuples \((i,j), (i', j')\) is arbitrary, we only need to show that \({\mathcal {H}}^{n}_{1}(i,j)\) and \({\mathcal {H}}^{n}_{2}(i',j')\) do not have overlaps, given \((i,j) \ne (i', j')\).
\({\mathcal {H}}^{n}_{1}(i,j)\) and \({\mathcal {H}}^{n}_{2}(i',j')\) each have two possibilities for the form of their output. Thus, together, there are four possibilities to consider. None of them have an overlap, which we show by contradiction.
-
1.
\({\mathcal {H}}^{n}_{1}(i,j) = \{(i,n+1,h(i))\}\) and \({\mathcal {H}}^{n}_{2}(i',j') = \{(j',n+1, h(j'))\}\). If these single-element sets had an overlap, that would imply \(i=j'\); but, according to the definition, \(i=1\) and \(i'=j'-1\), which implies \(j'>1\), a contradiction.
-
2.
\({\mathcal {H}}^{n}_{1}(i,j) = \{(i,n+1,h(i))\}\) and \({\mathcal {H}}^{n}_{2}(i',j') = \{(i',j',h(j')), (i',n+1,h(j'))\}\). For them to have an overlap, we would need \(h(i) = h(j')\). That requires \(i = j'\), which is contradictory to having \(i=1\) and \(i' < j' - 1\) at the same time.
-
3.
\({\mathcal {H}}^{n}_{1}(i,j) = \{(i,j,h(i)),(j,n+1,h(i))\}\) and \({\mathcal {H}}^{n}_{2}(i',j') = \{(i',j',h(j')), (i',n+1,h(j'))\}\). For the first triples to be equal, we would need \(i=i'\), \(j=j'\), and \(i=j'\), which is contradictory to \(i' < j'-1\). For the second triples to be equal, we would need \(j=i'\) and \(i=j'\), which is contradictory to \(i < j\) or \(i' < j'\). Because of the presence of “\(n+1\)”, the components at different positions cannot collide.
-
4.
\({\mathcal {H}}^{n}_{1}(i,j) = \{(i,j,h(i)),(j,n+1,h(i))\}\) and \({\mathcal {H}}^{n}_{2}(i',j') = \{(j',n+1, h(j'))\}\). This implies \(j'=j\) and \(j'=i\), which is contradictory to \(i < j\).
\(\square\)
For example, if \(n= 3\), then the possible tuples (1, 2), (1, 3), and (2, 3) get mapped respectively to \(\{(1, 2, 3), (2, 4, 3), (2, 4, 1)\}\), \(\{(1, 4, 3), (1, 3, 2), (1, 4, 2)\}\), and \(\{(2, 3, 1), (3, 4, 1), (3, 4, 2)\}\), all of which are different and satisfy the claims in Lemma 4.
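For reference, the following sketch implements \(h\) and \({\mathcal {H}}^{n}\) as defined above (the function names are ours) and checks the claims of Lemma 4 exhaustively for small \(n\):

```python
import itertools

def h(i, n):
    # h(i) = 1 + ((i - 2) mod n): h(1) = n and h(i) = i - 1 otherwise,
    # so that h(i) is never in {i, n + 1}.
    return 1 + ((i - 2) % n)

def H(i, j, n):
    H1 = ([(i, n + 1, h(i, n))] if h(i, n) == j
          else [(i, j, h(i, n)), (j, n + 1, h(i, n))])
    H2 = ([(j, n + 1, h(j, n))] if h(j, n) == i
          else [(i, j, h(j, n)), (i, n + 1, h(j, n))])
    return H1 + H2   # list join keeping duplicates, like the join above

for n in range(3, 10):
    pooled = [t for i, j in itertools.combinations(range(1, n + 1), 2)
              for t in H(i, j, n)]
    assert len(pooled) == len(set(pooled))   # Lemma 4: no duplicates
    assert all(1 <= a < b <= n + 1 and 1 <= c <= n and c not in (a, b)
               for a, b, c in pooled)
```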
We now prove the four metric properties in order. The first three follow readily from the definition of our distance function for the transport problem; the generalized triangle inequality requires a detailed proof.
1.2 Non-negativity
Proof
The non-negativity of \({d}^{i,j}\) and \({\varvec{r}}^{i,j}\) implies that \(\left\langle {d}^{i,j},{\varvec{r}}^{i,j} \right\rangle _{\ell } \ge 0\), and hence that \({\mathcal {W}}\ge 0\). \(\square\)
1.3 Symmetry
Proof
Recall that the computation of \({\mathcal {W}}({\varvec{p}}^{i_{1:n}})\) involves a set of distances \(\{{{d}}^{a,b}\}_{a,b}\). Consider a generic permutation map \(\sigma\), and let \(\sigma ^{-1}\) be its inverse; both apply component-wise to their arguments. The computation of \({\mathcal {W}}({\varvec{p}}^{\sigma (i_{1:n})})\) involves a set of distances \(\{\tilde{{d}}^{a,b}\}_{a,b}\) that satisfy \(\tilde{{d}}^{i,j} = {{d}}^{\sigma ^{-1}(i,j)}\). Therefore, each term \(\left\langle \tilde{{d}}^{i,j},{\varvec{r}}^{i,j} \right\rangle _{\ell }\) involved in the computation of \({\mathcal {W}}({\varvec{p}}^{\sigma (i_{1 : n})})\) can be rewritten as \(\left\langle {{d}}^{\sigma ^{-1}(i,j)},{\varvec{r}}^{i,j} \right\rangle _{\ell }\), and a simple reindexing of the summation \(\sum _{i<j}\) allows us to write this as \(\left\langle {{d}}^{i,j},{\varvec{r}}^{\sigma (i,j)} \right\rangle _{\ell }\). Since the mass function \({\varvec{r}}\) is supported on the sample space \({\Omega }^{\sigma (i_1)}\times \cdots \times {\Omega }^{\sigma (i_{n})}\), the marginal \({\varvec{r}}^{\sigma (i,j)}\) can be seen as the marginal \({\varvec{q}}^{i,j}\) of a mass function \({\varvec{q}}\) with support \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_{n}}\). Therefore, minimizing \(\sum _{i<j} (\left\langle \tilde{{d}}^{i,j},{\varvec{r}}^{i,j} \right\rangle _{\ell })^{1/\ell }\) over \({\varvec{r}}\) on \({\Omega }^{\sigma (i_1)}\times \cdots \times {\Omega }^{\sigma (i_{n})}\) is the same as minimizing \(\sum _{i<j} (\left\langle {{d}}^{i,j},{\varvec{q}}^{i,j} \right\rangle _{\ell })^{1/\ell }\) over \({\varvec{q}}\) on \({\Omega }^{i_1}\times \cdots \times {\Omega }^{i_{n}}\). \(\square\)
1.4 Identity
Proof
We prove each direction of the equivalence separately. Recall that \(\{{\varvec{p}}^{i}\}\) are given, they are the masses for which we want to compute the pairwise MMOT.
“\(\Longleftarrow\)”: If for each \(i,j\in [n]\) we have \({\Omega }^{i} = {\Omega }^{j}\), then \({\vert \Omega ^{i}\vert } = {\vert \Omega ^{j}\vert }\), and there exists a bijection \(b^{i,j}(\cdot )\) from \([{\vert \Omega ^{i}\vert }]\) to \([{\vert \Omega ^{j}\vert }]\) such that \({\Omega }^{i}_{s} = {\Omega }^{j}_{b^{i,j}(s)}\) for all s. If furthermore \({\varvec{p}}^{i} = {\varvec{p}}^{j}\), we can define an \({\varvec{r}}\) over \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\) such that its univariate marginal over \({\Omega }^{i}\), \({\varvec{r}}^{i}\), satisfies \({\varvec{r}}^{i} = {\varvec{p}}^{i}\), and such that its bivariate marginal over \({\Omega }^{i}\times {\Omega }^{j}\), \({\varvec{r}}^{i,j}\), satisfies \({\varvec{r}}^{i,j}_{s,t} = {\varvec{p}}^{i}_{s}\), if \(t = b^{i,j}(s)\), and zero otherwise. Such an \({\varvec{r}}\) achieves an objective value of 0 in (6), the smallest value possible by the first metric property (already proved). Therefore, \({\mathcal {W}}^{1,\ldots ,n} = 0\).
“\(\Longrightarrow\)”: Now let \({{\varvec{r}}}^{*}\) be a minimizer of (6) for \({\mathcal {W}}^{1,\ldots ,n}\). Let \(\{{{{\varvec{r}}}^{*}}^{i}\}\) and \(\{{{{\varvec{r}}}^{*}}^{i,j}\}\) be its univariate and bivariate marginals respectively. If \({\mathcal {W}}^{1,\ldots ,n} = 0\) then \(\left\langle {d}^{i,j},{{{\varvec{r}}}^{*}}^{i,j} \right\rangle _{\ell } = 0\) for all i, j. Let us consider a specific pair i, j, and, without loss of generality, let us assume that \({\vert \Omega ^{i}\vert } \le {\vert \Omega ^{j}\vert }\). Since, by assumption, we have that \({{{\varvec{r}}}^{*}}^{i}_{s} = {\varvec{p}}^{i}_{s} > 0\) for all \(s\in [{\vert \Omega ^{i}\vert }]\), and \({{{\varvec{r}}}^{*}}^{j}_{s} = {\varvec{p}}^{j}_{s} > 0\) for all \(s\in [{\vert \Omega ^{j}\vert }]\), there exists an injection \(b^{i,j}(\cdot )\) from \([{\vert \Omega ^{i}\vert }]\) to \([{\vert \Omega ^{j}\vert }]\) such that \({{{\varvec{r}}}^{*}}^{i,j}_{s,b^{i,j}(s)} > 0\) for all \(s\in [{\vert \Omega ^{i}\vert }]\). Therefore, \(\left\langle {d}^{i,j},{{{\varvec{r}}}^{*}}^{i,j} \right\rangle _{\ell } = 0\) implies that \({{d}^{i,j}_{s,b^{i,j}(s)}} = 0\) for all \(s\in [{\vert \Omega ^{i}\vert }]\). Therefore, since \({d}\) is a metric, it must be that \({\Omega }^{i}_{s} = {\Omega }^{j}_{b^{i,j}(s)}\) for all \(s\in [{\vert \Omega ^{i}\vert }]\). Now let us suppose that there exists an \(r\in [{\vert \Omega ^{j}\vert }]\) that is not in the range of \(b^{i,j}\). Since, by assumption, all of the elements of the sample spaces are different, it must be that \({{d}^{i,j}_{s,r}} > 0\) for all \(s\in [{\vert \Omega ^{i}\vert }]\). Therefore, since \(\left\langle {d}^{i,j},{{{\varvec{r}}}^{*}}^{i,j} \right\rangle _{\ell } = 0\), it must be that \({{{\varvec{r}}}^{*}}^{i,j}_{s,r} = 0\) for all \(s\in [{\vert \Omega ^{i}\vert }]\). This contradicts the fact that \(\sum _{s\in [{\vert \Omega ^{i}\vert }]} {{{\varvec{r}}}^{*}}^{i,j}_{s,r} = {{{\varvec{r}}}^{*}}^{j}_{r} = {\varvec{p}}^{j}_{r}> 0\) (the last inequality being true by assumption). Therefore, \({\vert \Omega ^{i}\vert } = {\vert \Omega ^{j}\vert }\), and the existence of \(b^{i,j}\) proves that \({\Omega }^{i} = {\Omega }^{j}\). At the same time, since \({{d}^{i,j}_{s,t}} > 0\) for all \(t \ne b^{i,j}(s)\), it must be that \({{{\varvec{r}}}^{*}}^{i,j}_{s,t} = 0\) for all \(t \ne b^{i,j}(s)\). Therefore, \({\varvec{p}}^{i}_{s} = {\varvec{p}}^{j}_{b^{i,j}(s)}\) for all s, i.e. \({\varvec{p}}^{i} = {\varvec{p}}^{j}\). \(\square\)
1.5 Generalized triangle inequality
Proof
Let \({{\varvec{p}}}^{*}\) be a minimizer for (the optimization problem associated with) \({\mathcal {W}}^{1,\ldots ,n}\), and let \({{{\varvec{p}}}^{*}}^{i,j}\) be the marginal induced by \({{\varvec{p}}}^{*}\) for the sample space \({\Omega }^{i}\times {\Omega }^{j}\). We would normally use \({{\varvec{r}}}^{*}\) for this minimizer, but, to avoid confusion between \({\varvec{r}}\) and \(r\), we avoid doing so. We can write that
For \(r\in [n]\), let \({\varvec{p}}^{(*r)}\) be a minimizer for \({\mathcal {W}}^{1,\ldots ,r-1,r+1,\ldots ,n+1}\); for the same reason as above, we do not use \({\varvec{r}}^{(*r)}\) for it. For \(i,j\in [n+1]\backslash \{r\}\), let \({{\varvec{p}}^{(*r)}}^{i,j}\) be the marginal of \({\varvec{p}}^{(*r)}\) for the sample space \({\Omega }^{i}\times {\Omega }^{j}\). Recall that since \({\varvec{p}}^{(*r)}\) satisfies the constraints in (6), its marginal for the sample space \({\Omega }^{i}\) is \({\varvec{p}}^{i}\), which is given in advance.
Let \(h(\cdot )\) be the map defined as (22). Define the following mass function for \({\Omega }^{1}\times \cdots \times {\Omega }^{n+1}\),
where \({{\varvec{p}}^{(*h(i))}}^{i \mid n+1}\) is defined as the mass function that satisfies \({{\varvec{p}}^{(*h(i))}}^{i \mid n+1} {{{\varvec{p}}}^{*}}^{n+1} = {{\varvec{p}}^{(*h(i))}}^{i, n+1}\). Notice that since \(h(i) \notin \{i,n+1\}\), the probability \({{\varvec{p}}^{(*h(i))}}^{i, n+1}\) exists for all \(i \in [n]\).
Let \({\varvec{q}}^{1,\ldots ,n}\) be the marginal of \({\varvec{q}}\) for sample space \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\), and \({\varvec{q}}^{i,j}\) be the marginal of \({\varvec{q}}\) for \({\Omega }^{i}\times {\Omega }^{j}\).
By Lemma 2, we know that the ith univariate marginal of \({\varvec{q}}\) is \({\varvec{p}}^{i}\) (given) and hence \({\varvec{q}}^{1,\ldots ,n}\) satisfies the constraints associated with \({\mathcal {W}}^{1,\ldots ,n}\). Therefore, we can write that
By Lemma 3, inequality (a) below holds; because \({d}\) is symmetric, (b) below holds; by the definition of \({\varvec{q}}\), (c) below follows. Therefore,
Let \(w_{(i,j)}\) denote each term on the r.h.s. of (24), and \(w_{(i,j,r)}\) denote \(\left\langle {d}^{i,j},{{\varvec{p}}^{(*r)}}^{i, j} \right\rangle _{\ell }^{\frac{1}{\ell }}\). Combining (26)–(28), we have
Finally, we write
and show that (30) upper-bounds the r.h.s of (29).
First, by Lemma 3 and the symmetry of d, we have
and,
as long as, for each triple \((a, b, c)\) in the above expressions, \(c \notin \{a,b\}\). We will use these inequalities to upper bound some of the terms on the r.h.s. of (29), and the result can be further upper bounded by (30). In particular, we will apply inequalities (31) and (32) such that the terms \(w_{(a,b,c)}\) that we get after their use have triples \((a, b, c)\) that match the triples obtained via the map \({\mathcal {H}}^{n}\) defined in Definition 7. For example, if \({\mathcal {H}}^{n}\) maps \((i, j)\) to \(\{(i,n+1,h(i)),(j,n+1,h(j))\}\), then we do not apply (31) and (32), and we leave \(w_{(i,n+1,h(i))} + w_{(j,n+1,h(j))}\) as is on the r.h.s. of (29). If, for example, \({\mathcal {H}}^{n}\) maps \((i, j)\) to \(\{(i,n+1,h(i)),(i,j,h(j)),(i,n+1,h(j))\}\), then we leave the first term in \(w_{(i,n+1,h(i))} + w_{(j,n+1,h(j))}\) on the r.h.s. of (29) untouched, but we upper bound the second term using (32) to get \(w_{(i,n+1,h(i))} + w_{(i,j,h(j))} + w_{(i,n+1,h(j))}\).
After proceeding in this fashion, and by Lemma 4, we know that all of the terms \(w_{(a,b,c)}\) that we obtain have triples \((a, b, c)\) with \(c\notin \{a,b\}\), with \(c\in [n]\), and \(1\le a < b \le n+1\). Therefore, these terms appear in (30). Also by Lemma 4, we know that we do not get each triple more than once. Therefore, the upper bound that we just constructed with the help of \({\mathcal {H}}^{n}\) for the r.h.s of (29) can be upper bounded by (30). \(\square\)
Appendix 4: proof of theorem 4
The proof of Theorem 4 requires a special hash function, discussed next, as well as the useful lemmas in Appendix 2.
1.1 Special hash function
To prove that 4. in Definition 4 holds with \(C(n) \ge (n-1)/5\), \(n> 7\), we define the following special hash function. In what follows, the symbol \(\oplus\) denotes a list join operation with no duplicate removal, e.g. \(\{x,y\}\oplus \{x,z\}=\{x,y,x,z\}\).
Definition 8
The map \({\mathcal {H}}^{'n}\) transforms a triple \((i,j,r), 1\le i < j \le n, r \in {[n- 1]}\), to either 2, 3, or 4 triples according to
\({\mathcal {H}}^{'n}(i,j,r) = {\mathcal {H}}^{'n}_{1}(i,j,r) \oplus {\mathcal {H}}^{'n}_{2}(i,j,r),\)
where
\({\mathcal {H}}^{'n}_{1}(i,j,r) = \{(i,r,h'(i,r))\}\) if \(h'(i,r) = j\), and \({\mathcal {H}}^{'n}_{1}(i,j,r) = \{(i,j,h'(i,r)), (j,r,h'(i,r))\}\) otherwise,
and
\({\mathcal {H}}^{'n}_{2}(i,j,r) = \{(j,r,h'(j,r))\}\) if \(h'(j,r) = i\), and \({\mathcal {H}}^{'n}_{2}(i,j,r) = \{(i,j,h'(j,r)), (i,r,h'(j,r))\}\) otherwise,
with
\(h'(i,r) = 1 + \left( (i+r-1) \text { mod } n\right)\) if \(i < n\), and \(h'(n,r) = 1 + \left( r \text { mod } (n-1)\right) . \qquad (35)\)
We assume that the first two components of each output triple are ordered. For example, \((i, r, h'(j,r)) \equiv (\min \{i,r\}, \max \{i,r\}, h'(j,r))\).
The following property of \({\mathcal {H}}^{'n}\) is critical to lower bounding \(C(n)\).
Lemma 5
Let \((a,b,c) \in {\mathcal {H}}^{'n}(i,j,r)\), \(1 \le i < j \le n\), \(r \in [n-1]\). Then, \(1 \le a \le b \le n\), \(1 \le c \le n\), and \(c \notin \{a,b\}\). Furthermore,
\(\oplus _{1 \le i < j \le n,\; r \in [n-1]}\, {\mathcal {H}}^{'n}(i,j,r)\)
has at most 5 copies of each triple, where two triples are equal iff they agree component-wise.
Proof
Recall the definitions: \(h'(i,r) = 1 + \left( (i+r-1) \text { mod } n\right)\) if \(i < n\), and \(h'(n,r) = 1 + \left( r \text { mod } (n-1)\right)\).
The fact that \(1 \le a \le b \le n\) and that \(1 \le c \le n\) is immediate. The fact that \(c \notin \{a,b\}\) amounts to checking that \(h'(i,r) \notin \{i,r\}\) for all \(i\in [n]\) and \(r\in [n-1]\). This can be checked directly from (35). E.g. \(h'(i,r) = i\) would imply either \((i = n) \wedge (r \text { mod } (n-1) = i-1)\), or \((i < n) \wedge ((i+r-1)\text { mod } n= i-1)\), both of which are impossible. The rest of the proof amounts to checking that if (a, b, c) is in the output of \({\mathcal {H}}^{'n}\), then there are at most 5 different inputs that lead to (a, b, c). There are 10 possible candidate input triples that lead to output (a, b, c). Namely,
-
\((a, b, c) = (i_{1}, r_{1}, h'(i_{1}, r_{1})) = {\mathcal {H}}^{'n}(i_{1}, j_{1}, r_{1}),\) if \(j_{1} = h'(i_{1}, r_{1})\) and \(i_{1} < r_{1}\),
-
\((a, b, c) = (r_{2}, i_{2}, h'(i_{2}, r_{2}))={\mathcal {H}}^{'n}(i_{2}, j_{2}, r_{2}),\) if \(j_{2} = h'(i_{2}, r_{2})\) and \(r_{2} < i_{2}\),
-
\((a, b, c) = (i_{3}, j_{3}, h'(i_{3}, r_{3}))= {\mathcal {H}}^{'n} (i_{3}, j_{3}, r_{3}),\) if \(j_{3} \ne h'(i_{3}, r_{3})\),
-
\((a, b, c) = (j_{4}, r_{4}, h'(i_{4}, r_{4}))= {\mathcal {H}}^{'n}(i_{4}, j_{4}, r_{4}),\) if \(j_{4} \ne h'(i_{4}, r_{4})\) and \(j_{4} < r_{4}\),
-
\((a, b, c) = (r_{5}, j_{5}, h'(i_{5}, r_{5}))= {\mathcal {H}}^{'n}(i_{5}, j_{5}, r_{5}),\) if \(j_{5} \ne h'(i_{5}, r_{5})\) and \(r_{5} < j_{5}\),
-
\((a, b, c) = (j_{6}, r_{6}, h'(j_{6}, r_{6}))= {\mathcal {H}}^{'n}(i_{6}, j_{6}, r_{6}),\) if \(i_{6} = h'(j_{6}, r_{6})\) and \(j_{6} < r_{6}\),
-
\((a, b, c) = (r_{7}, j_{7}, h'(j_{7}, r_{7}))= {\mathcal {H}}^{'n} (i_{7}, j_{7}, r_{7}),\) if \(i_{7} = h'(j_{7}, r_{7})\) and \(r_{7} < j_{7}\),
-
\((a, b, c) = (i_{8}, j_{8}, h'(j_{8}, r_{8}))= {\mathcal {H}}^{'n} (i_{8}, j_{8}, r_{8}),\) if \(i_{8} \ne h'(j_{8}, r_{8})\),
-
\((a, b, c) = (i_{9}, r_{9}, h'(j_{9}, r_{9}))= {\mathcal {H}}^{'n} (i_{9}, j_{9}, r_{9}),\) if \(i_{9} \ne h'(j_{9}, r_{9})\) and \(i_{9} < r_{9}\),
-
\((a, b, c) = (r_{10}, i_{10}, h'(j_{10}, r_{10}))= {\mathcal {H}}^{'n} (i_{10}, j_{10}, r_{10}),\) if \(i_{10} \ne h'(j_{10}, r_{10})\) and \(r_{10} < i_{10}\).
Twelve pairs of the 10 scenarios above cannot simultaneously hold. These 12 pairs are displayed as edges of a graph in Fig. 8. E.g. edge (4, 10) represents that scenarios 4 and 10 cannot both hold. The proof that the pairs represented by edges in Fig. 8 cannot both hold is given below. A maximum independent set of this graph is \(\{3,8,9,2,x\}\), where \(x \in \{1,4,6,10\}\). Thus at most 5 input scenarios lead to a given output triple.
We now prove that each pair of scenarios joined by an edge in Fig. 8 cannot both hold.
Recall that for any input triple (i, j, r) we always have \(1 \le i < j \le n\), and \(r\in [n-1]\).
Scenarios 1 and 5 cannot both hold, because that would imply \(r_5=i_1, j_5=r_1\), which would imply \(h'(r_5,j_5)=h'(i_1,r_1)=h'(i_5,r_5)\), which since \(r_5,i_5<n\) would imply \(i_5=j_5\), contradicting \(i_5 < j_5\).
Scenarios 1 and 6 cannot both hold, because that would imply \(a = i_{1}< j_{1} = h'(i_{1}, r_{1}) = c = h'(j_{6}, r_{6}) = i_{6} < j_{6} = a\).
Scenarios 2 and 7 cannot both hold, because that would imply that \(j_2 = h'(i_2,r_2) = h'(j_7,r_7) = i_7 < j_7 = i_2\), contradicting \(i_2 < j_2\).
Scenarios 1 and 4 cannot both hold, because that would imply \(j_4=i_1<n\) and \(r_1 = r_4\), which would imply \(h'(j_4,r_4)=h'(i_1,r_1)=h'(i_4,r_4)\), which since \(i_4,j_4 < n\) would imply \(j_4 = i_4\), contradicting \(i_4 < j_4\).
Scenarios 2 and 5 cannot both hold, because that would imply \(j_5=i_2<n, r_5=r_2\), which would imply \(h'(j_5,r_5)=h'(i_2,r_2)=h'(i_5,r_5)\), which since \(j_5<n\) would imply \(i_5=j_5\), contradicting \(i_5 < j_5\).
Scenarios 6 and 10 cannot both hold, because that would imply \(r_{10} = j_6< r_6 = i_{10} < n\), which would imply \(h'(r_{10},i_{10})=h'(j_6,r_6)=h'(j_{10},r_{10})\). This in turn would imply one of two things. If \(j_{10} < n\), then \(h'(r_{10},i_{10})=h'(j_{10},r_{10})\) would imply \(j_{10}=i_{10}\), contradicting \(i_{10} < j_{10}\). If, on the other hand, \(j_{10} = n\), then \(h'(r_{10},i_{10})=h'(j_{10},r_{10})\) would imply \(1+((r_{10} + i_{10} - 1) \text { mod } n) = 1 + (r_{10} \text { mod } (n-1))\). Recalling that \(r_{10} < r_6 \le n-1\), we would get \((i_{10} - 1) \text { mod } n = 0\). This would imply \(i_{10} = 1\), contradicting \(i_{10}> r_{10} \ge 1\).
Scenarios 7 and 10 cannot both hold, because that would imply \(r_7=r_{10}<j_7=i_{10}\le n-1\), which would imply \(h'(i_{10},r_{10})=h'(j_7,r_7)=h'(j_{10},r_{10})\). This in turn would imply one of two things. If \(j_{10} < n\), then \(h'(i_{10},r_{10})=h'(j_{10},r_{10})\) would imply \(i_{10}=j_{10}\), contradicting \(i_{10} < j_{10}\). If, on the other hand, \(j_{10} = n\), then \(h'(i_{10},r_{10})=h'(j_{10},r_{10})\) would imply \(1+((i_{10} + r_{10} - 1) \text { mod } n) = 1 + (r_{10} \text { mod } (n-1))\). Recalling that \(r_{10} < j_7 =i_{10}\le n-1\), we would get \((i_{10} - 1) \text { mod } n = 0\). This would imply \(i_{10} = 1\), contradicting \(i_{10} > r_{10} \ge 1\).
Scenarios 5 and 6 cannot both hold, because that would imply \(j_6 = r_5 < j_5 = r_6 \le n-1\), and \(h'(i_5, r_5)=h'(j_6,r_6)\), which would imply \(h'(i_5,j_6) = h'(j_6,j_5)\), which since \(j_6 \le n-2\) would imply \(i_5=j_5\), contradicting \(i_5< j_5\).
Scenarios 1 and 10 cannot both hold, because that would imply \(r_1=i_{10} > i_1 = r_{10} \ge 1\), which would imply \(h'(r_{10},i_{10}) = h'(i_1,r_1) = h'(j_{10},r_{10})\). This would imply one of two things. If \(j_{10}< n\), then, recalling that \(r_{10}\le n\), this would imply \(i_{10} = j_{10}\), contradicting \(i_{10} < j_{10}\). If, on the other hand, \(j_{10} = n\), this would imply \(1 + (r_{10} \text { mod } (n- 1)) = 1 + ((r_{10} +i_{10} - 1) \text { mod } n)\), which would imply \(i_{10} = 1\), contradicting \(i_{10} > 1\).
Scenarios 4 and 6 cannot both hold, because that would imply \(j_4 = j_6< r_4 = r_6 < n\), which would imply \(h'(i_4,r_4)=h'(j_6,r_6) = h'(j_4,r_4)\), which recalling that \(j_4 < n\) would imply \(i_4 = j_4\), contradicting \(i_4 < j_4\).
Scenarios 4 and 7 cannot both hold, because that would imply \(j_4 = r_7< j_7 = r_4 < n\), which would imply \(h'(i_4,r_4)=h'(r_7,j_7) = h'(j_4,r_4)\), which recalling that \(j_4 < n\) would imply \(i_4 = j_4\), contradicting \(i_4 < j_4\).
Scenarios 4 and 10 cannot both hold, because that would imply \(i_4< j_4 = r_{10}< r_4 = i_{10} < j_{10}\), which would imply \(h'(i_4,i_{10}) = h'(i_4,r_{4}) = h'(j_{10},r_{10}) = h'(j_{10},j_{4})\). This would imply one of two things. If \(j_{10} < n\), this would imply \(1 + ((i_4 + i_{10} - 1)\text { mod } n) = 1 + ((j_4 + j_{10} - 1)\text { mod } n)\), which would imply \((j_4 - i_4)+ (j_{10} - i_{10}) = 0 \text { mod } n\), which since \(j_4> i_4\) and \(j_{10} > i_{10}\) would imply \((j_4 - i_4)+ (j_{10} - i_{10}) = n\). This in turn would imply \(j_{10}= n+ (i_{10} - j_4)+ i_4> n+ i_4 > n\), since \(i_{10} - j_4 > 0\) and \(i_4 > 0\), contradicting \(j_{10} \le n\). If, on the other hand, \(j_{10} = n\), this would imply \(1 + ((i_4 + i_{10} - 1)\text { mod } n) = 1 + (j_4 \text { mod } (n- 1))\), which since \(j_4 = r_{10} < r_4 \le n-1\) would imply \((j_4 - i_4 - i_{10} + 1) \text { mod } n = 0\), which implies either \(j_4 - i_4 - i_{10} + 1 = 0\) or \(j_4 - i_4 - i_{10} + 1 = -n\). The first option would imply \(j_4 = i_{10} + i_4 - 1 \ge i_{10}\), contradicting \(j_4 < i_{10}\). The second option would imply \(i_{10} = n+ 1 + j_4 - i_4 > n\), contradicting \(i_{10} < n\). \(\square\)
Remark 9
Note that we might have \(a = b\) in a triple (a, b, c) output by \({\mathcal {H}}^{'n}\). For example, if \(n = 4\), all 5 triples (1, 2, 3), (1, 3, 2), (2, 3, 2), (2, 3, 3), and (2, 4, 3) map to (2, 3, 1). Also, both (1, 2, 1) and (1, 4, 1) map to (1, 1, 2), whose first two components are equal.
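For reference, the following sketch implements \(h'\) and \({\mathcal {H}}^{'n}\) as defined above (the function names are ours) and checks the multiplicity bound of Lemma 5 exhaustively for small \(n\):

```python
import itertools
from collections import Counter

def hp(i, r, n):
    # h'(i, r) = 1 + ((i + r - 1) mod n) for i < n, and 1 + (r mod (n - 1)) for i = n,
    # so that h'(i, r) is never in {i, r}.
    return 1 + ((i + r - 1) % n) if i < n else 1 + (r % (n - 1))

def Hp(i, j, r, n):
    def srt(a, b, c):
        return (min(a, b), max(a, b), c)    # first two components ordered
    A = ([srt(i, r, hp(i, r, n))] if hp(i, r, n) == j
         else [srt(i, j, hp(i, r, n)), srt(j, r, hp(i, r, n))])
    B = ([srt(j, r, hp(j, r, n))] if hp(j, r, n) == i
         else [srt(i, j, hp(j, r, n)), srt(i, r, hp(j, r, n))])
    return A + B

for n in range(4, 10):
    pooled = Counter(t for i, j in itertools.combinations(range(1, n + 1), 2)
                       for r in range(1, n)
                       for t in Hp(i, j, r, n))
    assert all(1 <= a <= b <= n and 1 <= c <= n and c not in (a, b)
               for a, b, c in pooled)
    assert max(pooled.values()) <= 5        # Lemma 5: at most 5 copies of any triple
```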
1.2 Proof of lower bound on C(n)
We will show
For \(r\in [n]\), let \({\varvec{p}}^{(*r)}\) be a minimizer that leads to \({\mathcal {W}}^{1,\ldots ,r-1,r+1,\ldots ,n+1}\). We would normally use \({\varvec{r}}^{(*r)}\) for this minimizer, but, to avoid confusion between \({\varvec{r}}\) and \(r\), we avoid doing so. For \(i,j\in [n+1]\backslash \{r\}\), let \({{\varvec{p}}^{(*r)}}^{i,j}\) be the marginal of \({\varvec{p}}^{(*r)}\) for the sample space \({\Omega }^{i}\times {\Omega }^{j}\). Since \({\varvec{p}}^{(*r)}\) satisfies the constraints in (6), its marginal over \({\Omega }^{i}\) equals \({\varvec{p}}^{i}\).
Let \(h'(\cdot ,\cdot )\) be the map in (35). For each \(r \in [n-1]\), define the mass function over \({\Omega }^{1}\times \cdots \times {\Omega }^{n}\)
where \({{\varvec{p}}^{(*h'(i,r))}}^{i \mid r}\) satisfies \({{\varvec{p}}^{(*h'(i,r))}}^{i \mid r} {\varvec{p}}^{r} = {{\varvec{p}}^{(*h'(i,r))}}^{i, r}\). Note that \(h'(i,r) \notin \{i,r\}, \forall \ 1\le i\le n\), and \(r \in [n-1]\). Thus, \({{\varvec{p}}^{(*h'(i,r))}}^{i, r}\) and \({{\varvec{p}}^{(*h'(i,r))}}^{i \mid r}\) exist. Let \({{\varvec{q}}^{(r)}}^{i}\) be the marginal of \({\varvec{q}}^{(r)}\) over \({\Omega }^{i}\), and \({{\varvec{q}}^{(r)}}^{i,j}\) over \({\Omega }^{i} \times {\Omega }^{j}\).
By Lemma 2, we know that \({{\varvec{q}}^{(r)}}^{i}\) equals \({\varvec{p}}^{i}\) (given) for all \(i \in [n]\), and hence \({\varvec{q}}^{(r)}\) satisfies the optimization constraints in (6) for \({\mathcal {W}}^{1,\ldots ,n}\). Therefore, we can write
where \({{{\varvec{p}}}^{*}}^{i,j}\) is the bivariate marginal over \({\Omega }^{i}\times {\Omega }^{j}\) of the minimizer \({{\varvec{p}}}^{*}\) for \({\mathcal {W}}^{1,\ldots ,n}\).
We now bound each term in the innermost sum on the r.h.s. of (38) as
where \(i\ne r\), \(r\ne j\), and: (a) holds by Lemma 3; (b) holds because \({d}\) is symmetric; and (c) holds because, by Lemma 2, \({{\varvec{q}}^{(r)}}^{i,r}={{\varvec{p}}^{(*h'(i,r))}}^{i, r}\) and \({{\varvec{q}}^{(r)}}^{j,r}={{\varvec{p}}^{(*h'(j,r))}}^{j, r}\).
Bounding the r.h.s. of (38) using (39)–(41), we rewrite the resulting inequality using the notation
where (a) each \(w_{(i,j,r)}\) represents one \(\left\langle {d}^{i,j},{{{\varvec{p}}}^{*}}^{i,j} \right\rangle _{\ell }^{\frac{1}{\ell }}\) on the l.h.s. of (38) (since \(h'(i,r) \notin \{i,r\}\), when \(i \ne r\) the mass \({{\varvec{p}}^{(*h'(i,r))}}^{i,r}\) exists); (b) each \(v_{(s,t,l)}\) represents \(\left\langle {d}^{s,t},{{\varvec{p}}^{(*l)}}^{s,t} \right\rangle _{\ell }^{\frac{1}{\ell }}\) if \(s \ne t\), and is zero if \(s = t\); and (c) we are implicitly assuming that the first two components of each triple on the r.h.s. of (42) are ordered, i.e. if e.g. \(r < i\), then \((i,r,h'(i,r))\) should be read as \((r,i,h'(i,r))\).
Finally, using this same compact notation, we write
and now we will show that (43) upper-bounds the r.h.s. of (42), finishing the proof.
First, by Lemma 3 and the symmetry of d, observe that the following inequalities are true
and,
as long as, for each triple (a, b, c) in the above expressions, \(c \notin \{a,b\}\). We will use inequalities (44) and (45) to upper bound some of the terms on the r.h.s. of (42), and then we will show that the resulting sum can be upper bounded by (43). In particular, for each (i, j, r) considered in the r.h.s. of (42), we will apply inequalities (44) and (45) such that the terms \(v_{(a,b,c)}\) that we get after their use have triples (a, b, c) that match the triples in \({{\mathcal {H}}^{'}}^{n}(i,j,r)\), defined in Definition 8. For example, if \({{\mathcal {H}}^{'}}^{n}\) maps (i, j, r) to \(\{(i,r,h'(i,r)),(r,j,h'(j,r))\}\), then we do not apply (44) and (45), and we leave \(v_{(i,r,h'(i,r))} + v_{(r,j,h'(j,r))}\) as is on the r.h.s. of (42). If, for example, \({{\mathcal {H}}^{'}}^{n}\) maps (i, j, r) to \(\{(i,r,h'(i,r)),(i,j,h'(j,r)),(i,r,h'(j,r))\}\), then we leave the first term in \(v_{(i,r,h'(i,r))} + v_{(r,j,h'(j,r))}\) on the r.h.s. of (42) untouched, but we upper bound the second term using (45) to get \(v_{(i,r,h'(i,r))} + v_{(i,j,h'(j,r))} + v_{(i,r,h'(j,r))}\).
After proceeding in this fashion, and by Lemma 5, we know that all of the terms \(v_{(a,b,c)}\) that we obtain have triples (a, b, c) with \(c\notin \{a,b\}\), \(c\in [n]\), and \(1\le a \le b \le n\). Therefore, these terms are either zero (if \(a=b\)) or appear in (43). Also because of Lemma 5, each triple (a, b, c) with non-zero \(v_{(a,b,c)}\) appears at most 5 times. Therefore, the upper bound we built with the help of \(h'\) for the r.h.s of (42) can be upper bounded by (43). \(\square\)
1.3 Proof of upper bound on C(n)
Consider the following setup. Let \({\vert \Omega ^{i}\vert } = \vert \Omega \vert\) for all \(i \in [n]\), and \({\Omega }^{i}_{s} \in \mathbb {R}\) for all \(i \in [n]\), \(s \in [\vert \Omega \vert ]\). Define \({d}\) such that \({{d}^{i,j}_{s,t}}\) is \(\vert {\Omega }^{i}_{s} - {\Omega }^{j}_{t}\vert\) if \(s = t\), and infinity otherwise. Let \({\varvec{p}}^{i}_{s} = \frac{1}{\vert \Omega \vert }\) for all \(i \in [n]\), \(s \in [\vert \Omega \vert ]\).
Any optimal solution \({{\varvec{r}}}^{*}\) to the pairwise MMOT problem must have bivariate marginals that satisfy \({{{\varvec{r}}}^{*}}^{i,j}_{s,t} = \frac{1}{\vert \Omega \vert } \delta _{s,t}\), and thus \(\left\langle {d}^{i,j},{{{\varvec{r}}}^{*}}^{i,j} \right\rangle _{\ell }^{\frac{1}{\ell }}= \frac{1}{\vert \Omega \vert ^{1/\ell }}\Vert {\Omega }^{i} - {\Omega }^{j}\Vert _{\ell }\), where we interpret \({\Omega }^{i}\) as a vector in \(\mathbb {R}^{\vert \Omega \vert }\), and \(\Vert \cdot \Vert _{\ell }\) is the vector \(\ell\)-norm. Therefore, ignoring the factor \(\frac{1}{\vert \Omega \vert ^{1/\ell }}\), we only need to prove that 4. in Definition 4 holds with \(C(n) = n- 1\) when \({\mathcal {W}}^{1:n}\) is defined as \(\sum _{1\le i<j\le n} \Vert {\Omega }^{i} - {\Omega }^{j}\Vert _{\ell }\). This in turn is a standard result, whose proof (in a more general form) can be found e.g. in Example 2.4 in Kiss et al. (2018).