Metric Dimension and Resolvability of Jaccard Spaces
Abstract
A subset of points in a metric space is said to resolve it if each point in the space is uniquely characterized by its distance to each point in the subset. In particular, resolving sets can be used to represent points in abstract metric spaces as Euclidean vectors. Importantly, due to the triangle inequality, nearby points in the space are represented as vectors with similar coordinates, which may find applications in classification problems of symbolic objects under suitably chosen metrics. In this manuscript, we address the resolvability of Jaccard spaces, i.e., metric spaces of the form $(2^X,\mathrm{Jac})$, where $2^X$ is the power set of a finite set $X$, and Jac is the Jaccard distance between subsets of $X$. Specifically, for different $A,B\in 2^X$, $\mathrm{Jac}(A,B)=|A\Delta B|/|A\cup B|$, where $|\cdot|$ denotes size (i.e., cardinality) and $\Delta$ denotes the symmetric difference of sets. We combine probabilistic and linear algebra arguments to construct highly likely but nearly optimal (i.e., of minimal size) resolving sets of $(2^X,\mathrm{Jac})$. In particular, we show that the metric dimension of $2^X$, i.e., the minimum size of a resolving set of this space, is . In addition, we show that a much smaller subset of $2^X$ suffices to resolve, with high probability, all different pairs of subsets of $X$ of cardinality at most , up to a factor.
Keywords. Jaccard distance, metric dimension, metric space, multilateration, resolving set
1 Introduction
A metric space is an ordered pair of the form $(M,d)$, where $M$ is a nonempty set, and $d:M\times M\to\mathbb{R}$ a function satisfying that $d(x,y)=0$ if and only if $x=y$, and $d(x,y)\le d(x,z)+d(y,z)$, for all $x,y,z\in M$. In particular, $d$ is non-negative, symmetric, and satisfies the triangle inequality. We say the metric space is finite when $|M|<\infty$.
Resolvability extends the concept of trilateration of the plane to general metric spaces; in particular, it includes the vertex set of connected graphs endowed with shortest path distances between vertices, which is where the concept originated [5, 21, 8]. In a metric space $(M,d)$, a non-empty set $R=\{r_1,\ldots,r_k\}\subseteq M$, with $k=|R|$, is said to resolve it when the transformation
$$\Phi(x):=\big(d(x,r_1),\ldots,d(x,r_k)\big),\quad x\in M, \tag{1}$$
is one-to-one. In particular, $\Phi$ uniquely encodes points in $M$ as $k$-dimensional real vectors; and, owing to the triangle inequality, proximate points in $M$ are encoded as vectors with similar coordinates. Resolving sets thus enable sound embeddings of metric spaces into Euclidean ones, which can be useful for generating numerical features of symbolic objects in statistical and machine learning tasks like regression or classification [24, 20].
One can think of a resolving set as a collection of “landmarks” in a metric space that uniquely identify the “location” of any point in that space by its distance to those landmarks. In that regard, resolvability serves as a form of “multi-lateration” of the space, similar to tri-lateration, although more than three landmarks may be needed to resolve a given metric space.
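As a minimal sketch of the landmark idea, assuming a finite metric space given as a list of points together with a distance function, the following Python code checks whether the distance vectors to a candidate set are pairwise distinct. The names `encode` and `is_resolving`, and the toy space (a path on five vertices with shortest-path distance), are illustrative assumptions and do not come from the references.

```python
def encode(point, landmarks, dist):
    """Vector of distances from `point` to each landmark, in order."""
    return tuple(dist(point, r) for r in landmarks)


def is_resolving(points, landmarks, dist):
    """True when no two points share the same distance vector."""
    codes = [encode(p, landmarks, dist) for p in points]
    return len(set(codes)) == len(codes)


# Toy space (assumed for illustration): path 0-1-2-3-4, distance |i - j|.
path_points = list(range(5))


def path_dist(i, j):
    return abs(i - j)


print(is_resolving(path_points, [0], path_dist))  # an endpoint alone resolves the path
print(is_resolving(path_points, [2], path_dist))  # the midpoint does not: 1 and 3 collide
```

In this toy space a single endpoint suffices, whereas the midpoint cannot tell apart vertices that are symmetric about it.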
Irrespective of the metric space, resolving sets always exist, although they are never unique in non-trivial settings. This is because always resolves , and if resolves and , then also resolves it. So, finding a resolving set is straightforward. In contrast, finding a resolving set with the smallest possible size is usually challenging; in fact, it is an NP-complete problem in arbitrary finite metric spaces [11, 6]. Minimizing the size of a resolving set is nonetheless crucial to embedding the points in into a low-dimensional Euclidean space using transformations of the form (1). This motivates the notion of metric dimension, which is the size of the smallest resolving set of a metric space , denoted from now on as .
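To illustrate why minimizing the size of a resolving set is costly, here is a hedged brute-force sketch that tries candidate sets in order of increasing size. The function names and the 4-cycle example are our own assumptions; the search is exponential in the number of points and is only meant for tiny spaces.

```python
from itertools import combinations


def is_resolving(points, landmarks, dist):
    """True when distance vectors to the landmarks are pairwise distinct."""
    codes = {tuple(dist(p, r) for r in landmarks) for p in points}
    return len(codes) == len(points)


def metric_dimension(points, dist):
    """Exhaustive search: smallest k admitting a resolving set of size k."""
    points = list(points)
    for k in range(1, len(points) + 1):
        for candidate in combinations(points, k):
            if is_resolving(points, candidate, dist):
                return k, candidate
    raise ValueError("unreachable: the whole space always resolves itself")


# Toy example (assumed): the 4-cycle with shortest-path distance.
def cycle_dist(i, j):
    return min(abs(i - j), 4 - abs(i - j))


print(metric_dimension(range(4), cycle_dist))
```

The final `raise` is never reached for a genuine metric, because the full set of points always resolves the space.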
For a concise overview of resolvability and metric dimension in the context of graph theory, see [22]; for a comprehensive review of these and related concepts, see [23, 14].
Very few studies have addressed the resolvability of non-graphical metric spaces in the literature [16, 2], as most efforts have focused on finite graphs [4]. Nevertheless, spaces with metric dimensions 1 or 2 have been characterized under general topological assumptions [16, 2]. It is also known that the metric dimension of a -dimensional subspace of with respect to the Euclidean distance is ; in particular, has metric dimension [2]. The hypersphere also has metric dimension . Additionally, the metric dimension of the hyperbolic space with respect to the metric , for all , is [2]. Likewise, the metric dimension of the -dimensional unit ball with respect to the metric , with , is [2].
In contrast, the systematic study of the resolvability and metric dimension of non-graphical, finite, metric spaces is essentially unexplored. In this paper, we study the resolvability of finite Jaccard metric spaces, i.e., metric spaces of the form $(2^X,\mathrm{Jac})$, where $2^X$ denotes the power set of a finite set $X$, and Jac is the Jaccard distance between subsets of $X$ [10]. Namely, for all $A,B\in 2^X$,
$$\mathrm{Jac}(A,B):=\begin{cases}\dfrac{|A\Delta B|}{|A\cup B|}, & A\neq B;\\ 0, & A=B.\end{cases}$$
Jac is a metric in $2^X$ [7, 13]. (In the literature, for distinct $A,B\in 2^X$, the quantity $1-\mathrm{Jac}(A,B)=|A\cap B|/|A\cup B|$ is referred to as the Jaccard similarity. This index is widely used in fields such as information retrieval, data mining, and natural language processing, among many others.)
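For concreteness, the Jaccard distance can be computed exactly with rational arithmetic. The sketch below is ours (the function name `jac` is an assumption) and uses the convention that the distance between two empty sets is zero, as any metric requires.

```python
from fractions import Fraction


def jac(a, b):
    """Jaccard distance |A Δ B| / |A ∪ B|, with Jac(∅, ∅) = 0."""
    a, b = frozenset(a), frozenset(b)
    union = a | b
    if not union:  # both sets are empty
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


A, B = {1, 2}, {2, 3}
print(jac(A, B))      # distance
print(1 - jac(A, B))  # Jaccard similarity |A ∩ B| / |A ∪ B|
```

Using `Fraction` avoids floating-point ties when two distances are compared for equality, which matters when testing whether a set resolves a pair.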
Given that is finite in our setting, we may, in principle, estimate and find non-trivial resolving sets with the so-called Information Content Heuristic (ICH) [9]. In a general setting, the input of this algorithm is the (symmetric) distance matrix between all pairs of points in a metric space, and the output is a subset of columns that resolves it, which is determined greedily through an entropy maximization procedure. Unfortunately, in the context of Jaccard spaces, the ICH is infeasible even for moderate values of because of its time complexity.
Nevertheless, besides being of theoretical interest, learning to resolve Jaccard spaces optimally or nearly optimally may find applications in, e.g., lexicon-based approaches to natural language processing (NLP). In the most basic implementation of this idea, would be the set of all words in a language, and sentences would be represented as subsets of (i.e., as bags of words). The Jaccard distance is then a natural way to assess the similarity of sentences based on the words used, and a resolving set would induce a numerical encoding of sentences, mapping sentences with similar word content into vectors with similar coordinates, potentially providing low-dimensional feature vectors to learn to classify or regress sentences based on their lexicon [17].
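The encoding described above can be sketched in a few lines of Python. Everything specific here is a hypothetical illustration: the three landmark bags, the whitespace tokenizer, and the helper names `bag` and `encode` are our assumptions, and the landmarks are not claimed to resolve any lexicon.

```python
from fractions import Fraction


def jac(a, b):
    """Jaccard distance with the convention Jac(∅, ∅) = 0."""
    union = a | b
    return Fraction(len(a ^ b), len(union)) if union else Fraction(0)


def bag(sentence):
    """Bag of words of a sentence (naive whitespace tokenizer, assumed)."""
    return frozenset(sentence.lower().split())


# Hypothetical landmark bags, chosen only for illustration.
LANDMARKS = [bag("the cat"), bag("dog barks"), bag("sat mat")]


def encode(sentence):
    """Feature vector of Jaccard distances to each landmark."""
    return tuple(jac(bag(sentence), r) for r in LANDMARKS)


for s in ["the cat sat on the mat", "the cat sat on the rug", "a dog barks"]:
    print(s, "->", [float(d) for d in encode(s)])
```

The first two sentences share most of their words and receive codes that differ in a single coordinate, while the third lands far from both.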
1.1 Main Results
In what remains of this manuscript, is assumed to be a finite non-empty set.
In this section, we outline our key findings, with expanded statements and proofs provided in Section 2.
From now on, the Jaccard distance is the reference metric in ; in particular, e.g., statements like “ resolves ,” mean that “ resolves .” We also say that resolves when there exists such that .
We first provide a necessary condition for a set to resolve .
Proposition 1.1.
If resolves then separates the distinct elements of , and it covers all but possibly one element in .
The proof of the proposition can be found in Section 2.1. We note that these properties are necessary but not sufficient. For instance, if and then separates different elements in and also covers it. Nevertheless, is not resolving because . This counterexample can be easily generalized to sets of arbitrary size.
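The necessary conditions in Proposition 1.1 can be checked exhaustively on very small ground sets. The following sketch enumerates every non-empty collection of subsets of a ground set with at most three elements and confirms that each resolving collection separates the elements of the ground set and leaves at most one element uncovered. All function names are our own, and the check is a sanity test for tiny cases, not a proof.

```python
from fractions import Fraction
from itertools import combinations


def jac(a, b):
    union = a | b
    return Fraction(len(a ^ b), len(union)) if union else Fraction(0)


def powerset(x):
    x = list(x)
    return [frozenset(c) for k in range(len(x) + 1) for c in combinations(x, k)]


def resolves(R, P):
    """True when the sets in R give distinct distance vectors to all of P."""
    return len({tuple(jac(a, r) for r in R) for a in P}) == len(P)


def separates(R, X):
    """Every pair of distinct elements is split by some set in R."""
    return all(any((x in r) != (y in r) for r in R) for x, y in combinations(X, 2))


def uncovered(R, X):
    """Elements of X that belong to no set in R."""
    return set(X) - set().union(*R)


def check_necessary_conditions(n):
    X = range(n)
    P = powerset(X)
    resolving = violations = 0
    for k in range(1, len(P) + 1):
        for R in combinations(P, k):
            if not resolves(R, P):
                continue
            resolving += 1
            if not separates(R, X) or len(uncovered(R, X)) > 1:
                violations += 1
    return resolving, violations


print(check_necessary_conditions(3))  # (number of resolving collections, violations)
```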
Next, we provide a lower bound on the size of any resolving subset of ; in particular, this is also a lower bound for .
Proposition 1.2.
If resolves then
The proof of the proposition can be found in Section 2.2.
To state our main two results we require the following definition.
Definition 1.1.
A random is said to have a Binomial distribution, in which case we write , when for each , and the events with are independent.
Clearly, if then ; namely, for .
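A random subset in the sense of Definition 1.1 can be sampled by flipping an independent coin for each element of the ground set. The sketch below is illustrative; the function name `random_subset` and the fixed seed are assumptions.

```python
import random


def random_subset(X, p, rng):
    """Include each element of X independently with probability p."""
    return frozenset(x for x in X if rng.random() < p)


rng = random.Random(0)  # fixed seed (assumed) for reproducibility
X = range(20)
sizes = [len(random_subset(X, 0.5, rng)) for _ in range(2000)]
print(sum(sizes) / len(sizes))  # empirical mean size, close to p * |X| = 10
```

The size of such a subset is a sum of independent indicator variables, which is why it follows an ordinary Binomial distribution.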
Theorem 1.1.
If and are independent and identically distributed (i.i.d.), then, for each , resolves , with overwhelmingly high probability, as .
The proof of the theorem can be found in Section 2.5 and relies on auxiliary results in Sections 2.3 and 2.4.
In conjunction, Proposition 1.2 and Theorem 1.1 imply that
which characterizes the metric dimension of with respect to the Jaccard distance within a factor of . In particular, we can assert the following.
Corollary 1.1.
, as .
It turns out that, for any , the set resolves all pairs of subsets of with different cardinalities (see Lemma 2.1 ahead). So the crux of the proof of Theorem 1.1 lies in showing that the sets in resolve all possible pairs of equal size—with overwhelmingly high probability—when is large. We demonstrate this in Section 2.5.
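The construction can be explored numerically on small ground sets. The sketch below assumes, based on the discussion above and on the choice of parameter 1/2 in Section 2.5, that the candidate set consists of the empty set, a singleton, its complement, and k independent random subsets that include each element with probability 1/2. It estimates by Monte Carlo how often such a set resolves the whole power set. The function names, the values of n and k, and the seed are ours, and the estimate says nothing about the asymptotic regime of Theorem 1.1.

```python
import random
from fractions import Fraction
from itertools import combinations


def jac(a, b):
    union = a | b
    return Fraction(len(a ^ b), len(union)) if union else Fraction(0)


def powerset(x):
    x = list(x)
    return [frozenset(c) for k in range(len(x) + 1) for c in combinations(x, k)]


def resolves(R, P):
    return len({tuple(jac(a, r) for r in R) for a in P}) == len(P)


def candidate(X, k, rng):
    """Empty set, a singleton, its complement, plus k random subsets (p = 1/2)."""
    X = list(X)
    x0 = X[0]
    base = [frozenset(), frozenset([x0]), frozenset(X) - {x0}]
    rand = [frozenset(x for x in X if rng.random() < 0.5) for _ in range(k)]
    return base + rand


def estimate(n, k, trials, seed=0):
    """Fraction of trials in which the candidate set resolves the power set."""
    rng = random.Random(seed)
    X = list(range(n))
    P = powerset(X)
    hits = sum(resolves(candidate(X, k, rng), P) for _ in range(trials))
    return hits / trials


for k in (0, 2, 4, 8):
    print(k, estimate(6, k, trials=200))
```

With k = 0 the three deterministic sets never resolve pairs of equal size that avoid the chosen singleton, so the estimate is zero; it typically rises as more random subsets are added.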
In the context of potential NLP applications outlined in the Introduction, it is unclear whether the highly likely resolving set proposed in Theorem 1.1 is of any practical value for distinguishing between bags-of-words of different cardinalities. This is because the numerical encoding in (1) based on this set might differentiate such pairs solely based on the presence or absence of a single word or token, which seems too coarse for practical use in NLP classification (or regression) problems. Our following result addresses this issue by proposing a less contrived set, which is likely to resolve all pairs of bags-of-words of different cardinalities. Its proof can be found in Section 2.6.
Theorem 1.2.
Let . If and are i.i.d., then resolves all pairs of subsets of of different size, with overwhelmingly high probability, as .
As expected, the lower bound for the size of the set in Theorem 1.2 is asymptotically negligible compared to the one in Theorem 1.1; after all, the former set is only required to resolve pairs of subsets of with different cardinalities, which, as explained earlier, can be accomplished using just three subsets of (i.e., the empty set, and any singleton and its complement). Nevertheless, in practical situations—for instance, when representing social media posts as bags-of-words—more often than not, a random pair of posts would be associated with bags-of-words of different cardinality. In particular, in terms of the numerical encoding in (1), Theorem 1.2 suggests that Jaccard distances, as opposed to , should suffice in practice to encode posts effectively when the reference lexicon is sufficiently large. Our following result makes this intuition precise at the expense of limiting the size of bags-of-words one wishes to resolve.
Corollary 1.2.
Let . If and are i.i.d., then the set resolves all different pairs of subsets of of size at most , with overwhelmingly high probability, as .
2 Technical Results and Proofs
2.1 Necessary Conditions for Resolvability
In this section, we prove Proposition 1.1. Specifically, suppose that resolves . We show next that the following properties hold:
(i) For all with , there exists such that either and , or and .
(ii) If , then covers , i.e., .
(iii) If , then covers , or there exists such that .
To show property (i), suppose, for contradiction, that there are distinct such that, for each , or . In the first case: , and in the second case: . In either case, could not possibly be resolving, which shows the first property.
To show property (ii), suppose that there is , which does not belong to any of the sets in . Then, for each , , which is not possible. This shows the second property.
Finally, to show property (iii), suppose there are distinct which do not belong to any of the sets in . Then, for each , , which is not possible and completes the proof of the proposition.
2.2 Metric Dimension Lower Bound
In this section, we prove Proposition 1.2.
Suppose that resolves . If , then by the Inclusion-Exclusion Principle, . Since , the range of , when restricted to sets such that , has size at most . In particular, due to the Pigeonhole Principle, we must have , i.e.:
(2) |
The right-most lower bound above should be a reasonable estimate of the best one (based on the Pigeonhole Principle) because is a slowly increasing function of , and , with , is maximized at (equivalently, ). To make the last numerator above more explicit, we use that [12, Exercise 24, §1.2.5]:
In particular, if then
The proposition is now direct from (2).
2.3 Resolving Subsets of of Different Cardinalities
Lemma 2.1.
For all and all , if then and are resolved by .
Proof.
Without any loss of generality assume that . Fix an and note that for each :
Define . Consider such that , and suppose that , for all . In particular, cannot be empty; otherwise, , implying that because Jac is a metric. However, the latter is not possible because . Likewise, cannot be empty.
Moreover, if , then . In particular, must be in as otherwise , which is not possible. But then, , i.e., , which is not possible either. Instead, if and then, because , we must have that , i.e., , which is again not possible. Hence, there has to be an such that , implying that resolves and . The same conclusion applies if , which completes the proof of the lemma. ∎
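Lemma 2.1 can be sanity-checked by brute force on small ground sets. The sketch below assumes, following the description in Section 1.1, that the three resolving sets are the empty set, a singleton, and the complement of that singleton, and it verifies that every pair of subsets with different cardinalities receives different distance vectors. The function names are ours, and the check covers only ground sets with at most five elements.

```python
from fractions import Fraction
from itertools import combinations


def jac(a, b):
    union = a | b
    return Fraction(len(a ^ b), len(union)) if union else Fraction(0)


def powerset(x):
    x = list(x)
    return [frozenset(c) for k in range(len(x) + 1) for c in combinations(x, k)]


def lemma_holds(n):
    """Check all choices of singleton and all pairs of different sizes."""
    X = list(range(n))
    P = powerset(X)
    for x in X:
        landmarks = [frozenset(), frozenset([x]), frozenset(X) - {x}]
        for A, B in combinations(P, 2):
            if len(A) != len(B) and all(jac(A, r) == jac(B, r) for r in landmarks):
                return False  # an unresolved pair of different sizes
    return True


print(all(lemma_holds(n) for n in range(1, 6)))
```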
2.4 Inner Product Characterization of Equidistant Sets
Two sets are said to be equidistant from an when . In this case, is not useful to resolve from when , and we say that and collide in terms of their Jaccard distance to .
In this section, we characterize collisions in linear algebra terms by representing subsets of as binary vectors. We note that linear algebra characterizations have been used to study the metric dimension of Hypercube graphs [3] and Hamming graphs [15].
In what follows, we represent elements in as binary vectors of dimension . Namely, for , when , and when . (For instance, is represented by a vector of all ones, whereas by a vector of all zeros.) Additionally, for and , denotes the inner product between the binary vector associated with and the vector . Namely:
In what follows, we use product notation to denote set intersections. Namely, if then .
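The binary-vector representation is straightforward to mirror in code. In the sketch below, which uses our own helper names `indicator` and `inner` and a fixed ordering of a small example ground set, the inner product of one set with the indicator vector of another recovers the size of their intersection.

```python
def indicator(A, X):
    """Binary vector of A with respect to the ordered ground set X."""
    return [1 if x in A else 0 for x in X]


def inner(A, v, X):
    """Inner product between the indicator vector of A and the vector v."""
    return sum(a * b for a, b in zip(indicator(A, X), v))


X = ["a", "b", "c", "d"]  # example ground set (assumed)
A, B = {"a", "c"}, {"c", "d"}

print(indicator(set(X), X))        # all ones for the full ground set
print(indicator(set(), X))         # all zeros for the empty set
print(inner(A, indicator(B, X), X), len(A & B))  # both equal |A ∩ B|
```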
The next result characterizes equidistant sets in terms of inner products. This characterization will be used in Section 2.5.1, in the proof of Theorem 1.1, to assess the probability that two different subsets of , of the same size, collide in terms of their distance to a random subset of .
Lemma 2.2.
Let and define the vector . If then . Conversely, if and then .
Proof.
We show first that
(3) |
For this, observe that and ; from which the identity in equation (3) is immediate due to the bilinearity of inner products.
Since , to complete the proof, it suffices to show that if then if and only if . For this, note that and , for all . In particular, simple algebra shows that is equivalent to having , that is, due to the bilinearity of inner products. ∎
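The equal-size case, which is the one used later in Section 2.5.1, can be verified exhaustively on a small ground set. The sketch below assumes that the relevant vector is the difference of the indicator vectors of the two sets (our reading of the construction), and checks that two sets of the same size collide with respect to a third set exactly when that difference vector is orthogonal to its indicator vector. The function names are ours.

```python
from fractions import Fraction
from itertools import combinations


def jac(a, b):
    union = a | b
    return Fraction(len(a ^ b), len(union)) if union else Fraction(0)


def powerset(x):
    x = list(x)
    return [frozenset(c) for k in range(len(x) + 1) for c in combinations(x, k)]


def indicator(A, X):
    return [1 if x in A else 0 for x in X]


def equal_size_characterization_holds(n):
    X = list(range(n))
    P = powerset(X)
    for A, B in combinations(P, 2):
        if len(A) != len(B):
            continue
        # Difference of indicator vectors, with entries in {-1, 0, 1}.
        z = [a - b for a, b in zip(indicator(A, X), indicator(B, X))]
        for R in P:
            collide = jac(A, R) == jac(B, R)
            orthogonal = sum(zi * ri for zi, ri in zip(z, indicator(R, X))) == 0
            if collide != orthogonal:
                return False
    return True


print(equal_size_characterization_holds(4))
```

For sets of equal size, the inner product condition simply states that both sets intersect the third set in the same number of elements.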
We also want an inner product characterization of sets and that collide not only in terms of their Jaccard distance to a set but also to , the complement of . Our next result provides a necessary condition for both collisions to occur. This characterization is used in Section 2.6.1 to prove Theorem 1.2.
Corollary 2.1.
Let . If and then .
Proof.
If and then Lemma 2.2 implies that and , where and . Hence, due to the identity in equation (3), we have that
(4) |
But and ; in particular, we may rewrite the expressions within the curly braces above as follows: , and . Finally, substituting these two expressions back in equation (4), and after several terms cancel, we obtain that
from which the Corollary follows. ∎
Table 1.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| 1 | 2 | 2 | 3 | 3 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |
| 1 | 1.5 | 2.0 | 2.33 | 2.66 | 3.5 | 4.4 | 3.87 | 4.3 | 5.8 | 5.9 | 6 | 6.4 | 7.4 |
2.5 Resolving Pairs of Subsets of of Equal Size
In what follows, and are i.i.d. random subsets. In particular, , which is consistent with the experimental results displayed in Table 1, and guided the selection of the parameter 1/2 in the Binomial distribution.
Define
In accordance with the probabilistic method, and to obtain an upper bound on the metric dimension of , we aim to find a such that the probability that does not resolve all distinct pairs of equal size is strictly less than one. If we can find such a , then there exists an with that resolves all different pairs of subsets of of equal size. In particular, due to Lemma 2.1, we could assert that . The challenge is to find as small as possible so that is a tight upper bound for the metric dimension of , and the following probability
(5) |
becomes asymptotically negligible as . Theorem 1.1 identifies a meeting this criterion.
2.5.1 Sizing the probability
In this section, we identify a in terms of , of the same order of magnitude as the asymptotic lower bound for in Proposition 1.2, such that .
Suppose there exists such that , , and for all . Then, per Lemma 2.2, for each , . But because . So , or equivalently:
But observe that , where
Consequently:
To bound the probability on the right-hand side above, consider a and the sets , and . Observe that and are non-empty, disjoint, and of the same size; let be said cardinality. Note that , and that are i.i.d. random variables. Further, since , are also i.i.d. As a result:
and
(6) |
But
where the big-O is direct from Stirling’s formula. On the other hand, Stirling’s formula also implies that . However, for a bona fide substitution of by in equation (6), one needs a stronger relationship between these two sequences. For this effect, observe that [19]:
In particular,
and from the inequality in equation (6) we see that
(7) |
The following result will let us handle the big-O term above.
Lemma 2.3.
If then , uniformly for all large enough and .
Proof.
It suffices to show that
(8) |
for all large enough and . For this, consider the function defined as , for . But note that
(9) |
where the second identity assumes that , in which case
In particular, and . Thus, as long as is large enough, , and equation (9) implies that is decreasing for and increasing for , which in turn implies that is maximized at or . Since , is maximized at ; in particular, the inequality in equation (8) is satisfied when , which shows the Lemma. ∎
2.6 Resolving Subsets of of Different Size
In this section, we prove Theorem 1.2.
After having characterized the asymptotic order of the metric dimension of , i.e., the asymptotically optimal size of resolving sets for , in this final section, we see how to resolve all pairs of subsets of of different size.
For this, consider the problem of resolving all distinct such that , using a set of the form
(10) |
where are i.i.d. with a distribution. It follows that
where
(11) |
2.6.1 Sizing the probability
Suppose that are such that , , and ; in particular, per Corollary 2.1, . But , , and . So, we may rewrite the last identity equivalently as follows:
Equivalently, if we define for each , the above identity is equivalent to
(12) |
where and are disjoint subsets of such that . Notably, for , the probability of the above event depends only on the quantities , , and , without regard to the specific identity of and , except for the constraints that and . So we may define as the probability of the event in (12)—when are such that , , and .
It follows from the above discussion that if
then
But note that the random vectors , with , are i.i.d. for any given . As a result:
(13) |
where the index in the inner sum above is such that and .
Lemma 2.4.
If and , with , then .
Proof.
Let be such that and ; in particular, . Then, for each :
where for the last inequality we have used the well-known Hoeffding’s inequality, and that has the same distribution as , where are independent random variables, with for , and for . Therefore, by selecting
(14) |
we obtain that
(15) |
But
(16) |
because the first factor above is an increasing function of , whereas the second factor is a decreasing function of . The lemma is now a direct consequence of the inequalities in (15)-(16). ∎
Remark 2.1.
The choice of in (14) is somewhat optimal when , which is a necessary condition for the upper-bound in (15) to be non-trivial. (The latter requires of course which, based on (16), can be guaranteed as soon as .) Indeed, from the last proof: , where for . But note that , with for ; hence is a critical point of . Moreover, since , is a local minimum when . In particular, since but when , is a local minimum of when .
Let be the upper bound for given in Lemma 2.4. It follows from (13) that
where, for an integer and , is the (upper) incomplete Gamma function. Finally, due to [18, Proposition 2.7], . Consequently,
where we have used Stirling’s approximation and the exp-log transform. In particular, for any , if we select so that , for instance, , then , which completes the proof of Theorem 1.2.
2.7 Resolving Comparatively Small Subsets of
In this section, we prove Corollary 1.2, which is a consequence of arguments already used in the proofs of Theorems 1.1 and 1.2. For this, let , and be an integer.
To show the Corollary, we reconsider the set in (10) with . By distinguishing pairs such that from , we find this time that
(17) |
where is the double-sum in (13), and is a truncated version of the summation in (6). Specifically
But, from the discussion in Section 2.6.1, we already know that . On the other hand, from the discussion in Section 2.5.1 that led to (7), we can say that
As a result
where for the last two asymptotic bounds we have used the constraints on and . The Corollary is now a direct consequence of the inequality in (17).
Acknowledgments. This work was partially funded by the NSF grant No. 1836914.
References
- [1] N. Alon and J. H. Spencer, The Probabilistic Method, 2nd edn., Wiley, 2004.
- [2] S. Bau and A. F. Beardon, The metric dimension of metric spaces, Comput. Methods Funct. Theory 13 (2013), 295–305.
- [3] A. F. Beardon, Resolving the Hypercube, Discrete Applied Mathematics 161 (2013), 1882–1887.
- [4] G. Chartrand et al., Resolvability in graphs and the metric dimension of a graph, Discrete Applied Mathematics 105 (2000), no. 1, 99–113.
- [5] P. Erdős, F. Harary, and W. T. Tutte, On the dimension of a graph, Mathematika 12 (1965), no. 2, 118–122.
- [6] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness, WH Freeman and Company, New York, 1979.
- [7] G. Gilbert, Distance between sets, Nature 239 (1972), 174.
- [8] F. Harary and R. A. Melter, On the metric dimension of a graph, Ars Combinatoria 2 (1976), 191–195.
- [9] M. Hauptmann, R. Schmied, and C. Viehmann, Approximation complexity of metric dimension problem, Journal of Discrete Algorithms 14 (2012), 214–222.
- [10] P. Jaccard, Étude comparative de la distribution florale dans une portion des alpes et du jura, Bull. Société Vaudoise des Sciences Naturelles 37 (1901), no. 142, 547–579.
- [11] R. M. Karp, Reducibility among combinatorial problems, Complexity of Computer Computations, Springer, 1972, 85–103.
- [12] D. E. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd edn., Addison-Wesley, 1997.
- [13] S. Kosub, A note on the triangle inequality for the jaccard distance, Pattern Recognition Letters 120 (2019), 36–38.
- [14] D. Kuziak and I. G. Yero, Metric dimension related parameters in graphs: A survey on combinatorial, computational and applied results, arXiv preprint arXiv:2107.04877 (2021).
- [15] L. Laird et al., Resolvability of Hamming graphs, SIAM Journal on Discrete Mathematics 34 (2020), no. 4, 2063–2081.
- [16] G. Murphy, A metric basis characterization of Euclidean space, Pac. J. Math. 60 (1975), 159–163.
- [17] A. Paradise, Quantitative encoding of bags-of-words for sentiment and sarcasm detection in textual data, Master’s thesis, The University of Colorado, 2024.
- [18] I. Pinelis, Exact lower and upper bounds on the incomplete gamma function, Mathematical Inequalities & Applications 23 (2020), no. 4, 1261–1278.
- [19] H. Robbins, A remark on Stirling’s formula, The American Mathematical Monthly 62 (1955), no. 1, 26–29.
- [20] P. E. Ruth and M. E. Lladser, Levenshtein graphs: Resolvability, automorphisms & determining sets, Discrete Mathematics 346 (2023), no. 5, 113310.
- [21] P. J. Slater, Leaves of trees, Congressus Numerantium 14 (1975), 549–559.
- [22] R. C. Tillquist, R. M. Frongillo, and M. E. Lladser, Metric Dimension, Scholarpedia 14 (2019), no. 10, 53881. Revision #190769.
- [23] R. C. Tillquist, R. M. Frongillo, and M. E. Lladser, Getting the lay of the land in discrete space: A survey of metric dimension and its applications, SIAM Review 65 (2023), no. 4, 919–962.
- [24] R. C. Tillquist and M. E. Lladser, Low-dimensional representation of genomic sequences, Journal of Mathematical Biology 79 (2019), no. 1, 1–29.