Metric Dimension and Resolvability of Jaccard Spaces

Manuel E. Lladser (corresponding author, manuel.lladser@colorado.edu), Department of Applied Mathematics, University of Colorado, Boulder, USA
Alexander J. Paradise, Department of Applied Mathematics, University of Colorado, Boulder, USA
Abstract

A subset of points in a metric space is said to resolve it if each point in the space is uniquely characterized by its distance to each point in the subset. In particular, resolving sets can be used to represent points in abstract metric spaces as Euclidean vectors. Importantly, due to the triangle inequality, points close by in the space are represented as vectors with similar coordinates, which may find applications in classification problems of symbolic objects under suitably chosen metrics. In this manuscript, we address the resolvability of Jaccard spaces, i.e., metric spaces of the form $(2^{X},\text{Jac})$, where $2^{X}$ is the power set of a finite set $X$, and Jac is the Jaccard distance between subsets of $X$. Specifically, for different $a,b\in 2^{X}$, $\text{Jac}(a,b)=|a\Delta b|/|a\cup b|$, where $|\cdot|$ denotes size (i.e., cardinality) and $\Delta$ denotes the symmetric difference of sets. We combine probabilistic and linear algebra arguments to construct highly likely but nearly optimal (i.e., of minimal size) resolving sets of $(2^{X},\text{Jac})$. In particular, we show that the metric dimension of $(2^{X},\text{Jac})$, i.e., the minimum size of a resolving set of this space, is $\Theta(|X|/\ln|X|)$. In addition, we show that a much smaller subset of $2^{X}$ suffices to resolve, with high probability, all different pairs of subsets of $X$ of cardinality at most $\sqrt{|X|}/\ln|X|$, up to a factor.

Keywords. Jaccard distance, metric dimension, metric space, multilateration, resolving set

1 Introduction

A metric space is an ordered pair of the form $(X,d)$, where $X$ is a nonempty set and $d:X\times X\to\mathbb{R}$ is a function satisfying $d(x,y)=d(y,x)\geq 0$, $d(x,y)=0$ if and only if $x=y$, and $d(x,y)\leq d(x,z)+d(z,y)$, for all $x,y,z\in X$. In particular, $d$ is non-negative, symmetric, and satisfies the triangle inequality. We say the metric space is finite when $|X|<+\infty$.

Resolvability extends the concept of trilateration of the plane to general metric spaces; in particular, it includes the vertex set of connected graphs endowed with shortest path distances between vertices, which is where the concept originated [5, 21, 8]. In a metric space $(X,d)$, a non-empty set $R=\{r_{i}:i\in I\}\subset X$, with $I=\big\{1,\ldots,|R|\big\}$, is said to resolve it when the transformation

$$d(x|R):=\big(d(x,r_{i})\big)_{i\in I},\text{ for each }x\in X, \tag{1}$$

is one-to-one. In particular, $d(\cdot|R)$ uniquely encodes points in $X$ as $|R|$-dimensional real vectors; and, owing to the triangle inequality, proximate points in $X$ are encoded as vectors with similar coordinates, since $|d(x,r)-d(y,r)|\leq d(x,y)$ for all $x,y\in X$ and $r\in R$. Resolving sets thus enable sound embeddings of metric spaces into Euclidean ones, which can be useful for generating numerical features of symbolic objects in statistical and machine learning tasks like regression or classification [24, 20].

One can think of a resolving set as a collection of “landmarks” in a metric space that uniquely identify the “location” of any point in that space by its distance to those landmarks. In that regard, resolvability serves as a form of “multi-lateration” of the space, similar to tri-lateration, although more than three landmarks may be needed to resolve a given metric space.
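As a minimal sketch of the definition above, the following Python snippet encodes the points of a finite metric space by their distances to a list of landmarks, as in (1), and checks whether that encoding is one-to-one. The helper names `encode` and `resolves`, and the four-point path used as an example, are illustrative assumptions and do not come from the manuscript.

```python
from typing import Callable, Hashable, Sequence, Tuple


def encode(x: Hashable, landmarks: Sequence[Hashable],
           d: Callable[[Hashable, Hashable], float]) -> Tuple[float, ...]:
    """Return the vector d(x|R) of distances from x to each landmark."""
    return tuple(d(x, r) for r in landmarks)


def resolves(points: Sequence[Hashable], landmarks: Sequence[Hashable],
             d: Callable[[Hashable, Hashable], float]) -> bool:
    """True when distinct points always receive distinct distance vectors."""
    codes = {encode(x, landmarks, d) for x in points}
    return len(codes) == len(set(points))


# Hypothetical example: four points on a line with d(x, y) = |x - y|.
points = [0, 1, 2, 3]
dist = lambda x, y: abs(x - y)

print(resolves(points, [0], dist))  # True: an endpoint alone resolves the path
print(resolves(points, [1], dist))  # False: 0 and 2 are both at distance 1 from 1
```

The brute-force check visits every point once per landmark, so it is only practical when the whole space can be enumerated.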

Irrespective of the metric space, resolving sets always exist, although they are never unique in non-trivial settings. This is because $X$ always resolves $(X,d)$, and if $R$ resolves $(X,d)$ and $S\supset R$, then $S$ also resolves it. So, finding a resolving set is straightforward. In contrast, finding a resolving set with the smallest possible size is usually challenging; in fact, it is an NP-complete problem in arbitrary finite metric spaces [11, 6]. Minimizing the size of a resolving set is nonetheless crucial to embedding the points in $X$ into a low-dimensional Euclidean space using transformations of the form (1). This motivates the notion of metric dimension, which is the size of the smallest resolving set of a metric space $(X,d)$, denoted from now on as $\beta(X,d)$.

For a concise overview of resolvability and metric dimension in the context of graph theory, see [22]. For a comprehensive review of these and related concepts, see instead [23, 14].

A very limited number of studies have addressed the resolvability of non-graphical metric spaces in the literature [16, 2], as most efforts have focused on finite graphs [4]. Nevertheless, spaces with metric dimensions 1 or 2 have been characterized under general topological assumptions [16, 2]. It is also known that the metric dimension of a $k$-dimensional subspace of $\mathbb{R}^{n}$ with respect to the Euclidean distance is $(k+1)$; in particular, $(\mathbb{R}^{n},\|\cdot\|_{2})$ has metric dimension $(n+1)$ [2]. The hypersphere $(\mathbb{S}^{n},\|\cdot\|_{2})$ also has metric dimension $(n+1)$. Additionally, the metric dimension of the hyperbolic space $\mathbb{H}^{n}$ with respect to the metric $d(x,y):=\int_{x}^{y}dx/x_{n}$, for all $x,y\in\mathbb{H}^{n}$, is $(n+1)$ [2]. Likewise, the metric dimension of the $n$-dimensional unit ball $\mathbb{B}^{n}$ with respect to the metric $d(x,y):=\int_{x}^{y}2\,|dx|/(1-\|x\|^{2})$, with $x,y\in\mathbb{B}^{n}$, is $(n+1)$ [2].

In contrast, the systematic study of the resolvability and metric dimension of non-graphical, finite, metric spaces is essentially unexplored. In this paper, we study the resolvability of finite Jaccard metric spaces, i.e., metric spaces of the form $(2^{X},\text{Jac})$, where $2^{X}$ denotes the power set of a finite set $X$, and Jac is the Jaccard distance between subsets of $X$ [10]. Namely, for all $a,b\in 2^{X}$,

$$\text{Jac}(a,b):=\begin{cases}\frac{|a\Delta b|}{|a\cup b|},&a\neq b;\\ 0,&a=b.\end{cases}$$

Jac is a metric in $2^{X}$ [7, 13]. (In the literature, for distinct $a,b\in 2^{X}$, the quantity $1-\text{Jac}(a,b)=|a\cap b|/|a\cup b|$ is referred to as the Jaccard similarity. This index is widely used in fields such as information retrieval, data mining, and natural language processing, among many others.)
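The definition translates directly into a few lines of Python. The sketch below is an illustration only: the function name `jaccard` and the use of exact `Fraction` arithmetic (to avoid floating-point ties when comparing distances) are choices made here, not part of the manuscript.

```python
from fractions import Fraction
from typing import AbstractSet


def jaccard(a: AbstractSet, b: AbstractSet) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(a, a) = 0 by convention."""
    if a == b:
        return Fraction(0)  # also covers a = b = ∅, where the ratio would be 0/0
    return Fraction(len(a ^ b), len(a | b))


a, b = frozenset({1, 2}), frozenset({2, 3})
print(jaccard(a, b))      # 2/3
print(1 - jaccard(a, b))  # 1/3, the Jaccard similarity |a ∩ b| / |a ∪ b|
```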

Given that $X$ is finite in our setting, we may, in principle, estimate $\beta(2^{X},\text{Jac})$ and find non-trivial resolving sets with the so-called Information Content Heuristic (ICH) [9]. In a general setting, the input of this algorithm is the (symmetric) distance matrix between all pairs of points in a metric space, and the output is a subset of columns that resolve it, which is determined greedily through an entropy maximization procedure. Unfortunately, in the context of Jaccard spaces, the ICH is infeasible even for moderate values of $|X|$ because of its $O(2^{3|X|})$ time complexity.

Nevertheless, besides being of theoretical interest, learning to resolve Jaccard spaces optimally, or nearly optimally, may find applications in, e.g., lexicon-based approaches to natural language processing (NLP). In the most basic implementation of this idea, $X$ would be the set of all words in a language, and sentences would be represented as subsets of $X$ (a.k.a. bags of words). The Jaccard distance is then a natural way to assess the similarity of sentences based on the words used, and a resolving set would induce a numerical encoding of sentences, mapping sentences with similar word content into vectors with similar coordinates, potentially providing low-dimensional feature vectors to learn to classify or regress sentences based on their lexicon [17].

1.1 Main Results

In what remains of this manuscript, $X$ is assumed to be a finite non-empty set.

In this section, we outline our key findings, with expanded statements and proofs provided in Section 2.

From now on, the Jaccard distance is the reference metric in $2^{X}$; in particular, e.g., statements like “$R$ resolves $2^{X}$” mean that “$R\subset 2^{X}$ resolves $(2^{X},\text{Jac})$.” We also say that $R$ resolves $a,b\in 2^{X}$ when there exists $r\in R$ such that $\text{Jac}(a,r)\neq\text{Jac}(b,r)$.

We first provide a necessary condition for a set $R$ to resolve $2^{X}$.

Proposition 1.1.

If $R$ resolves $2^{X}$ then $R$ separates the distinct elements of $X$, and it covers all but possibly one element in $X$.

The proof of the proposition can be found in Section 2.1. We note that these properties are necessary but not sufficient. For instance, if $X=\{1,2,3,4\}$ and $R=\big\{\{1,2\},\{1,3\},\{1,4\}\big\}$ then $R$ separates different elements in $X$ and also covers it. Nevertheless, $R$ is not resolving because $\text{Jac}(\{1\}|R)=(1/2,1/2,1/2)=\text{Jac}(X|R)$. This counterexample can be easily generalized to sets $X$ of arbitrary size.
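The counterexample can be checked mechanically. The following sketch recomputes both encodings with exact fractions; the helper names `jaccard` and `encode` are illustrative and not taken from the manuscript.

```python
from fractions import Fraction


def jaccard(a, b):
    """Jaccard distance between two finite sets (0 when they coincide)."""
    return Fraction(0) if a == b else Fraction(len(a ^ b), len(a | b))


def encode(c, R):
    """Vector Jac(c|R) of Jaccard distances from c to each set in R."""
    return tuple(jaccard(c, r) for r in R)


X = frozenset({1, 2, 3, 4})
R = [frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 4})]

# R separates X: every pair of distinct elements is split by some r in R.
separates = all(any((x in r) != (y in r) for r in R)
                for x in X for y in X if x != y)
# R covers X: the union of its members is all of X.
covers = frozenset().union(*R) == X

print(separates, covers)                          # True True
print(encode(frozenset({1}), R) == encode(X, R))  # True: {1} and X collide
```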

Next, we provide a lower bound on the size of any resolving subset of $2^{X}$; in particular, this is also a lower bound for $\beta(2^{X},\text{Jac})$.

Proposition 1.2.

If $R$ resolves $2^{X}$ then

$$|R|\geq\frac{|X|(\ln 2)\left(1+o(1)\right)-2\ln\left(|X|/2\right)-1}{\ln\left(|X|/2+1\right)}\sim\frac{|X|\ln 2}{\ln|X|}.$$

The proof of the proposition can be found in Section 2.2.

To state our two main results, we require the following definition.

Definition 1.1.

A random $r\in 2^{X}$ is said to have a Binomial$(X,1/2)$ distribution, in which case we write $r\sim\text{Binomial}(X,1/2)$, when $\mathbb{P}(x\in r)=1/2$ for each $x\in X$, and the events $[x\in r]$ with $x\in X$ are independent.

Clearly, if $r\sim\text{Binomial}(X,1/2)$ then $|r|\sim\text{Binomial}(|X|,1/2)$; namely, $\mathbb{P}\big(|r|=k\big)=\frac{1}{2^{|X|}}\binom{|X|}{k}$ for $k=0,\ldots,|X|$.
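Sampling from this distribution amounts to one fair coin flip per element of $X$. The sketch below is a plain illustration of Definition 1.1; the function name `sample_binomial_subset`, the seed, and the choice $|X|=20$ are arbitrary assumptions made here.

```python
import random
from typing import FrozenSet, Hashable, Iterable


def sample_binomial_subset(X: Iterable[Hashable],
                           rng: random.Random) -> FrozenSet[Hashable]:
    """Include each element of X independently with probability 1/2."""
    return frozenset(x for x in X if rng.random() < 0.5)


rng = random.Random(0)  # the seed is an arbitrary choice for reproducibility
X = frozenset(range(20))
sizes = [len(sample_binomial_subset(X, rng)) for _ in range(2000)]
print(sum(sizes) / len(sizes))  # empirical mean of |r|; expected value is |X|/2 = 10
```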

Theorem 1.1.

If $k\geq\frac{2\ln(2e)|X|}{\ln(|X|/2)}$ and $r_{1},\ldots,r_{k}\sim\text{Binomial}(X,1/2)$ are independent and identically distributed (i.i.d.), then, for each $x\in X$, $R:=\big\{\emptyset,\{x\},X\setminus\{x\},r_{1},\ldots,r_{k}\big\}$ resolves $2^{X}$, with overwhelmingly high probability, as $|X|\to\infty$.

The proof of the theorem can be found in Section 2.5 and relies on auxiliary results in Sections 2.3 and 2.4.
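To make the construction in Theorem 1.1 concrete, the sketch below builds the candidate set for a small ground set and tests it by brute force. All names (`theorem_k`, `candidate_set`, `resolves_power_set`) are hypothetical, the random draws are kept in a list (so repeated draws are possible), and the theorem is asymptotic: nothing here guarantees that a particular draw resolves $2^{X}$ for small $|X|$.

```python
import math
import random
from fractions import Fraction
from itertools import combinations


def jaccard(a, b):
    """Jaccard distance between two finite sets (0 when they coincide)."""
    return Fraction(0) if a == b else Fraction(len(a ^ b), len(a | b))


def power_set(X):
    """All subsets of X as frozensets, ordered by size."""
    items = sorted(X)
    return [frozenset(c) for m in range(len(items) + 1)
            for c in combinations(items, m)]


def theorem_k(n):
    """Smallest integer k with k >= 2 ln(2e) n / ln(n/2); requires n > 2."""
    return math.ceil(2 * math.log(2 * math.e) * n / math.log(n / 2))


def candidate_set(X, x, rng):
    """The three fixed sets of Theorem 1.1 followed by k random subsets."""
    k = theorem_k(len(X))
    fixed = [frozenset(), frozenset({x}), X - {x}]
    draws = [frozenset(y for y in X if rng.random() < 0.5) for _ in range(k)]
    return fixed + draws


def resolves_power_set(X, R):
    """Brute-force injectivity test of c -> Jac(c|R); feasible for small |X|."""
    codes = {tuple(jaccard(c, r) for r in R) for c in power_set(X)}
    return len(codes) == 2 ** len(X)


X = frozenset(range(8))
R = candidate_set(X, x=0, rng=random.Random(0))
print(len(R), resolves_power_set(X, R))
```

The brute-force test enumerates all $2^{|X|}$ subsets, so it is meant only as a sanity check on toy instances.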

In conjunction, Proposition 1.2 and Theorem 1.1 imply that

$$\frac{(\ln 2)|X|}{\ln(|X|/2)}\left(1+o(1)\right)\leq\beta(2^{X},\text{Jac})\leq\frac{2\ln(2e)|X|}{\ln(|X|/2)}\left(1+o(1)\right),$$

which characterizes the metric dimension of $2^{X}$ with respect to the Jaccard distance within a factor of $\frac{2\ln(2e)}{\ln 2}\approx 4.9$. In particular, we can assert the following.

Corollary 1.1.

$\beta(2^{X},\text{Jac})=\Theta\left(\frac{|X|}{\ln|X|}\right)$, as $|X|\to\infty$.

It turns out that, for any $x\in X$, the set $\big\{\emptyset,\{x\},X\setminus\{x\}\big\}$ resolves all pairs of subsets of $X$ with different cardinalities (see Lemma 2.1 ahead). So the crux of the proof of Theorem 1.1 lies in showing that the sets in $\{r_{1},\ldots,r_{k}\}$ resolve all possible pairs $a,b\in 2^{X}$ of equal size, with overwhelmingly high probability, when $|X|$ is large. We demonstrate this in Section 2.5.
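The claim about the three fixed sets can be verified exhaustively on a small ground set. The following sketch does so for $|X|=5$ and also shows a pair of equal size that the three sets fail to resolve; the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a, b):
    """Jaccard distance between two finite sets (0 when they coincide)."""
    return Fraction(0) if a == b else Fraction(len(a ^ b), len(a | b))


def power_set(X):
    """All subsets of X as frozensets."""
    items = sorted(X)
    return [frozenset(c) for m in range(len(items) + 1)
            for c in combinations(items, m)]


def pair_resolved(a, b, R):
    """True when some r in R sees a and b at different Jaccard distances."""
    return any(jaccard(a, r) != jaccard(b, r) for r in R)


def lemma_holds(X):
    """Check every x in X and every pair of subsets with different sizes."""
    subsets = power_set(X)
    for x in X:
        R = [frozenset(), frozenset({x}), X - {x}]
        for a, b in combinations(subsets, 2):
            if len(a) != len(b) and not pair_resolved(a, b, R):
                return False
    return True


X = frozenset(range(1, 6))
print(lemma_holds(X))  # True

# Equal-size pairs are a different matter: with x = 1, {2} and {3} collide.
R = [frozenset(), frozenset({1}), X - {1}]
print(pair_resolved(frozenset({2}), frozenset({3}), R))  # False
```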

In the context of potential NLP applications outlined in the Introduction, it is unclear whether the highly likely resolving set proposed in Theorem 1.1 is of any practical value for distinguishing between bags-of-words of different cardinalities. This is because the numerical encoding in (1) based on this set might differentiate such pairs solely based on the presence or absence of a single word or token, which seems too coarse for practical use in NLP classification (or regression) problems. Our following result addresses this issue by proposing a less contrived set, which is likely to resolve all pairs of bags-of-words of different cardinalities. Its proof can be found in Section 2.6.

Theorem 1.2.

Let $\epsilon>0$. If $k\geq(4+\epsilon)\sqrt{|X|}$ and $r_{1},\ldots,r_{k}\sim\text{Binomial}(X,1/2)$ are i.i.d., then $R:=\big\{r_{1},r_{1}^{c},\ldots,r_{k},r_{k}^{c}\big\}$ resolves all pairs of subsets of $X$ of different size, with overwhelmingly high probability, as $|X|\to\infty$.
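A minimal sketch of the set in Theorem 1.2 follows: each random draw is stored together with its complement. The function name `complement_closed_landmarks` and the values $|X|=100$, $\epsilon=0.5$ are assumptions chosen for illustration.

```python
import math
import random


def complement_closed_landmarks(X, eps, rng):
    """Draw k = ceil((4 + eps) * sqrt(|X|)) random subsets and add complements."""
    k = math.ceil((4 + eps) * math.sqrt(len(X)))
    R = []
    for _ in range(k):
        r = frozenset(x for x in X if rng.random() < 0.5)
        R.extend([r, X - r])  # r_i followed by its complement r_i^c
    return R


X = frozenset(range(100))
R = complement_closed_landmarks(X, eps=0.5, rng=random.Random(0))
print(len(R))  # 90, i.e. 2 * ceil(4.5 * 10)
```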

As expected, the lower bound for the size of the set $R$ in Theorem 1.2 is asymptotically negligible compared to the one in Theorem 1.1; after all, the former set is only required to resolve pairs of subsets of $X$ with different cardinalities, which, as explained earlier, can be accomplished using just three subsets of $X$ (i.e., the empty set, and any singleton and its complement). Nevertheless, in practical situations (for instance, when representing social media posts as bags-of-words), more often than not, a random pair of posts would be associated with bags-of-words of different cardinality. In particular, in terms of the numerical encoding in (1), Theorem 1.2 suggests that $O(\sqrt{|X|})$ Jaccard distances, as opposed to $\Theta\left(|X|/\ln|X|\right)$, should suffice in practice to encode posts effectively when the reference lexicon $X$ is sufficiently large. Our following result makes this intuition precise at the expense of limiting the size of bags-of-words one wishes to resolve.

Corollary 1.2.

Let $0<\epsilon<1$. If $k\geq(4+\epsilon)\sqrt{|X|}$ and $r_{1},\ldots,r_{k}\sim\text{Binomial}(X,1/2)$ are i.i.d., then the set $R:=\big\{r_{1},r_{1}^{c},\ldots,r_{k},r_{k}^{c}\big\}$ resolves all different pairs of subsets of $X$ of size at most $\frac{(1-\epsilon)(\ln\pi)\sqrt{|X|}}{\ln|X|}$, with overwhelmingly high probability, as $|X|\to\infty$.
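To get a feel for the quantities in Corollary 1.2, the sketch below evaluates the number of random draws and the size cap for a hypothetical lexicon of $10^{6}$ words with $\epsilon=0.5$. The function names and the numerical inputs are assumptions for illustration; the corollary itself is an asymptotic statement.

```python
import math


def corollary_k(n, eps):
    """Smallest integer k with k >= (4 + eps) * sqrt(n)."""
    return math.ceil((4 + eps) * math.sqrt(n))


def corollary_size_cap(n, eps):
    """(1 - eps) * ln(pi) * sqrt(n) / ln(n): largest subset size covered."""
    return (1 - eps) * math.log(math.pi) * math.sqrt(n) / math.log(n)


n, eps = 10**6, 0.5
print(corollary_k(n, eps))         # 4500 random draws, so R holds 9000 sets
print(corollary_size_cap(n, eps))  # roughly 41.4
```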

2 Technical Results and Proofs

2.1 Necessary Conditions for Resolvability

In this section, we prove Proposition 1.1. Specifically, suppose that $R$ resolves $2^{X}$. Next, we show that the following properties hold:

  (i) For all $x_{1},x_{2}\in X$ with $x_{1}\neq x_{2}$, there exists $r\in R$ such that either $x_{1}\in r$ and $x_{2}\notin r$, or $x_{1}\notin r$ and $x_{2}\in r$.

  (ii) If $\emptyset\notin R$, then $R$ covers $X$, i.e., $\bigcup_{r\in R}r=X$.

  (iii) If $\emptyset\in R$, then $R$ covers $X$, or there exists $x\in X$ such that $\bigcup_{r\in R}r=X\setminus\{x\}$.

To show property (i), suppose by contradiction that there are distinct $x_{1},x_{2}\in X$ such that, for each $r\in R$, $\{x_{1},x_{2}\}\subset r$ or $\{x_{1},x_{2}\}\subset X\setminus r$. In the first case, $\text{Jac}(\{x_{1}\},r)=1-1/|r|=\text{Jac}(\{x_{2}\},r)$, and in the second case, $\text{Jac}(\{x_{1}\},r)=1=\text{Jac}(\{x_{2}\},r)$. In either case, $R$ could not possibly be resolving, which shows the first property.

To show property (ii), suppose that there is $x\in X$ which does not belong to any of the sets in $R$. Since $\emptyset\notin R$, each $r\in R$ is non-empty, and hence $\text{Jac}(\{x\},r)=1=\text{Jac}(\emptyset,r)$ for each $r\in R$, which is not possible because $R$ is resolving. This shows the second property.

Finally, to show property (iii), suppose there are distinct $x_{1},x_{2}\in X$ which do not belong to any of the sets in $R$. Then, for each $r\in R$, $\text{Jac}(\{x_{1}\},r)=1=\text{Jac}(\{x_{2}\},r)$, which is not possible and completes the proof of the proposition.
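As a sanity check of Proposition 1.1, the sketch below enumerates every non-empty collection $R\subset 2^{X}$ for $|X|=3$ and confirms that each resolving collection satisfies properties (i) to (iii). This exhaustive search is an illustration added here (it is not how the proposition is proved), and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a, b):
    """Jaccard distance between two finite sets (0 when they coincide)."""
    return Fraction(0) if a == b else Fraction(len(a ^ b), len(a | b))


def power_set(X):
    """All subsets of X as frozensets."""
    items = sorted(X)
    return [frozenset(c) for m in range(len(items) + 1)
            for c in combinations(items, m)]


def resolves(subsets, R):
    """True when c -> Jac(c|R) is one-to-one on the given subsets."""
    codes = {tuple(jaccard(c, r) for r in R) for c in subsets}
    return len(codes) == len(subsets)


def separates(X, R):
    """Property (i): each pair of distinct elements is split by some r."""
    return all(any((x in r) != (y in r) for r in R)
               for x, y in combinations(sorted(X), 2))


def uncovered(X, R):
    """Elements of X that belong to no member of R."""
    return X - frozenset().union(*R)


def proposition_holds(X):
    """Check properties (i)-(iii) for every resolving collection R."""
    subsets = power_set(X)
    for size in range(1, len(subsets) + 1):
        for R in combinations(subsets, size):
            if resolves(subsets, R):
                limit = 1 if frozenset() in R else 0  # (iii) versus (ii)
                if not separates(X, R) or len(uncovered(X, R)) > limit:
                    return False
    return True


print(proposition_holds(frozenset({1, 2, 3})))  # True
```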

2.2 Metric Dimension Lower Bound

In this section, we prove Proposition 1.2.

Suppose that $R$ resolves $2^{X}$. If $c,r\subset X$ are not both empty, then by the Inclusion-Exclusion Principle, $\text{Jac}(c,r)=1-\frac{|c\cap r|}{|c|+|r|-|c\cap r|}$. For fixed $r$ and $|c|=n$, this distance is determined by $|c\cap r|$ and, since $0\leq|c\cap r|\leq|c|$, it takes at most $(n+1)$ different values. Hence, the range of $\text{Jac}(\cdot|R)$, when restricted to sets $c$ such that $|c|=n$, has size at most $(n+1)^{|R|}$. In particular, due to the Pigeonhole Principle, we must have $\binom{|X|}{n}\leq(n+1)^{|R|}$, i.e.:

$$|R|\geq\max_{0<n<|X|}\frac{\ln\binom{|X|}{n}}{\ln(n+1)}\geq\frac{\ln\binom{|X|}{\lfloor|X|/2\rfloor}}{\ln\big(|X|/2+1\big)}. \tag{2}$$

The right-most lower bound above should be a reasonable estimate of the best one (based on the Pigeonhole Principle) because $\ln(n+1)$ is a slowly increasing function of $n$, and $\binom{|X|}{n}$, with $0<n<|X|$, is maximized at $n=\lfloor|X|/2\rfloor$ (equivalently, $n=\lceil|X|/2\rceil$). To make the last numerator above more explicit, we use that [12, Exercise 24, §1.2.5]:

$$\frac{n^{n}}{e^{n-1}}\leq n!\leq\frac{n^{n+1}}{e^{n-1}},\text{ for }n\geq 1.$$

In particular, if $n=\lfloor|X|/2\rfloor$ then

$$\begin{aligned}
\ln\binom{|X|}{\lfloor|X|/2\rfloor}&\geq|X|\ln|X|-(n+1)\ln(n)-\big(|X|-n+1\big)\ln\big(|X|-n\big)-1\\
&=|X|\left\{\ln 2+\frac{n}{|X|}\ln\left(\frac{|X|}{2n}\right)+\frac{|X|-n}{|X|}\ln\left(\frac{|X|}{2|X|-2n}\right)\right\}-\ln\Big(n\big(|X|-n\big)\Big)-1\\
&\geq|X|\big\{\ln 2+o(1)\big\}-2\ln\left(\frac{|X|}{2}\right)-1.
\end{aligned}$$

The proposition is now direct from (2).
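
As a numerical sanity check of the two displays above, the following Python sketch compares the factorial bounds and the first lower bound for $\ln\binom{|X|}{\lfloor|X|/2\rfloor}$ against exact values. The function names, the use of `size` for $|X|$, and the floating-point tolerance are illustrative assumptions, not part of the original argument.

```python
import math


def factorial_bounds_hold(n: int) -> bool:
    """Check n^n / e^(n-1) <= n! <= n^(n+1) / e^(n-1) after taking logarithms."""
    log_factorial = math.lgamma(n + 1)           # ln(n!)
    log_lower = n * math.log(n) - (n - 1)        # ln(n^n / e^(n-1))
    log_upper = (n + 1) * math.log(n) - (n - 1)  # ln(n^(n+1) / e^(n-1))
    tol = 1e-9                                   # slack for floating-point rounding
    return log_lower <= log_factorial + tol and log_factorial <= log_upper + tol


def log_central_binomial(size: int) -> float:
    """Exact value of ln C(|X|, floor(|X|/2)), with size standing for |X|."""
    return math.log(math.comb(size, size // 2))


def log_central_binomial_lower_bound(size: int) -> float:
    """First lower bound in the display above, with n = floor(|X|/2); needs size >= 2."""
    n = size // 2
    return (size * math.log(size) - (n + 1) * math.log(n)
            - (size - n + 1) * math.log(size - n) - 1)


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, log_central_binomial_lower_bound(size), log_central_binomial(size))
```

The comparison is done in logarithms so that large values of $|X|$ do not overflow floating-point arithmetic.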

2.3 Resolving Subsets of $X$ of Different Cardinalities

Lemma 2.1.

For all $x\in X$ and all $a,b\in 2^{X}$, if $|a|\neq|b|$ then $a$ and $b$ are resolved by $R=\big\{\emptyset,\{x\},X\setminus\{x\}\big\}$.

Proof.

Without any loss of generality assume that $|X|>1$. Fix an $x\in X$ and note that for each $c\in 2^{X}$:

$$\begin{aligned}
\text{Jac}(c,\{x\}) &= 1-\begin{cases}\frac{1}{|c|}, & x\in c;\\ 0, & x\notin c;\end{cases}\\
\text{Jac}(c,X\setminus\{x\}) &= 1-\begin{cases}\frac{|c|-1}{|X|}, & x\in c;\\ \frac{|c|}{|X|-1}, & x\notin c.\end{cases}
\end{aligned}$$

Define $R:=\big\{\emptyset,\{x\},X\setminus\{x\}\big\}$. Consider $a,b\in 2^{X}$ such that $|a|\neq|b|$, and suppose that $\text{Jac}(a,r)=\text{Jac}(b,r)$, for all $r\in R$. In particular, $a$ cannot be empty; otherwise, $\text{Jac}(b,\emptyset)=\text{Jac}(a,\emptyset)=0$, implying that $b=\emptyset$ because Jac is a metric. However, the latter is not possible because $|a|\neq|b|$. Likewise, $b$ cannot be empty.

Moreover, if $x\in a$, then $\text{Jac}(b,\{x\})=1-|a|^{-1}<1$. In particular, $x$ must be in $b$ as otherwise $\text{Jac}(b,\{x\})=1$, which is not possible. But then, $1-|b|^{-1}=1-|a|^{-1}$, i.e., $|b|=|a|$, which is not possible either. Instead, if $x\notin a$ and $x\notin b$ then, because $\text{Jac}(a,X\setminus\{x\})=\text{Jac}(b,X\setminus\{x\})$, we must have that $1-\frac{|a|}{|X|-1}=1-\frac{|b|}{|X|-1}$, i.e., $|a|=|b|$, which is again not possible. Hence, there has to be an $r\in R$ such that $\text{Jac}(a,r)\neq\text{Jac}(b,r)$, implying that $R$ resolves $a$ and $b$. The same conclusion applies if $x\in b$, which completes the proof of the lemma. ∎
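
Lemma 2.1 can be checked exhaustively on small ground sets. The Python sketch below is only an illustration under stated assumptions: the ground set is taken to be $\{0,\ldots,|X|-1\}$, the helper names are hypothetical, and the convention $\text{Jac}(\emptyset,\emptyset)=0$ is made explicit. Exact rational arithmetic avoids spurious floating-point ties.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(∅, ∅) = 0."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def power_set(ground):
    """All subsets of the ground set, as frozensets."""
    items = list(ground)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def lemma_2_1_holds(size: int) -> bool:
    """Check that {∅, {x}, X \\ {x}} resolves every pair with |a| != |b|."""
    ground = frozenset(range(size))
    subsets = power_set(ground)
    for x in ground:
        resolving = [frozenset(), frozenset({x}), ground - {x}]
        for a, b in combinations(subsets, 2):
            if len(a) != len(b) and all(
                jaccard(a, r) == jaccard(b, r) for r in resolving
            ):
                return False  # a and b would collide on all of R
    return True


if __name__ == "__main__":
    print([lemma_2_1_holds(n) for n in range(1, 6)])
```

An exhaustive check of this kind is feasible only for small $|X|$, since the number of pairs grows like $4^{|X|}$.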

2.4 Inner product Characterization of Equidistant Sets

Two sets $a,b\in 2^{X}$ are said to be equidistant from an $r\in 2^{X}$ when $\text{Jac}(a,r)=\text{Jac}(b,r)$. In this case, $r$ is not useful to resolve $a$ from $b$ when $a\neq b$, and we say that $a$ and $b$ collide in terms of their Jaccard distance to $r$.

In this section, we characterize collisions in linear algebra terms by representing subsets of X𝑋Xitalic_X as binary vectors. We note that linear algebra characterizations have been used to study the metric dimension of Hypercube graphs [3] and Hamming graphs [15].

In what follows, we represent elements in $2^{X}$ as binary vectors of dimension $|X|$. Namely, for $a\in 2^{X}$, $a(x)=1$ when $x\in a$, and $a(x)=0$ when $x\notin a$. (For instance, $X$ is represented by a vector of all ones, whereas $\emptyset$ by a vector of all zeros.) Additionally, for $r\in 2^{X}$ and $z\in\mathbb{R}^{|X|}$, $\langle r,z\rangle$ denotes the inner product between the binary vector associated with $r$ and the vector $z$. Namely:

$$\langle r,z\rangle:=\sum_{x\in X}r(x)\cdot z(x).$$

In what follows, we use product notation to denote set intersections. Namely, if $a,b\in 2^{X}$ then $ab:=(a\cap b)$.

The next result characterizes equidistant sets in terms of inner products. This characterization will be used in Section 2.5.1, in the proof of Theorem 1.1, to assess the probability that two different subsets of $X$, of the same size, collide in terms of their distance to a random subset of $X$.

Lemma 2.2.

Let $a,b,r\in 2^{X}$ and define the vector $z:=\big(|r|+|b|\big)\,a-\big(|r|+|a|\big)\,b$. If $\text{Jac}(a,r)=\text{Jac}(b,r)$ then $\langle r,z\rangle=0$. Conversely, if $r\neq\emptyset$ and $\langle r,z\rangle=0$ then $\text{Jac}(a,r)=\text{Jac}(b,r)$.

Proof.

We show first that

$$\langle r,z\rangle=\big(|r|+|b|\big)\cdot|ar|-\big(|r|+|a|\big)\cdot|br|.\tag{3}$$

For this, observe that $|ar|=\langle a,r\rangle$ and $|br|=\langle b,r\rangle$; from which the identity in equation (3) is immediate due to the bilinearity of inner products.

Since $\langle\emptyset,z\rangle=0$, to complete the proof, it suffices to show that if $r\neq\emptyset$ then $\text{Jac}(a,r)=\text{Jac}(b,r)$ if and only if $\langle r,z\rangle=0$. For this, note that $1-\text{Jac}(c,r)=|cr|/|c\cup r|$ and $|c\cup r|=|c|+|r|-\langle c,r\rangle$, for all $c\in 2^{X}$. In particular, simple algebra shows that $\text{Jac}(a,r)=\text{Jac}(b,r)$ is equivalent to having $(|r|+|b|)\,\langle a,r\rangle-(|r|+|a|)\,\langle b,r\rangle=0$, that is, $\langle z,r\rangle=0$ due to the bilinearity of inner products. ∎
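
The equivalence in Lemma 2.2 can also be confirmed by enumeration on small ground sets. The following Python sketch is an illustration only: it assumes $X=\{0,\ldots,|X|-1\}$, uses hypothetical helper names, and evaluates $\langle r,z\rangle$ through the identity in equation (3) rather than by building the vector $z$.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(∅, ∅) = 0."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def power_set(ground):
    """All subsets of the ground set, as frozensets."""
    items = list(ground)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def inner_product_rz(a: frozenset, b: frozenset, r: frozenset) -> int:
    """<r, z> for z = (|r| + |b|) a - (|r| + |a|) b, via equation (3)."""
    return (len(r) + len(b)) * len(a & r) - (len(r) + len(a)) * len(b & r)


def lemma_2_2_holds(size: int) -> bool:
    """Check both directions of Lemma 2.2 over all triples (a, b, r)."""
    subsets = power_set(range(size))
    for a, b, r in product(subsets, repeat=3):
        collide = jaccard(a, r) == jaccard(b, r)
        orthogonal = inner_product_rz(a, b, r) == 0
        if collide and not orthogonal:
            return False  # forward implication would fail
        if r and orthogonal and not collide:
            return False  # converse (for non-empty r) would fail
    return True


if __name__ == "__main__":
    print([lemma_2_2_holds(n) for n in range(1, 5)])
```

The restriction $r\neq\emptyset$ in the converse is needed: with $a=\emptyset$, $b=\{0\}$ and $r=\emptyset$, the inner product vanishes while $\text{Jac}(a,r)=0\neq 1=\text{Jac}(b,r)$.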

We also want an inner product characterization of sets $a$ and $b$ that not only collide in terms of their Jaccard distance to a set $r$ but also to $r^{c}$, the complement of $r$. Our next result provides a necessary condition for both collisions to occur. This characterization is used in Section 2.6.1 to show Theorem 1.2.

Corollary 2.1.

Let $a,b,r\in 2^{X}$. If $\text{Jac}(a,r)=\text{Jac}(b,r)$ and $\text{Jac}(a,r^{c})=\text{Jac}(b,r^{c})$ then $\big(|r^{c}|-|r|\big)\cdot\big(|br|-|ar|\big)=|r^{c}|\cdot\big(|b|-|a|\big)$.

Proof.

If $\text{Jac}(a,r)=\text{Jac}(b,r)$ and $\text{Jac}(a,r^{c})=\text{Jac}(b,r^{c})$ then Lemma 2.2 implies that $\langle r,z_{1}\rangle=0$ and $\langle r^{c},z_{2}\rangle=0$, where $z_{1}:=\big(|r|+|b|\big)\,a-\big(|r|+|a|\big)\,b$ and $z_{2}:=\big(|r^{c}|+|b|\big)\,a-\big(|r^{c}|+|a|\big)\,b$. Hence, due to the identity in equation (3), we have that

$$\begin{aligned}
0 &=\langle r,z_{1}\rangle+\langle r^{c},z_{2}\rangle\\
&=\big(|r|+|b|\big)\cdot|ar|-\big(|r|+|a|\big)\cdot|br|+\big(|r^{c}|+|b|\big)\cdot|r^{c}a|-\big(|r^{c}|+|a|\big)\cdot|r^{c}b|\\
&=|r|\cdot\big(|ar|-|br|\big)+|r^{c}|\cdot\big\{|r^{c}a|-|r^{c}b|\big\}+|ar|\cdot|b|-|br|\cdot|a|+\big\{|r^{c}a|\cdot|b|-|r^{c}b|\cdot|a|\big\}.
\end{aligned}\tag{4}$$

But $|r^{c}a|=|a|-|ar|$ and $|r^{c}b|=|b|-|br|$; in particular, we may rewrite the expressions within the curly braces above as follows: $|r^{c}a|-|r^{c}b|=\big(|br|-|ar|\big)+|a|-|b|$, and $|r^{c}a|\cdot|b|-|r^{c}b|\cdot|a|=|a|\cdot|br|-|b|\cdot|ar|$. Finally, substituting these two expressions back in equation (4), and after recognizing various term cancellations, we obtain that

$$0=\big(|r^{c}|-|r|\big)\cdot\big(|br|-|ar|\big)-|r^{c}|\cdot\big(|b|-|a|\big),$$

from which the Corollary follows. ∎
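
The identity in Corollary 2.1 can be tested by enumeration as well. The Python sketch below is a hedged illustration with $X=\{0,\ldots,|X|-1\}$ and hypothetical helper names; it scans all triples $(a,b,r)$ and verifies the identity whenever both collisions occur.

```python
from fractions import Fraction
from itertools import combinations, product


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(∅, ∅) = 0."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def power_set(ground):
    """All subsets of the ground set, as frozensets."""
    items = list(ground)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def identity_holds(a, b, r, ground) -> bool:
    """(|r^c| - |r|)(|br| - |ar|) == |r^c| (|b| - |a|)."""
    rc = ground - r
    lhs = (len(rc) - len(r)) * (len(b & r) - len(a & r))
    rhs = len(rc) * (len(b) - len(a))
    return lhs == rhs


def corollary_2_1_holds(size: int) -> bool:
    """Whenever a and b collide on both r and r^c, the identity must hold."""
    ground = frozenset(range(size))
    subsets = power_set(ground)
    for a, b, r in product(subsets, repeat=3):
        rc = ground - r
        both_collide = (jaccard(a, r) == jaccard(b, r)
                        and jaccard(a, rc) == jaccard(b, rc))
        if both_collide and not identity_holds(a, b, r, ground):
            return False
    return True


if __name__ == "__main__":
    print([corollary_2_1_holds(n) for n in range(1, 5)])
```

The condition is necessary but not sufficient: for $X=\{0,1\}$, $a=\{0\}$, $b=\{1\}$ and $r=\{0\}$, both sides of the identity vanish, yet $\text{Jac}(a,r)=0\neq 1=\text{Jac}(b,r)$.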

| $\lvert X\rvert$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\lvert R\rvert$ | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |
| $\frac{1}{\lvert R\rvert}\sum_{r\in R}\lvert r\rvert$ | 1 | 1.5 | 2.0 | 2.33 | 2.66 | 3.5 | 4.4 | 3.87 | 4.3 | 5.8 | 5.9 | 6 | 6.4 | 7.4 |

Table 1: Upper bounds for $\beta(2^{X},\text{Jac})$ obtained by the ICH for a range of sizes of $X$. The middle row is the size of the resolving set $R$ found by the ICH. The bottom row is the average size of the sets in $R$, which often seems approximately equal to $|X|/2$.
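
For very small ground sets, the metric dimension can be computed by exhaustive search, which gives an independent point of comparison for the first columns of Table 1. The Python sketch below is a brute-force search and not the ICH, which this section does not describe; the ground set $\{0,\ldots,|X|-1\}$ and all function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(∅, ∅) = 0."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def power_set(ground):
    """All subsets of the ground set, as frozensets."""
    items = list(ground)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def resolves(candidates, subsets) -> bool:
    """True if the distance vectors to the candidates are pairwise distinct."""
    signatures = {tuple(jaccard(s, r) for r in candidates) for s in subsets}
    return len(signatures) == len(subsets)


def metric_dimension(size: int) -> int:
    """Smallest size of a resolving set of (2^X, Jac), by exhaustive search."""
    subsets = power_set(range(size))
    for k in range(1, len(subsets) + 1):
        for candidates in combinations(subsets, k):
            if resolves(candidates, subsets):
                return k
    raise AssertionError("unreachable: the whole space resolves itself")


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, metric_dimension(size))
```

The search space grows doubly exponentially in $|X|$, so this approach is practical only for the first few columns of the table.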

2.5 Resolving Pairs of Subsets of $X$ of Equal Size

In this section, we prove Theorem 1.1 using the probabilistic method [1].

In what follows, $k\geq 1$ and $r_{1},\ldots,r_{k}$ are i.i.d. $\text{Bernoulli}(X,1/2)$ random subsets. In particular, $\mathbb{E}|r_{i}|=|X|/2$, which is consistent with the experimental results displayed in Table 1, and guided the selection of the parameter 1/2 in the Binomial distribution.

Define

$$R_{1}:=\{r_{1},\ldots,r_{k}\}.$$

In accordance with the probabilistic method, and to obtain an upper bound on the metric dimension of $2^{X}$, we aim to find a $k$ such that the probability that $R_{1}$ does not resolve all distinct pairs $a,b\in 2^{X}$ of equal size is strictly less than one. If we can find such a $k$, then there exists an $R\subset 2^{X}$ with $|R|=k$ that resolves all different pairs of subsets of $X$ of equal size. In particular, due to Lemma 2.1, we could assert that $\beta(2^{X},\text{Jac})\leq(3+k)$. The challenge is to find $k$ as small as possible so that $(k+3)$ is a tight upper bound for the metric dimension of $2^{X}$, and the following probability

$$\Sigma_{1}:=\mathbb{P}\left(\exists\,a,b\in 2^{X},\text{ with }|a|=|b|\text{ but }a\neq b,\text{ such that }\forall r\in R_{1}:\text{Jac}(a,r)=\text{Jac}(b,r)\right)\tag{5}$$

becomes asymptotically negligible as $|X|\to\infty$. Theorem 1.1 identifies a $k$ meeting this criterion.
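
For small ground sets, the probability $\Sigma_{1}$ in equation (5) can be estimated by simulation. The Python sketch below is a Monte Carlo illustration under explicit assumptions: $X=\{0,\ldots,|X|-1\}$, a fixed seed for reproducibility, and hypothetical function names. It draws $k$ independent subsets that keep each element with probability 1/2 and records how often some pair of distinct equal-size subsets collides on all of them.

```python
import random
from fractions import Fraction
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(∅, ∅) = 0."""
    union = a | b
    if not union:
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


def power_set(ground):
    """All subsets of the ground set, as frozensets."""
    items = list(ground)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def random_subset(ground, rng: random.Random) -> frozenset:
    """Keep each element of the ground set independently with probability 1/2."""
    return frozenset(x for x in ground if rng.random() < 0.5)


def fails_on_equal_sizes(sample, subsets) -> bool:
    """True if some pair a != b with |a| = |b| is equidistant from every r."""
    for a, b in combinations(subsets, 2):
        if len(a) == len(b) and all(
            jaccard(a, r) == jaccard(b, r) for r in sample
        ):
            return True
    return False


def estimate_sigma_1(size: int, k: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which R_1 fails to resolve all equal-size pairs."""
    rng = random.Random(seed)
    ground = list(range(size))
    subsets = power_set(ground)
    failures = 0
    for _ in range(trials):
        sample = [random_subset(ground, rng) for _ in range(k)]
        failures += fails_on_equal_sizes(sample, subsets)
    return failures / trials


if __name__ == "__main__":
    for k in (1, 3, 6, 9):
        print(k, estimate_sigma_1(size=5, k=k, trials=200))
```

With $|X|=3$ and $k=1$ the estimate is always 1: the distance from a singleton $\{x\}$ to $r$ depends only on whether $x\in r$, so two of the three singletons must collide.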

2.5.1 Sizing the probability $\Sigma_{1}$

In this section, we identify a $k$ in terms of $|X|$, of the same order of magnitude as the asymptotic lower bound for $\beta(2^{X},\text{Jac})$ in Proposition 1.2, such that $\Sigma_{1}=o(1)$.

Suppose there exist $a,b\in 2^{X}$ such that $|a|=|b|$, $a\neq b$, and $\text{Jac}(a,r)=\text{Jac}(b,r)$ for all $r\in R_{1}$. Then, per Lemma 2.2, for each $r\in R_{1}$, $\big(|r|+|a|\big)\cdot\langle a-b,r\rangle=0$. But $|a|>0$ because $a\neq b$. So $\langle a-b,r\rangle=0$, or equivalently:

$$\langle z,r\rangle=0,\text{ with }z:=(ab^{c}-a^{c}b).$$

But observe that $z\in\mathcal{Z}$, where

$$\mathcal{Z}:=\left\{z\in\big\{0,\pm 1\big\}^{|X|}\text{ such that }\sum_{x\in X}z(x)=0\text{ and }\sum_{x\in X}|z(x)|\geq 2\right\}.$$

Consequently:

$$\Sigma_{1}\leq\mathbb{P}\big(\exists\,z\in\mathcal{Z}\text{ such that }\forall r\in R_{1}:\langle z,r\rangle=0\big).$$

To bound the probability on the right-hand side above, consider a $z\in\mathcal{Z}$ and the sets $I:=\{x\in X\text{ such that }z(x)=+1\}$, and $J:=\{x\in X\text{ such that }z(x)=-1\}$. Observe that $I$ and $J$ are non-empty, disjoint, and of the same size; let $i$ be said cardinality. Note that $1\leq i\leq\lfloor|X|/2\rfloor$, and that $|I\cap r_{1}|,\ldots,|I\cap r_{k}|,|J\cap r_{1}|,\ldots,|J\cap r_{k}|$ are i.i.d. $\text{Binomial}(i,1/2)$ random variables. Further, since $\langle z,r_{t}\rangle=|I\cap r_{t}|-|J\cap r_{t}|$, $\langle z,r_{1}\rangle,\ldots,\langle z,r_{k}\rangle$ are also i.i.d. As a result:

$$\mathbb{P}\big(\langle z,r_{t}\rangle=0\big)=\sum_{j=0}^{i}\binom{i}{j}^{2}\left(\frac{1}{2}\right)^{2i}=\left(\frac{1}{2}\right)^{2i}\sum_{j=0}^{i}\binom{i}{j}\binom{i}{i-j}=\left(\frac{1}{2}\right)^{2i}\binom{2i}{i},$$

and

$$\Sigma_{1}\leq\sum_{i=1}^{\lfloor|X|/2\rfloor}\binom{|X|}{i,i,|X|-2i}\left\{\binom{2i}{i}\left(\frac{1}{2}\right)^{2i}\right\}^{k}.\tag{6}$$

But

$$\binom{|X|}{i,i,|X|-2i}\leq\frac{|X|^{2i}}{(i!)^{2}}=O\left(\frac{\big(|X|e/i\big)^{2i}}{i}\right),$$

where the big-O is direct from Stirling’s formula. On the other hand, Stirling’s formula also implies that $\binom{2i}{i}\big(\frac{1}{2}\big)^{2i}\sim\frac{1}{\sqrt{i\pi}}$. However, for a bona fide substitution of $\binom{2i}{i}\big(\frac{1}{2}\big)^{2i}$ by $\frac{1}{\sqrt{i\pi}}$ in equation (6), one needs a stronger relationship between these two sequences. To this end, observe that [19]:

$$\exp\left\{\frac{1}{12i+1}\right\}<\frac{i!}{\sqrt{2\pi}\,i^{i+1/2}\,e^{-i}}<\exp\left\{\frac{1}{12i}\right\},\text{ for all }i\geq 1.$$

In particular,

$$\binom{2i}{i}\Big(\frac{1}{2}\Big)^{2i}\leq\frac{\exp\Big\{\frac{1-36i}{24i(12i+1)}\Big\}}{\sqrt{i\pi}}\leq\frac{1}{\sqrt{i\pi}},$$

and from the inequality in equation (6) we see that

$$\Sigma_{1}=\frac{|X|\big(|X|-1\big)}{2^{k}}+O\left(\frac{1}{\pi^{k/2}}\sum_{i=2}^{\lfloor|X|/2\rfloor}\frac{\big(|X|e/i\big)^{2i}}{i^{k/2}}\cdot\frac{1}{i}\right).\tag{7}$$
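
The quantities entering equations (6) and (7) are easy to evaluate exactly. The Python sketch below is illustrative and uses hypothetical function names, with `size` standing for $|X|$. It computes the probability that two independent $\text{Binomial}(i,1/2)$ variables coincide, which is the factor raised to the power $k$ in equation (6), compares it with $1/\sqrt{i\pi}$, and evaluates the logarithm of the right-hand side of (6).

```python
import math
from fractions import Fraction


def collision_probability(i: int) -> Fraction:
    """P(B = B') for independent B, B' ~ Binomial(i, 1/2), summed term by term."""
    terms = (Fraction(math.comb(i, j) ** 2, 4 ** i) for j in range(i + 1))
    return sum(terms, Fraction(0))


def central_term(i: int) -> Fraction:
    """The factor C(2i, i) (1/2)^(2i) that is raised to the power k in (6)."""
    return Fraction(math.comb(2 * i, i), 4 ** i)


def log_sigma_1_bound(size: int, k: int) -> float:
    """Natural log of the right-hand side of (6); requires size >= 2."""
    logs = []
    for i in range(1, size // 2 + 1):
        # Multinomial coefficient |X|! / (i! i! (|X| - 2i)!), computed exactly.
        multinomial = math.comb(size, i) * math.comb(size - i, i)
        log_factor = math.log(math.comb(2 * i, i)) - 2 * i * math.log(2)
        logs.append(math.log(multinomial) + k * log_factor)
    top = max(logs)  # log-sum-exp keeps the exponentials in floating-point range
    return top + math.log(sum(math.exp(v - top) for v in logs))


if __name__ == "__main__":
    for i in (1, 2, 5, 10):
        print(i, float(collision_probability(i)), 1 / math.sqrt(i * math.pi))
    print(log_sigma_1_bound(size=20, k=40))
```

For $i=1$ the collision probability is $1/2$ and there are $|X|(|X|-1)$ vectors $z$ with exactly one $+1$ and one $-1$, which accounts for the leading term $|X|(|X|-1)/2^{k}$ in equation (7).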

The following result will let us handle the big-O term above.

Lemma 2.3.

If $k\geq\frac{2|X|\ln(2e)}{\ln(|X|/2)}$ then $\frac{(|X|e/i)^{2i}}{i^{k/2}}=O(1)$, uniformly for all $|X|$ large enough and $2\leq i\leq\lfloor|X|/2\rfloor$.

Proof.

It suffices to show that

$$k\geq\frac{4i\ln\big(|X|e/i\big)}{\ln(i)},\tag{8}$$

for all $|X|$ large enough and $2\leq i\leq\lfloor|X|/2\rfloor$. For this, consider the function defined as $f(\tau):=4\tau\ln\big(|X|e/\tau\big)/\ln(\tau)$, for $2\leq\tau\leq|X|/2$. But note that

$$f^{\prime}(\tau)=4\,\frac{\ln(|X|e)\ln(\tau)-\ln^{2}(\tau)-\ln(|X|e)}{\big\{\ln(\tau)\big\}^{2}}=4\,\frac{\ln\left(\frac{\tau_{0}}{\tau}\right)\cdot\ln\left(\frac{\tau}{\tau_{1}}\right)}{\big\{\ln(\tau)\big\}^{2}},\tag{9}$$

where the second identity assumes that |X|>e3𝑋superscript𝑒3|X|>e^{3}| italic_X | > italic_e start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT, in which case

τ0subscript𝜏0\displaystyle\tau_{0}italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT :=exp{1+ln|X|2(1141+ln|X|)}=exp{1+O(1ln|X|)};assignabsent1𝑋21141𝑋1𝑂1𝑋\displaystyle:=\exp\left\{\frac{1+\ln|X|}{2}\left(1-\sqrt{1-\frac{4}{1+\ln|X|}% }\right)\right\}=\exp\left\{1+O\left(\frac{1}{\ln|X|}\right)\right\};:= roman_exp { divide start_ARG 1 + roman_ln | italic_X | end_ARG start_ARG 2 end_ARG ( 1 - square-root start_ARG 1 - divide start_ARG 4 end_ARG start_ARG 1 + roman_ln | italic_X | end_ARG end_ARG ) } = roman_exp { 1 + italic_O ( divide start_ARG 1 end_ARG start_ARG roman_ln | italic_X | end_ARG ) } ;
τ1subscript𝜏1\displaystyle\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT :=exp{1+ln|X|2(1+141+ln|X|)}=exp{ln|X|+O(1ln|X|)}.assignabsent1𝑋21141𝑋𝑋𝑂1𝑋\displaystyle:=\exp\left\{\frac{1+\ln|X|}{2}\left(1+\sqrt{1-\frac{4}{1+\ln|X|}% }\right)\right\}=\exp\left\{\ln|X|+O\left(\frac{1}{\ln|X|}\right)\right\}.:= roman_exp { divide start_ARG 1 + roman_ln | italic_X | end_ARG start_ARG 2 end_ARG ( 1 + square-root start_ARG 1 - divide start_ARG 4 end_ARG start_ARG 1 + roman_ln | italic_X | end_ARG end_ARG ) } = roman_exp { roman_ln | italic_X | + italic_O ( divide start_ARG 1 end_ARG start_ARG roman_ln | italic_X | end_ARG ) } .

In particular, τ0esimilar-tosubscript𝜏0𝑒\tau_{0}\sim eitalic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_e and τ1|X|similar-tosubscript𝜏1𝑋\tau_{1}\sim|X|italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ | italic_X |. Thus, as long as |X|𝑋|X|| italic_X | is large enough, 2<τ0<|X|/2<τ12subscript𝜏0𝑋2subscript𝜏12<\tau_{0}<|X|/2<\tau_{1}2 < italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT < | italic_X | / 2 < italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, and equation (9) implies that f(τ)𝑓𝜏f(\tau)italic_f ( italic_τ ) is decreasing for τ[2,τ0]𝜏2subscript𝜏0\tau\in[2,\tau_{0}]italic_τ ∈ [ 2 , italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ] and increasing for τ[τ0,|X|/2]𝜏subscript𝜏0𝑋2\tau\in[\tau_{0},|X|/2]italic_τ ∈ [ italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , | italic_X | / 2 ], which in turn implies that f(τ)𝑓𝜏f(\tau)italic_f ( italic_τ ) is maximized at τ=2𝜏2\tau=2italic_τ = 2 or τ=|X|/2𝜏𝑋2\tau=|X|/2italic_τ = | italic_X | / 2. Since f(2)f(|X|/2)much-less-than𝑓2𝑓𝑋2f(2)\ll f\big{(}|X|/2\big{)}italic_f ( 2 ) ≪ italic_f ( | italic_X | / 2 ), f𝑓fitalic_f is maximized at τ=|X|/2𝜏𝑋2\tau=|X|/2italic_τ = | italic_X | / 2; in particular, the inequality in equation (8) is satisfied when kf(|X|/2)𝑘𝑓𝑋2k\geq f\big{(}|X|/2\big{)}italic_k ≥ italic_f ( | italic_X | / 2 ), which shows the Lemma. ∎
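As a numerical sanity check of the monotonicity argument, the following minimal Python sketch evaluates $f$ on the integers $2 \leq i \leq \lfloor |X|/2 \rfloor$ and locates its maximum. The value $|X| = 1000$ and the names `f`, `tau0`, `tau1`, and `threshold` are illustrative assumptions; the sketch does not replace the proof, which covers all sufficiently large $|X|$.

```python
import math

def f(tau: float, n: int) -> float:
    """Right-hand side of (8) as a function of tau, with n standing for |X|."""
    return 4 * tau * math.log(n * math.e / tau) / math.log(tau)

n = 1000  # illustrative value of |X|, assumed to be "large enough"

# Zeros of f' according to (9); they exist because ln(n) > 3.
c = 1 + math.log(n)  # ln(|X| e)
root = math.sqrt(1 - 4 / c)
tau0 = math.exp(c / 2 * (1 - root))
tau1 = math.exp(c / 2 * (1 + root))

# Evaluate f on the integer range used in Lemma 2.3 and find its maximizer.
values = {i: f(i, n) for i in range(2, n // 2 + 1)}
argmax = max(values, key=values.get)

# Lower bound on k required by Lemma 2.3, i.e. f(|X|/2).
threshold = 2 * n * math.log(2 * math.e) / math.log(n / 2)

print(tau0, tau1, argmax, threshold)
```

For this choice of $|X|$, the maximizer is the right endpoint $\lfloor |X|/2 \rfloor$, and $f$ there coincides with the threshold on $k$ in the statement of the lemma.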

Finally, due to equation (7), if $k$ satisfies the condition in Lemma 2.3 then

$$\Sigma_1 = O\left(\frac{|X|^2}{2^k}\right) + O\left(\frac{1}{\pi^{k/2}}\sum_{i=2}^{\lfloor |X|/2 \rfloor}\frac{1}{i}\right) = o(1) + O\left(\frac{\ln|X|}{\pi^{k/2}}\right) = o(1),$$

where, for the middle identity, we have used that the partial sums of the harmonic series grow logarithmically with the number of terms. This completes the proof of Theorem 1.1.

2.6 Resolving Subsets of $X$ of Different Size

In this section, we prove Theorem 1.2.

After having characterized the asymptotic order of the metric dimension of $(2^X, \text{Jac})$, i.e., the asymptotically optimal size of resolving sets for $2^X$, in this final section, we see how to resolve all pairs of subsets of $X$ of different size.

For this, consider the problem of resolving all distinct $a, b \in 2^X$ such that $|a| < |b|$, using a set of the form

$$R_2 = \{r_1, r_1^c, \ldots, r_k, r_k^c\}, \tag{10}$$

where $r_1, \ldots, r_k$ are i.i.d. with a $\text{Binomial}(X, 1/2)$ distribution. It follows that

$$\mathbb{P}\big(R_2 \text{ does not resolve all distinct } a, b \in 2^X \text{ such that } |a|, |b| \leq |X|/2\big) \leq \Sigma_2,$$

where

$$\Sigma_2 := \mathbb{P}\left(\exists\, a, b \in 2^X, \text{ with } |a| < |b| \leq \frac{|X|}{2}, \text{ such that } \forall r \in R_2: \text{Jac}(a,r) = \text{Jac}(b,r)\right). \tag{11}$$
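The construction in (10) can be sketched in a few lines of Python. The sketch below reads $\text{Binomial}(X, 1/2)$ as "each element of $X$ is kept independently with probability $1/2$", which is an assumption about notation fixed earlier in the paper. The function names (`jaccard`, `sample_R2`, `power_set`, `unresolved_pairs`), the ground set of size 6, and the seed are all hypothetical choices made for illustration, and the brute-force search is only feasible for very small $|X|$.

```python
import random
from fractions import Fraction
from itertools import combinations

def jaccard(a: frozenset, b: frozenset) -> Fraction:
    """Jaccard distance |a Δ b| / |a ∪ b|, with Jac(a, a) = 0."""
    if a == b:
        return Fraction(0)
    return Fraction(len(a ^ b), len(a | b))

def sample_R2(X: frozenset, k: int, rng: random.Random) -> list:
    """Return [r_1, r_1^c, ..., r_k, r_k^c] as in (10).

    Assumption: each r_t keeps every element of X independently with probability 1/2.
    """
    R = []
    for _ in range(k):
        r = frozenset(x for x in sorted(X) if rng.random() < 0.5)
        R.extend([r, X - r])
    return R

def power_set(X: frozenset) -> list:
    """All subsets of X, ordered by size."""
    items = sorted(X)
    return [frozenset(c) for m in range(len(items) + 1) for c in combinations(items, m)]

def unresolved_pairs(X: frozenset, R: list) -> list:
    """Pairs (a, b) with |a| < |b| whose distance vectors to R coincide."""
    subsets = power_set(X)
    return [(a, b) for a in subsets for b in subsets
            if len(a) < len(b) and all(jaccard(a, r) == jaccard(b, r) for r in R)]

X = frozenset(range(6))
R2 = sample_R2(X, k=4, rng=random.Random(0))
print(len(R2), len(unresolved_pairs(X, R2)))
```

Exact rational arithmetic (`Fraction`) is used so that equality of Jaccard distances is tested without floating-point error.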

2.6.1 Sizing the probability $\Sigma_2$

Suppose that $a, b, r \in 2^X$ are such that $|a| < |b|$, $\text{Jac}(a,r) = \text{Jac}(b,r)$, and $\text{Jac}(a,r^c) = \text{Jac}(b,r^c)$; in particular, per Corollary 2.1, $|r^c| \cdot \big(|b| - |a|\big) = \big(|r^c| - |r|\big) \cdot \big(|br| - |ar|\big)$. But $|r^c| = |X| - |r|$, $|b| - |a| = |a^c b| - |b^c a|$, and $|br| - |ar| = |a^c b r| - |b^c a r|$. So, after dividing both sides by $|X|$, we may rewrite the last identity as follows:

$$\big(|a^c b| - |b^c a|\big) \cdot \left(1 - \frac{|r|}{|X|}\right) = \big(|a^c b r| - |b^c a r|\big) \cdot \left(1 - \frac{2|r|}{|X|}\right).$$

If we define $\Delta_c(r) := |c| - 2|cr|$ for each $c \in 2^X$, the above identity is equivalent to

$$\frac{\Delta_X(r)}{|X|} \cdot \frac{\Delta_v(r) - \Delta_u(r)}{|u| - |v|} = 1, \tag{12}$$

where $u := a^c b$ and $v := b^c a$ are disjoint subsets of $X$ such that $|u| > |v|$. Notably, for $r \sim \text{Binomial}(X, 1/2)$, the probability of the above event depends only on the quantities $|u|$, $|v|$, and $|X|$, without regard to the specific identity of $u$ and $v$, except for the constraints that $uv = \emptyset$ and $|u| > |v|$. So we may define $\rho(i,j,X)$ as the probability of the event in (12) when $(u,v) \in 2^X \times 2^X$ are such that $uv = \emptyset$, $|u| = i > j = |v|$, and $r \sim \text{Binomial}(X, 1/2)$.
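The passage from the displayed identity to (12) is a purely algebraic rearrangement, which the following minimal Python sketch checks exhaustively on a small ground set. Writing both identities in cleared-denominator form, the sketch confirms that the gap of (12) is always $-2$ times the gap of the displayed identity, so one vanishes exactly when the other does. The names `delta`, `gap_display`, and `gap_12`, and the choice $|X| = 4$, are assumptions made for illustration.

```python
from itertools import combinations

def delta(c: frozenset, r: frozenset) -> int:
    """Delta_c(r) := |c| - 2|c ∩ r|."""
    return len(c) - 2 * len(c & r)

X = frozenset(range(4))
n = len(X)
subsets = [frozenset(s) for m in range(n + 1) for s in combinations(sorted(X), m)]

checked = 0
mismatches = 0
for a in subsets:
    for b in subsets:
        u, v = b - a, a - b  # u = a^c b and v = b^c a
        for r in subsets:
            size_r = len(r)
            # Displayed identity, multiplied through by |X|: zero iff it holds.
            gap_display = ((len(u) - len(v)) * (n - size_r)
                           - (len(u & r) - len(v & r)) * (n - 2 * size_r))
            # Identity (12), multiplied through by |X| (|u| - |v|): zero iff it holds.
            gap_12 = delta(X, r) * (delta(v, r) - delta(u, r)) - n * (len(u) - len(v))
            checked += 1
            mismatches += (gap_12 != -2 * gap_display)

print(checked, mismatches)
```

The check runs over all triples $(a, b, r)$, not only those satisfying the two Jaccard equalities, since the rearrangement itself does not depend on them.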

It follows from the above discussion that if

$$\mathcal{P} := \left\{(u,v) \in 2^X \times 2^X \text{ such that } uv = \emptyset \text{ and } |v| < |u|\right\},$$

then

$$\Sigma_2 \leq \mathbb{P}\left(\exists\, (u,v) \in \mathcal{P} \text{ such that } \forall t \in \{1,\ldots,k\}: \frac{\Delta_X(r_t)}{|X|} \cdot \frac{\Delta_v(r_t) - \Delta_u(r_t)}{|u| - |v|} = 1\right).$$

But note that the random vectors $\big(\Delta_u(r_t), \Delta_v(r_t), \Delta_X(r_t)\big)$, with $t = 1, \ldots, k$, are i.i.d. for any given $(u,v) \in \mathcal{P}$. As a result:

$$\Sigma_2 \leq \sum_{i=1}^{|X|} \sum_{j} \binom{|X|}{i,\, j,\, |X|-i-j}\, \rho^k\big(i,j,X\big), \tag{13}$$

where the index $j$ in the inner sum above is such that $0 \leq j < i$ and $(i+j) \leq |X|$.

Lemma 2.4.

If $1 \leq i \leq |X|$ and $0 \leq j < i$, with $(i+j) \leq |X|$, then $\rho(i,j,X) \leq 4\exp\left(-\frac{\sqrt{|X|}}{2}\right)$.

Proof.

Let $(u,v) \in \mathcal{P}$ be such that $|u| = i$ and $j = |v|$; in particular, $i > j$. Then, for each $\tau > 0$:

$$\begin{aligned}
\rho(i,j,X) &= \mathbb{P}\left(\Delta_X(r) \cdot \big(\Delta_v(r) - \Delta_u(r)\big) = (i-j) \cdot |X|\right)\\
&\leq \mathbb{P}\left(|\Delta_X(r)| \geq \sqrt{2\tau|X|}\right) + \mathbb{P}\left(|\Delta_u(r) - \Delta_v(r)| \geq (i-j)\sqrt{\frac{|X|}{2\tau}}\right)\\
&\leq 2\left\{\exp\left(-\tau\right) + \exp\left(-\frac{(i-j)^2|X|}{4(i+j)\tau}\right)\right\},
\end{aligned}$$

where for the last inequality we have used the well-known Hoeffding's inequality, and that $\frac{1}{2}\big(\Delta_u(r) - \Delta_v(r)\big)$ has the same distribution as $\sum_{k=1}^{i+j}\big(Z_k - \mathbb{E}(Z_k)\big)$, where $Z_1, \ldots, Z_{i+j}$ are independent random variables, with $Z_k \sim \text{Bernoulli}(1/2)$ for $1 \leq k \leq i$, and $(-Z_k) \sim \text{Bernoulli}(1/2)$ for $i < k \leq i+j$. Therefore, by selecting

$$\tau := \frac{i-j}{2}\sqrt{\frac{|X|}{i+j}} \tag{14}$$

we obtain that

$$\rho(i,j,X) \leq 4e^{-\tau}. \tag{15}$$

But

$$\tau = \sqrt{i} \cdot \frac{1 - j/i}{\sqrt{1 + j/i}} \cdot \frac{\sqrt{|X|}}{2} \geq \frac{\sqrt{|X|}}{2}, \tag{16}$$

because the first factor above is an increasing function of $i$, whereas the second factor is a decreasing function of $j/i$. The lemma is now a direct consequence of the inequalities in (15)-(16). ∎

Remark 2.1.

The choice of $\tau$ in (14) is somewhat optimal when $\tau \geq 1$, which is a necessary condition for the upper-bound in (15) to be non-trivial. (The latter requires of course $\tau \geq 2\ln 2$ which, based on (16), can be guaranteed as soon as $|X| \geq 8$.) Indeed, from the last proof: $\rho(i,j,X) \leq 2f(t)$, where $f(t) := e^{-t} + e^{-\tau^2/t}$ for $t > 0$. But note that $f'(t) = \left(g(\tau^2/t) - g(t)\right)/t$, with $g(t) := te^{-t}$ for $t > 0$; hence $t = \tau$ is a critical point of $f(t)$. Moreover, since $f''(\tau) = 2\,e^{-\tau}\left(1 - \tau^{-1}\right)$, $t = \tau$ is a local minimum when $\tau > 1$.
In particular, since $f'''(1) = 0$ but $f''''(1) = 2e^{-1} > 0$ when $\tau = 1$, $t = \tau$ is a local minimum of $f(t)$ when $\tau \geq 1$.
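The claims of the remark can be probed numerically. The following minimal Python sketch evaluates the closed form of $f'$ at $t = \tau$ and compares $f(\tau)$ with its values slightly to the left and right, including the borderline case $\tau = 1$ where the second derivative vanishes. The step size `h` and the sampled values of $\tau$ are arbitrary illustrative choices.

```python
import math

def f(t: float, tau: float) -> float:
    """f(t) = exp(-t) + exp(-tau^2 / t), as in Remark 2.1."""
    return math.exp(-t) + math.exp(-tau ** 2 / t)

def g(s: float) -> float:
    """g(s) = s * exp(-s)."""
    return s * math.exp(-s)

def f_prime(t: float, tau: float) -> float:
    """Closed form of f' from the remark: (g(tau^2 / t) - g(t)) / t."""
    return (g(tau ** 2 / t) - g(t)) / t

h = 0.05  # illustrative step for probing a neighbourhood of t = tau
for tau in (1.0, 1.5, 3.0):
    left = f(tau - h, tau) - f(tau, tau)
    right = f(tau + h, tau) - f(tau, tau)
    print(tau, f_prime(tau, tau), left, right)
```

For each sampled $\tau \geq 1$ the derivative vanishes at $t = \tau$ and both neighbouring values exceed $f(\tau)$, in line with a local minimum; for $\tau = 1$ the excess is of fourth order in `h`.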

Let $\rho_X$ be the upper bound for $\rho(i,j,X)$ given in Lemma 2.4. It follows from (13) that

$$\begin{aligned}
\Sigma_2 &\leq \rho_X^k \sum_{i=1}^{|X|} \sum_{j} \binom{|X|}{i,\, j,\, |X|-i-j}\\
&\leq \rho_X^k \sum_{i=1}^{|X|} \sum_{j} \frac{|X|^{i+j}}{i!\, j!}\\
&\leq \rho_X^k \left(\sum_{i=1}^{|X|} \frac{|X|^i}{i!}\right)^2\\
&\leq \rho_X^k\, e^{2|X|} \left(\frac{\Gamma\left(|X|+1, |X|\right)}{|X|!}\right)^2,
\end{aligned}$$

where, for an integer $n > 0$ and $x \in \mathbb{R}$, $\Gamma(n,x) := (n-1)!\, e^{-x} \sum\limits_{i=0}^{n-1} \frac{x^i}{i!} = \int\limits_x^\infty t^{n-1} e^{-t}\, dt$ is the (upper) incomplete Gamma function. Finally, due to [18, Proposition 2.7], $\Gamma(|X|+1, |X|) = O\big(|X|^{|X|} e^{-|X|}\big)$. Consequently,

$$\Sigma_2 = O\left(\frac{\rho_X^k\, |X|^{2|X|}}{(|X|!)^2}\right) = O\left(\frac{\rho_X^k\, e^{2|X|}}{|X|}\right) = O\left(\frac{e^{2|X| + k\ln(\rho_X)}}{|X|}\right),$$

where we have used Stirling's approximation and the exp-log transform. In particular, for any $\epsilon < 1$, if we select $k$ so that $2|X| + k\ln(\rho_X) \leq \epsilon\ln|X|$, for instance, $k = \left\lceil \frac{4|X| - 2\epsilon\ln|X|}{\sqrt{|X|} - 4\ln 2} \right\rceil \sim 4\sqrt{|X|}$, then $\Sigma_2 = o(1)$, which completes the proof of Theorem 1.2.
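The explicit choice of $k$ is obtained by solving $2|X| + k\ln(\rho_X) \leq \epsilon\ln|X|$ for $k$, using $\ln(\rho_X) = 2\ln 2 - \sqrt{|X|}/2$, which requires $\sqrt{|X|} > 4\ln 2$. The following minimal Python sketch evaluates this choice for a few values of $|X|$, verifies the defining inequality, and reports the ratio $k/(4\sqrt{|X|})$. The names `rho_X` and `k_choice`, the value $\epsilon = 0.5$, and the sampled sizes are assumptions made for illustration.

```python
import math

def rho_X(n: int) -> float:
    """Upper bound of Lemma 2.4 with n standing for |X|."""
    return 4 * math.exp(-math.sqrt(n) / 2)

def k_choice(n: int, eps: float) -> int:
    """Explicit k from the text; assumes sqrt(n) > 4 ln 2."""
    return math.ceil((4 * n - 2 * eps * math.log(n)) / (math.sqrt(n) - 4 * math.log(2)))

eps = 0.5  # any value below 1 is allowed by the text; 0.5 is illustrative
for n in (100, 10_000, 1_000_000):
    k = k_choice(n, eps)
    exponent = 2 * n + k * math.log(rho_X(n))
    print(n, k, k / (4 * math.sqrt(n)), exponent <= eps * math.log(n))
```

The ratio in the third column approaches 1 slowly, reflecting the $4\ln 2$ correction in the denominator.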

2.7 Resolving Comparatively Small Subsets of $X$

In this section we prove Corollary 1.2, which is the consequence of arguments already used in the proofs of theorems 1.1 and 1.2. For this, let 0<ϵ<10italic-ϵ10<\epsilon<10 < italic_ϵ < 1, and 1W(1ϵ)(lnπ)|X|/ln|X|1𝑊1italic-ϵ𝜋𝑋𝑋1\leq W\leq(1-\epsilon)(\ln\pi)\sqrt{|X|}/\ln|X|1 ≤ italic_W ≤ ( 1 - italic_ϵ ) ( roman_ln italic_π ) square-root start_ARG | italic_X | end_ARG / roman_ln | italic_X | be an integer.

To show the Corollary, we reconsider the set R2subscript𝑅2R_{2}italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT in (10) with k(4+ϵ)|X|𝑘4italic-ϵ𝑋k\geq(4+\epsilon)\sqrt{|X|}italic_k ≥ ( 4 + italic_ϵ ) square-root start_ARG | italic_X | end_ARG. By distinguishing pairs a,b2X𝑎𝑏superscript2𝑋a,b\in 2^{X}italic_a , italic_b ∈ 2 start_POSTSUPERSCRIPT italic_X end_POSTSUPERSCRIPT such that |a|=|b|𝑎𝑏|a|=|b|| italic_a | = | italic_b | from |a||b|𝑎𝑏|a|\neq|b|| italic_a | ≠ | italic_b |, we find this time that

(a,b2X with ab such that rR2:Jac(a,r)=Jac(b,r))Σ2+Σ3,{\mathbb{P}}\left(\exists\,a,b\in 2^{X}\text{ with }a\neq b\text{ such that }% \forall r\in R_{2}:{\text{Jac}}(a,r)={\text{Jac}}(b,r)\right)\leq\Sigma_{2}+% \Sigma_{3},blackboard_P ( ∃ italic_a , italic_b ∈ 2 start_POSTSUPERSCRIPT italic_X end_POSTSUPERSCRIPT with italic_a ≠ italic_b such that ∀ italic_r ∈ italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT : Jac ( italic_a , italic_r ) = Jac ( italic_b , italic_r ) ) ≤ roman_Σ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + roman_Σ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , (17)

where Σ2subscriptΣ2\Sigma_{2}roman_Σ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is the double-sum in (13), and Σ3subscriptΣ3\Sigma_{3}roman_Σ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is a truncated version of the summation in (6). Specifically

Σ3:=i=1W(|X|i,i,|X|2i){(2ii)(12)2i}k.assignsubscriptΣ3superscriptsubscript𝑖1𝑊binomial𝑋𝑖𝑖𝑋2𝑖superscriptbinomial2𝑖𝑖superscript122𝑖𝑘\Sigma_{3}:=\sum_{i=1}^{W}{|X|\choose i,i,|X|-2i}\left\{{2i\choose i}\left(% \frac{1}{2}\right)^{2i}\right\}^{k}.roman_Σ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT := ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT ( binomial start_ARG | italic_X | end_ARG start_ARG italic_i , italic_i , | italic_X | - 2 italic_i end_ARG ) { ( binomial start_ARG 2 italic_i end_ARG start_ARG italic_i end_ARG ) ( divide start_ARG 1 end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT 2 italic_i end_POSTSUPERSCRIPT } start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT .

But, from the discussion in Section 2.6.1, we already know that Σ2=o(1)subscriptΣ2𝑜1\Sigma_{2}=o(1)roman_Σ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_o ( 1 ). On the other hand, from the discussion in Section 2.5.1 that led to (7), we can say that

Σ3=O(i=1W|X|2i(i!)2(iπ)k/2).subscriptΣ3𝑂superscriptsubscript𝑖1𝑊superscript𝑋2𝑖superscript𝑖2superscript𝑖𝜋𝑘2\Sigma_{3}=O\left(\sum_{i=1}^{W}\frac{|X|^{2i}}{(i!)^{2}\,(i\pi)^{k/2}}\right).roman_Σ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = italic_O ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT divide start_ARG | italic_X | start_POSTSUPERSCRIPT 2 italic_i end_POSTSUPERSCRIPT end_ARG start_ARG ( italic_i ! ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_i italic_π ) start_POSTSUPERSCRIPT italic_k / 2 end_POSTSUPERSCRIPT end_ARG ) .

As a result

Σ3=O(|X|2Wπk/2i=1W1(i!)2)=O(|X|2Wπk/2)=O(|X|2Wπ2|X|)=O(π2ϵ|X|),subscriptΣ3𝑂superscript𝑋2𝑊superscript𝜋𝑘2superscriptsubscript𝑖1𝑊1superscript𝑖2𝑂superscript𝑋2𝑊superscript𝜋𝑘2𝑂superscript𝑋2𝑊superscript𝜋2𝑋𝑂superscript𝜋2italic-ϵ𝑋\Sigma_{3}=O\left(\frac{|X|^{2W}}{\pi^{k/2}}\sum_{i=1}^{W}\frac{1}{(i!)^{2}}% \right)=O\left(\frac{|X|^{2W}}{\pi^{k/2}}\right)=O\left(\frac{|X|^{2W}}{\pi^{2% \sqrt{|X|}}}\right)=O\left(\pi^{-2\epsilon\sqrt{|X|}}\right),roman_Σ start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = italic_O ( divide start_ARG | italic_X | start_POSTSUPERSCRIPT 2 italic_W end_POSTSUPERSCRIPT end_ARG start_ARG italic_π start_POSTSUPERSCRIPT italic_k / 2 end_POSTSUPERSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG ( italic_i ! ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) = italic_O ( divide start_ARG | italic_X | start_POSTSUPERSCRIPT 2 italic_W end_POSTSUPERSCRIPT end_ARG start_ARG italic_π start_POSTSUPERSCRIPT italic_k / 2 end_POSTSUPERSCRIPT end_ARG ) = italic_O ( divide start_ARG | italic_X | start_POSTSUPERSCRIPT 2 italic_W end_POSTSUPERSCRIPT end_ARG start_ARG italic_π start_POSTSUPERSCRIPT 2 square-root start_ARG | italic_X | end_ARG end_POSTSUPERSCRIPT end_ARG ) = italic_O ( italic_π start_POSTSUPERSCRIPT - 2 italic_ϵ square-root start_ARG | italic_X | end_ARG end_POSTSUPERSCRIPT ) ,

where the last two asymptotic bounds use the constraints on $k$ and $W$. The corollary is now a direct consequence of the inequality in (17).
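The first two bounds in the last display rest on elementary term-by-term estimates: for $1\le i\le W$ we have $|X|^{2i}\le|X|^{2W}$ and $(i\pi)^{k/2}\ge\pi^{k/2}$, while $\sum_{i\ge 1}1/(i!)^{2}$ converges. The following is a minimal numerical sanity check of those two steps, not part of the proof. The function names and the sample values of `n` (standing for $|X|$), `W`, and `k` are hypothetical and chosen only to keep the floating-point arithmetic small; they are not claimed to satisfy the constraints on $k$ and $W$ used for the last two bounds.

```python
from math import factorial, pi


def sigma3_inner_sum(n: int, W: int, k: int) -> float:
    """Sum over i = 1..W of n^(2i) / ((i!)^2 * (i*pi)^(k/2)).

    This is the sum inside the first O(.) bound, with n standing for |X|.
    """
    return sum(
        n ** (2 * i) / (factorial(i) ** 2 * (i * pi) ** (k / 2))
        for i in range(1, W + 1)
    )


def inverse_factorial_square_sum(W: int) -> float:
    """Partial sum over i = 1..W of 1 / (i!)^2, bounded uniformly in W."""
    return sum(1 / factorial(i) ** 2 for i in range(1, W + 1))


def crude_bound(n: int, W: int, k: int) -> float:
    """n^(2W) / pi^(k/2) times the partial sum of 1 / (i!)^2.

    Obtained from sigma3_inner_sum by bounding n^(2i) <= n^(2W)
    and (i*pi)^(k/2) >= pi^(k/2) in every term.
    """
    return n ** (2 * W) / pi ** (k / 2) * inverse_factorial_square_sum(W)


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    n, W, k = 50, 3, 20
    print(sigma3_inner_sum(n, W, k) <= crude_bound(n, W, k))
    print(inverse_factorial_square_sum(50))
```

Because the partial sums of $1/(i!)^{2}$ stay below a fixed constant, the sum can be absorbed into the $O(\cdot)$, which gives the second bound.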

Acknowledgments. This work was partially funded by the NSF grant No. 1836914.

References

  • [1] N. Alon and J. H. Spencer, The Probabilistic Method, 2nd edn., Wiley, 2004.
  • [2] S. Bau and A. F. Beardon, The metric dimension of metric spaces, Comput. Methods Funct. Theory 13 (2013), 295–305.
  • [3] A. F. Beardon, Resolving the Hypercube, Discrete Applied Mathematics 161 (2013), 1882–1887.
  • [4] G. Chartrand et al., Resolvability in graphs and the metric dimension of a graph, Discrete Applied Mathematics 105 (2000), no. 1, 99–113.
  • [5] P. Erdős, F. Harary, and W. T. Tutte, On the dimension of a graph, Mathematika 12 (1965), no. 2, 118–122.
  • [6] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York, 1979.
  • [7] G. Gilbert, Distance between sets, Nature 239 (1972), 174.
  • [8] F. Harary and R. A. Melter, On the metric dimension of a graph, Ars Combinatoria 2 (1976), 191–195.
  • [9] M. Hauptmann, R. Schmied, and C. Viehmann, Approximation complexity of metric dimension problem, Journal of Discrete Algorithms 14 (2012), 214–222.
  • [10] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura, Bull. Société Vaudoise des Sciences Naturelles 37 (1901), no. 142, 547–579.
  • [11] R. M. Karp, Reducibility among combinatorial problems, Complexity of Computer Computations, Springer, 1972, pp. 85–103.
  • [12] D. E. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd edn., Addison-Wesley, 1997.
  • [13] S. Kosub, A note on the triangle inequality for the Jaccard distance, Pattern Recognition Letters 120 (2019), 36–38.
  • [14] D. Kuziak and I. G. Yero, Metric dimension related parameters in graphs: A survey on combinatorial, computational and applied results, arXiv preprint arXiv:2107.04877 (2021).
  • [15] L. Laird et al., Resolvability of Hamming graphs, SIAM Journal on Discrete Mathematics 34 (2020), no. 4, 2063–2081.
  • [16] G. Murphy, A metric basis characterization of Euclidean space, Pac. J. Math. 60 (1975), 159–163.
  • [17] A. Paradise, Quantitative encoding of bags-of-words for sentiment and sarcasm detection in textual data, Master’s thesis, The University of Colorado, 2024.
  • [18] I. Pinelis, Exact lower and upper bounds on the incomplete gamma function, Mathematical Inequalities & Applications 23 (2020), no. 4, 1261–1278.
  • [19] H. Robbins, A remark on Stirling’s formula, The American Mathematical Monthly 62 (1955), no. 1, 26–29.
  • [20] P. E. Ruth and M. E. Lladser, Levenshtein graphs: Resolvability, automorphisms & determining sets, Discrete Mathematics 346 (2023), no. 5, 113310.
  • [21] P. J. Slater, Leaves of trees, Congressus Numerantium 14 (1975), 549–559.
  • [22] R. C. Tillquist, R. M. Frongillo, and M. E. Lladser, Metric Dimension, Scholarpedia 14 (2019), no. 10, 53881. Revision #190769.
  • [23] R. C. Tillquist, R. M. Frongillo, and M. E. Lladser, Getting the lay of the land in discrete space: A survey of metric dimension and its applications, SIAM Review 65 (2023), no. 4, 919–962.
  • [24] R. C. Tillquist and M. E. Lladser, Low-dimensional representation of genomic sequences, Journal of Mathematical Biology 79 (2019), no. 1, 1–29.