Representing a static set of integers S, with \(|S| = n\), from a finite universe \(U = [1{..}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures. Although there is an information-theoretic lower bound of \({\mathcal {B}}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where S contains runs of \(\ell >1\) consecutive elements, a case that occurs in many practical situations. Let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{..}u]\). Let also \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements. This paper yields the following insights and contributions related to \(\mathsf {rank}\)/\(\mathsf {select}\) succinct data structures:
We introduce new compressibility measures for sets, including:
\({\mathcal {B}}_1(g,n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\), and
\({\mathcal {B}}_2(r, g, n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\),
such that \({\mathcal {B}}_2(r, g, n,u)\le {\mathcal {B}}_1(g,n,u)\le {\mathcal {B}}(n, u)\).
We give data structures that use space close to bounds \({\mathcal {B}}_1(g,n,u)\) and \({\mathcal {B}}_2(r, g, n,u)\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in \(\mathrm {O}(1)\) time.
We provide additional measures involving entropy-coding run lengths and gaps between items, and data structures to support \(\mathsf {rank}\) and \(\mathsf {select}\) using space close to these measures.
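To make the relation between the three measures concrete, the following minimal sketch evaluates them numerically. The function names `B`, `B1`, and `B2` and the sample parameters are illustrative assumptions, not part of the paper; the formulas are copied from the definitions above.

```python
from math import comb, log2

def B(n, u):
    """Worst-case bound lg C(u, n), in bits."""
    return log2(comb(u, n))

def B1(g, n, u):
    """lg of the number of n-element sets over [1..u] arranged in g runs."""
    return log2(comb(u - n + 1, g)) + log2(comb(n - 1, g - 1))

def B2(r, g, n, u):
    """lg of the number of such sets with r runs of at least 2 elements.

    Assumes 1 <= r <= g and n - g >= r, so every binomial is positive.
    """
    return (log2(comb(u - n + 1, g))
            + log2(comb(n - g - 1, r - 1))
            + log2(comb(g, r)))

# Hypothetical sample parameters, chosen only for illustration.
u, n, g, r = 1000, 100, 10, 5
print(f"B  = {B(n, u):.1f} bits")
print(f"B1 = {B1(g, n, u):.1f} bits")
print(f"B2 = {B2(r, g, n, u):.1f} bits")
```

For these sample values the printed numbers should respect \({\mathcal {B}}_2 \le {\mathcal {B}}_1 \le {\mathcal {B}}\), with the gap widening as the elements concentrate into fewer runs.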
A Proofs from Sect. 3
A.1 Proof of Theorem 2
Let us consider a characteristic bit vector (cbv) \(C_{\!S}= \mathbf {0}^{z_1}\mathbf {1}^{l_1}\mathbf {0}^{z_2}\mathbf {1}^{l_2}\cdots \mathbf {0}^{z_g}\mathbf {1}^{l_g}\mathbf {0}^{z_{g+1}}\) (for \(z_1,z_{g+1} \ge 0\), \(z_2,\ldots ,z_g, l_1,\ldots , l_g >0\)) of length \(u\), with \(n\) \(\mathbf {1}\)s grouped into g 1-runs. This corresponds to a set \(S \subseteq U\) of \(n\) elements arranged in g maximal runs. Think of \(C_{\!S}\) as consisting of \(g+1\) distinguishable “bins”, each of the form \(\mathbf {0}^{z_i}\mathbf {1}^{l_i}\), except for the last bin, which contains only \(\mathbf {0}\)s (and can be empty). Let us count how many ways there are to distribute the \(n\) \(\mathbf {1}\)s among the first g bins, and the \(u- n\) \(\mathbf {0}\)s among all \(g+1\) bins.
1. To count the number of ways in which the \(\mathbf {1}\) bits can be distributed among the first g bins, note that each bin must have at least one \(\mathbf {1}\). This leaves only \(n-g\) \(\mathbf {1}\)s, which can be distributed in \({n-g+g-1 \atopwithdelims ()g-1} = {n-1 \atopwithdelims ()g-1}\) different ways. An alternative way to get this is to count the number of compositions of the integer \(n\) into g parts: each such composition is an ordered tuple \(\langle m_1,\ldots , m_g \rangle \) such that \(m_i>0\) and \(m_1 + \cdots + m_g = n\). This number is \({n- 1 \atopwithdelims ()g-1}\) [9, see Corollary 5.3].
2. Similarly, to count the number of ways in which the \(u-n\) \(\mathbf {0}\)s can be distributed among \(g+1\) bins, recall that each bin must contain at least one \(\mathbf {0}\), which separates it from the previous \(\mathbf {1}\)-run in the bit vector. The only exceptions are the first and last bins, as the bit vector does not necessarily start and end with \(\mathbf {0}\)s. To reduce the number of particular cases, we prefix the bit vector with a dummy \(\mathbf {0}\), increasing the universe size to \(u+1\). For the \(\mathbf {0}\)s at the end of \(C_{\!S}\), on the other hand, we consider two cases:
(a) \(C_{\!S}\) ends in \(\mathbf {1}\): the \(\mathbf {0}\)s must be distributed among g distinguishable bins. Since each bin must have at least one \(\mathbf {0}\), we are left with \(u+1-n-g\) \(\mathbf {0}\)s, which can be distributed in \({u+ 1 - n- g +g- 1 \atopwithdelims ()g-1} = {u- n\atopwithdelims ()g-1}\) different ways.
(b) \(C_{\!S}\) ends in \(\mathbf {0}\): in this case, we append an additional bin (we now have \(g+1\) of them) that can contain only \(\mathbf {0}\)s. Since each of the \(g+1\) bins must contain at least one \(\mathbf {0}\), we are left with \(u+1-n-(g+1)\) \(\mathbf {0}\)s, which can be distributed in \({u-n-g+(g+1)-1 \atopwithdelims ()g} = {u- n\atopwithdelims ()g}\) different ways.
From (a) and (b) we obtain \({u- n\atopwithdelims ()g-1}+{u- n\atopwithdelims ()g}={u-n+1 \atopwithdelims ()g}\), which is the total number of ways of distributing the \(\mathbf {0}\)s into \(g+1\) bins.
Combining the results from items 1 and 2, we obtain \({n-1 \atopwithdelims ()g-1}{u-n+ 1 \atopwithdelims ()g}\) different characteristic bit vectors of length \(u\) with \(n\) \(\mathbf {1}\)s arranged in g runs. \(\square \)
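The count in Theorem 2 can be sanity-checked by exhaustive enumeration on small universes. The sketch below is only an illustrative check, not part of the proof; the helper names `count_runs`, `brute_force_count`, and `theorem2_count` are hypothetical.

```python
from itertools import combinations
from math import comb

def count_runs(positions):
    """Number of maximal runs of consecutive integers in a sorted tuple."""
    return sum(
        1
        for i, p in enumerate(positions)
        if i == 0 or p != positions[i - 1] + 1
    )

def brute_force_count(u, n, g):
    """Count n-element subsets of [1..u] arranged in exactly g maximal runs."""
    return sum(
        1
        for s in combinations(range(1, u + 1), n)
        if count_runs(s) == g
    )

def theorem2_count(u, n, g):
    """Closed form from Theorem 2: C(n-1, g-1) * C(u-n+1, g)."""
    return comb(n - 1, g - 1) * comb(u - n + 1, g)

# Compare both counts for every 1 <= g <= n <= u on a small universe.
u = 8
for n in range(1, u + 1):
    for g in range(1, n + 1):
        assert brute_force_count(u, n, g) == theorem2_count(u, n, g)
print("Theorem 2 count matches brute force for u =", u)
```

The enumeration is exponential in \(u\), so it is only meant for small values; the closed form is what the space bound \({\mathcal {B}}_1(g,n,u)\) relies on.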
A.2 Proof of Theorem 3
To prove Theorem 3, we need the following result:
Lemma 4
There are \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions of an integer \(n\) into g parts \(\langle m_1,\ldots , m_g \rangle \), such that \(m_i > 0\) for all i, \(m_1+\cdots +m_g = n\), and exactly \(r\le g\) of these \(m_i\) are \(\ge 2\).
Proof. Consider g distinct, initially empty bins \(G_1, \ldots , G_g\), and \(n\) identical balls. For \(i=1,\ldots , g\), let \(m_i\) be the size of \(G_i\) (initially \(m_i=0\)). Since \(m_i>0\) must hold, we assign a single ball to each bin. From the \(n-g\) remaining balls, we assign another ball to each of the first r bins \(G_1,\ldots , G_r\) (since r parts are \(\ge 2\)). The remaining \(n-g-r\) balls can be distributed among these r bins in \({n-g-r+r-1 \atopwithdelims ()r-1} = {n-g-1 \atopwithdelims ()r-1}\) distinct ways. Now, observe that the r bins of size \(\ge 2\) are not necessarily \(G_1,\ldots ,G_r\), but can be any r of the g bins. There are \({g \atopwithdelims ()r}\) ways of choosing r bins out of g, hence the lemma follows. \(\square \)
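As an informal check of Lemma 4, one can enumerate all compositions of a small integer and count those with exactly r parts of size at least 2. This is a sketch under the assumption that \(n > g\) and \(1 \le r \le g\), so that every binomial has non-negative arguments; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def compositions(n, g):
    """Yield all ordered tuples of g positive integers summing to n."""
    # Choosing g-1 cut points among 1..n-1 splits n into g positive parts.
    for cuts in combinations(range(1, n), g - 1):
        bounds = (0,) + cuts + (n,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))

def brute_force_count(n, g, r):
    """Count compositions of n into g parts with exactly r parts >= 2."""
    return sum(
        1
        for parts in compositions(n, g)
        if sum(m >= 2 for m in parts) == r
    )

def lemma4_count(n, g, r):
    """Closed form from Lemma 4, assuming n > g and 1 <= r <= g."""
    return comb(n - g - 1, r - 1) * comb(g, r)

for n in range(2, 12):
    for g in range(1, n):          # n > g
        for r in range(1, g + 1):  # 1 <= r <= g
            assert brute_force_count(n, g, r) == lemma4_count(n, g, r)
print("Lemma 4 count matches brute force for n < 12")
```

For example, \(n=4\), \(g=2\), \(r=1\) gives the two compositions \(\langle 1,3 \rangle \) and \(\langle 3,1 \rangle \), in agreement with \({1 \atopwithdelims ()0}{2 \atopwithdelims ()1} = 2\).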
Proof of Theorem 3
As proved in Lemma 4, an integer \(n\) has \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions into g parts such that exactly r of these parts are at least 2. This is the number of ways of distributing the \(\mathbf {1}\)s in the characteristic bit vector \(C_{\!S}[1{..}u]\) while satisfying the imposed restrictions. It must be multiplied by the number of ways to distribute the \(u- n\) \(\mathbf {0}\)s, which is \({u-n+1 \atopwithdelims ()g}\) according to the proof of Theorem 2. This completes the proof. \(\square \)
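The resulting count \({u-n+1 \atopwithdelims ()g}{n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) can also be checked by enumeration on a small universe. The following is a minimal sketch, assuming \(n > g\) and \(1 \le r \le g\); the helper names are hypothetical and the code is not part of the proof.

```python
from itertools import combinations
from math import comb

def run_lengths(positions):
    """Lengths of the maximal runs of consecutive integers in a sorted tuple."""
    lengths = []
    for i, p in enumerate(positions):
        if i > 0 and p == positions[i - 1] + 1:
            lengths[-1] += 1   # p extends the current run
        else:
            lengths.append(1)  # p starts a new run
    return lengths

def brute_force_count(u, n, g, r):
    """Count n-subsets of [1..u] with g runs, exactly r of them of length >= 2."""
    total = 0
    for s in combinations(range(1, u + 1), n):
        lengths = run_lengths(s)
        if len(lengths) == g and sum(l >= 2 for l in lengths) == r:
            total += 1
    return total

def theorem3_count(u, n, g, r):
    """Closed form from Theorem 3, assuming n > g and 1 <= r <= g."""
    return comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)

# Small worked case: u = 5, n = 3, g = 2, r = 1.
print(brute_force_count(5, 3, 2, 1), theorem3_count(5, 3, 2, 1))  # prints: 6 6
```

In the worked case, the six sets are \(\{1,2,4\}\), \(\{1,2,5\}\), \(\{2,3,5\}\), \(\{1,3,4\}\), \(\{1,4,5\}\), and \(\{2,4,5\}\).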
