Abstract
Representing a static set of integers S, with \(|S| = n\), from a finite universe \(U = [1{..}u]\) is a fundamental task in computer science. Our concern is to represent S in small space while supporting the operations of \(\mathsf {rank}\) and \(\mathsf {select}\) on S; if S is viewed as its characteristic vector, the problem becomes that of representing a bit-vector, which is arguably the most fundamental building block of succinct data structures. Although there is an information-theoretic lower bound of \({\mathcal {B}}(n, u)= \lg {u\atopwithdelims ()n}\) bits on the space needed to represent S, this applies to worst-case (random) sets S, and sets found in practical applications are compressible. We focus on the case where the elements of S form runs of \(\ell >1\) consecutive elements, a case that occurs in many practical situations. Let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) denote the class of \({u\atopwithdelims ()n}\) distinct sets of \(n\) elements over the universe \([1{..}u]\). Let also \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}\) contain the sets whose \(n\) elements are arranged in \(g \le n\) runs of \(\ell _i \ge 1\) consecutive elements from U for \(i=1,\ldots , g\), and let \({\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}\subset {\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}\) contain all sets that consist of g runs, such that \(r \le g\) of them have at least 2 elements. This paper yields the following insights and contributions related to \(\mathsf {rank}\)/\(\mathsf {select}\) succinct data structures:
- We introduce new compressibility measures for sets, including:
  - \({\mathcal {B}}_1(g,n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g}|} = \lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-1 \atopwithdelims ()g-1}}\), and
  - \({\mathcal {B}}_2(r, g, n,u)= \lg {|{\mathcal {C}}^{{\scriptscriptstyle (}n{\scriptscriptstyle )}}_{g,r}|} =\lg {{u-n+1 \atopwithdelims ()g}} + \lg {{n-g-1 \atopwithdelims ()r-1}} + \lg {{g\atopwithdelims ()r}}\),

  such that \({\mathcal {B}}_2(r, g, n,u)\le {\mathcal {B}}_1(g,n,u)\le {\mathcal {B}}(n, u)\).
- We give data structures that use space close to bounds \({\mathcal {B}}_1(g,n,u)\) and \({\mathcal {B}}_2(r, g, n,u)\) and support \(\mathsf {rank}\) and \(\mathsf {select}\) in \(\mathrm {O}(1)\) time.
- We provide additional measures involving entropy-coding run lengths and gaps between items, and data structures to support \(\mathsf {rank}\) and \(\mathsf {select}\) using space close to these measures.
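To get a feel for how the three measures compare, the following minimal Python sketch evaluates \({\mathcal {B}}\), \({\mathcal {B}}_1\) and \({\mathcal {B}}_2\) directly from the formulas above. The function names and the parameter values are illustrative assumptions, not part of the paper; the sketch also assumes \(1 \le r \le g < n\) so that all binomial arguments are non-negative.

```python
from math import comb, log2

def B(n, u):
    """Worst-case bound lg C(u, n), in bits."""
    return log2(comb(u, n))

def B1(g, n, u):
    """lg of the number of n-element sets over [1..u] arranged in g runs."""
    return log2(comb(u - n + 1, g)) + log2(comb(n - 1, g - 1))

def B2(r, g, n, u):
    """lg of the number of such sets where r of the g runs have >= 2 elements.

    Assumes 1 <= r <= g < n and a non-empty class (otherwise log2(0) fails).
    """
    return (log2(comb(u - n + 1, g))
            + log2(comb(n - g - 1, r - 1))
            + log2(comb(g, r)))

# Hypothetical parameters chosen only for illustration.
u, n, g, r = 10_000, 1_000, 50, 40
print(f"B  = {B(n, u):.1f} bits")
print(f"B1 = {B1(g, n, u):.1f} bits")
print(f"B2 = {B2(r, g, n, u):.1f} bits")
```

Because each class is contained in the previous one, the printed values should be non-increasing from \({\mathcal {B}}\) to \({\mathcal {B}}_2\) for any valid choice of parameters.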
Notes
For example, if we choose every element in \(U \) to be in \(S \) with probability 0.5, then \(\texttt {GAP}(S) \sim 0.81u\), less than the Shannon lower bound for \(S \).
Since \(\texttt {GAP}(S)\) and \(\texttt {RLE}(S)\) are not achievable, this statement is imprecise.
\([k\not \in {\hat{L}} ]\) is Iverson bracket notation: it equals 1 if \(k\not \in {\hat{L}} \) holds, and 0 otherwise.
References
Andersson, A., Thorup, M.: Dynamic ordered sets with exponential search trees. J. ACM 54(3), 13 (2007)
Arroyuelo, D., Raman, R.: Adaptive succinctness. In: Proceedings of the 26th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 11811, pp. 467–481. Springer (2019)
Arroyuelo, D., Oyarzún, M., González, S., Sepulveda, V.: Hybrid compression of inverted lists for reordered document collections. Inf. Process. Manag. 54(6), 1308–1324 (2018)
Barbay, J.: From time to space: fast algorithms that yield small and fast data structures. In: Space-Efficient Data Structures, Streams, and Algorithms—Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, LNCS 8066, pp. 97–111. Springer (2013)
Blandford, D.K., Blelloch, G.E.: Dictionaries using variable-length keys and data, with applications. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1–10. SIAM (2005)
Blandford, D.K., Blelloch, G.E.: Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms 4(2), 17:1–17:25 (2008)
Boldi, P., Vigna, S.: The webgraph framework I: compression techniques. In: Proceedings of the 13th International Conference on World Wide Web (WWW), pp. 595–602 (2004)
Boldi, P., Vigna, S.: The webgraph framework II: codes for the world-wide web. In: Proceedings of the Data Compression Conference (DCC), p. 528 (2004)
Bona, M.: A Walk Through Combinatorics: An Introduction to Enumeration and Graph Theory, 4th edn. World Scientific, Singapore (2016)
Bookstein, A., Klein, S.T.: Construction of optimal graphs for bit-vector compression. In: Proceedings of the 13th International Conference on Research and Development in Information Retrieval (SIGIR), pp. 327–342 (1990)
Cafagna, F., Böhlen, M.H.: Disjoint interval partitioning. VLDB J. 26(3), 447–466 (2017)
Chen, Y., Chen, Y.: An efficient algorithm for answering graph reachability queries. In: Proceedings of the 24th International Conference on Data Engineering (ICDE), pp. 893–902 (2008)
Chen, Y., Chen, Y.: Decomposing DAGs into spanning trees: a new way to compress transitive closures. In: Proceedings of the 27th International Conference on Data Engineering (ICDE), pp. 1007–1018 (2011)
Chen, Y., Shen, W.: An efficient method to evaluate intersections on big data sets. Theoret. Comput. Sci. 647, 1–21 (2016)
Clark, D.R., Munro, J.I.: Efficient suffix trees on secondary storage (extended abstract). In: Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 383–391 (1996)
Clark, D.: Compact pat trees. Ph.D. thesis, University of Waterloo (1997)
Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, New York (2006)
de Berg, M., Cheong, O., van Kreveld, M.J., Overmars, M.H.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer, New York (2008)
Delpratt, O., Rahman, N., Raman, R.: Engineering the LOUDS succinct tree representation. In: Proceedings of the 5th International Workshop on Experimental Algorithms (WEA), pp. 134–145 (2006)
Demaine, E.D., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 743–752. ACM/SIAM (2000)
Dignös, A., Böhlen, M.H., Gamper, J.: Overlap interval partition join. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 1459–1470 (2014)
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)
Estivill-Castro, V., Wood, D.: A survey of adaptive sorting algorithms. ACM Comput. Surv. 24(4), 441–476 (1992)
Foschini, L., Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: experiments with compressing suffix arrays and applications. ACM Trans. Algorithms 2(4), 611–639 (2006)
Fraenkel, A.S., Klein, S.T.: Novel compression of sparse bit-strings—preliminary report. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words, NATO ASI Series (Series F: Computer and Systems Sciences), vol. 12. Springer (1985)
Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-Runs bounded space. J. ACM 67(1), 2:1–2:54 (2020)
Gao, D., Jensen, C.S., Snodgrass, R.T., Soo, M.D.: Join operations in temporal databases. VLDB J. 14(1), 2–29 (2005)
Golomb, S.: Run-length encodings (corresp.). IEEE Trans. Inf. Theory 12(3), 399–401 (1966)
Golynski, A., Raman, R., Rao, S.S.: On the redundancy of succinct data structures. In: Proceedings of the 11th Scandinavian Workshop on Algorithm Theory (SWAT), LNCS 5124, pp. 148–159. Springer (2008)
Golynski, A., Orlandi, A., Raman, R., Rao, S.S.: Optimal indexes for sparse bit vectors. Algorithmica 69(4), 906–924 (2014)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841–850 (2003)
Gupta, A., Hon, W.K., Shah, R., Vitter, J.S.: Compressed data structures: dictionaries and data-aware measures. Theoret. Comput. Sci. 387(3), 313–331 (2007)
Huo, H., Chen, L., Zhao, H., Vitter, J.S., Nekrich, Y., Yu, Q.: A data-aware FM-index. In: Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 10–23 (2015)
Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of the 30th Annual Symposium on Foundations of Computer Science (FOCS), pp. 549–554 (1989)
Jakobsson, M.: Huffman coding in bit-vector compression. Inf. Process. Lett. 7(6), 304–307 (1978)
Jansson, J., Sadakane, K., Sung, W.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631 (2012)
Johnson, D.S., Krishnan, S., Chhugani, J., Kumar, S., Venkatasubramanian, S.: Compressing large boolean matrices using reordering techniques. In: Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pp. 13–23 (2004)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40–66 (2005)
Moffat, A., Zobel, J.: Parameterised compression for sparse bitmaps. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 274–285 (1992)
Navarro, G.: Compact Data Structures—A Practical Approach. Cambridge University Press, Cambridge (2016)
Golynski, A., Grossi, R., Gupta, A., Raman, R., Rao, S.S.: On the size of succinct indices. In: Proceedings of the 15th Annual European Symposium on Algorithms (ESA), LNCS 4698, pp. 371–382. Springer (2007)
Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 273–282 (2014)
Pǎtraşcu, M., Thorup, M.: Time-space trade-offs for predecessor search. In: Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pp. 232–240 (2006)
Pǎtraşcu, M., Viola, E.: Cell-probe lower bounds for succinct partial sums. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 117–122 (2010)
Pǎtraşcu, M.: Succincter. In: Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 305–313 (2008)
Pibiri, G.E., Venturini, R.: Techniques for inverted index compression. ACM Comput. Surv. 53(6), 125:1–125:36 (2021)
Quinlan, A.R., Robins, G., Hall, I.M., Skadron, K., Layer, R.M.: Binary Interval Search: a scalable algorithm for counting interval intersections. Bioinformatics 29(1), 1–7 (2012)
Rahman, N., Raman, R.: Rank and select operations on binary strings. In: Encyclopedia of Algorithms (2008)
Raman, R., Raman, V., Satti, S.R.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007)
Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1230–1239 (2006)
Soo, M.D., Snodgrass, R.T., Jensen, C.S.: Efficient evaluation of the valid-time natural join. In: Proceedings of the 10th International Conference on Data Engineering (ICDE), pp. 282–292 (1994)
Acknowledgements
The first author was funded by ANID - Millennium Science Initiative Program - Code ICN17_002.
A Proofs from Sect. 3
A.1 Proof of Theorem 2
Proof
Let us consider a characteristic bit vector (cbv) \(C_{\!S}= \mathbf {0}^{z_1}\mathbf {1}^{l_1}\mathbf {0}^{z_2}\mathbf {1}^{l_2}\cdots \mathbf {0}^{z_g}\mathbf {1}^{l_g}\mathbf {0}^{z_{g+1}}\) (for \(z_1,z_{g+1} \ge 0\), \(z_2,\ldots ,z_g, l_1,\ldots , l_g >0\)) of length \(u\), with \(n\) \(\mathbf {1}\)s grouped into g 1-runs. This corresponds to a set \(S \subseteq U\) of \(n\) elements arranged in g maximal runs. Think of \(C_{\!S}\) as consisting of \(g+1\) distinguishable “bins”, each of the form \(\mathbf {0}^{z_i}\mathbf {1}^{l_i}\), except for the last bin, which contains only \(\mathbf {0}\)s (and can be empty). Let us count how many ways there are to distribute the \(n\) \(\mathbf {1}\)s among the first g bins, and the \(u- n\) \(\mathbf {0}\)s among all \(g+1\) bins.
1. To count the number of ways in which the \(\mathbf {1}\) bits can be distributed among the first g bins, note that each bin must have at least one \(\mathbf {1}\). This leaves only \(n-g\) \(\mathbf {1}\)s to place freely, which can be distributed in \({n-g+g-1 \atopwithdelims ()g-1} = {n-1 \atopwithdelims ()g-1}\) different ways. An alternative way to get this is to count the number of compositions of the integer \(n\) into g parts: each such composition is an ordered tuple \(\langle m_1,\ldots , m_g \rangle \) such that \(m_i>0\) and \(m_1 + \cdots + m_g = n\). It turns out that this number is \({n- 1 \atopwithdelims ()g-1}\) [9, see Corollary 5.3].
2. Similarly, to count the number of ways in which the \(u-n\) \(\mathbf {0}\)s can be distributed among \(g+1\) bins, recall that each bin must contain at least one \(\mathbf {0}\), to separate it from the previous \(\mathbf {1}\)-run in the bit vector. The only exceptions are the first and last bins, as the bit vector does not necessarily start and end with \(\mathbf {0}\)s. To reduce the number of particular cases, we prefix the bit vector with a dummy \(\mathbf {0}\), increasing the universe size to \(u+1\). For the \(\mathbf {0}\)s at the end of \(C_{\!S}\), on the other hand, we consider two cases:
   (a) \(C_{\!S}\) ends in \(\mathbf {1}\): the \(\mathbf {0}\)s must be distributed among g distinguishable bins. Since each bin must have at least one \(\mathbf {0}\), we are left with \(u+1-n-g\) \(\mathbf {0}\)s, which can be distributed in \({u+ 1 - n- g +g- 1 \atopwithdelims ()g-1} = {u- n\atopwithdelims ()g-1}\) different ways.
   (b) \(C_{\!S}\) ends in \(\mathbf {0}\): in this case, we append an additional bin (we now have \(g+1\) of them) that can contain only \(\mathbf {0}\)s. Since each of the \(g+1\) bins must contain at least one \(\mathbf {0}\), we are left with \(u+1-n-(g+1)\) \(\mathbf {0}\)s, which can be distributed in \({u+1-n-(g+1)+(g+1)-1 \atopwithdelims ()g} = {u- n\atopwithdelims ()g}\) different ways.

   From (a) and (b) we obtain, by Pascal's rule, \({u- n\atopwithdelims ()g-1}+{u- n\atopwithdelims ()g}={u-n+1 \atopwithdelims ()g}\), which is the total number of ways of distributing the \(\mathbf {0}\)s into \(g+1\) bins.
Combining the results from items 1 and 2, we obtain \({n-1 \atopwithdelims ()g-1}{u-n+ 1 \atopwithdelims ()g}\) different characteristic bit vectors of length \(u\) with \(n\) \(\mathbf {1}\)s arranged in g runs. \(\square \)
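As a sanity check on the counting argument, the following Python sketch enumerates all \(n\)-element subsets of \([1{..}u]\) for small \(u\) and compares the number of subsets with exactly g maximal runs against the closed form \({n-1 \atopwithdelims ()g-1}{u-n+1 \atopwithdelims ()g}\). It is an illustrative brute-force test, not part of the proof; the function names are assumptions introduced here.

```python
from itertools import combinations
from math import comb

def count_runs(elems):
    """Number of maximal runs of consecutive integers in a sorted tuple."""
    return sum(1 for i, x in enumerate(elems)
               if i == 0 or x != elems[i - 1] + 1)

def brute_force_count(u, n, g):
    """Count n-subsets of [1..u] with exactly g maximal runs by enumeration."""
    return sum(1 for s in combinations(range(1, u + 1), n)
               if count_runs(s) == g)

def theorem2_count(u, n, g):
    """Closed form: ways to place the 1s times ways to place the 0s."""
    return comb(n - 1, g - 1) * comb(u - n + 1, g)

# Exhaustive comparison over small parameters (u is kept small on purpose).
for u in range(1, 10):
    for n in range(1, u + 1):
        for g in range(1, n + 1):
            # Pascal's rule, used to merge cases (a) and (b) of item 2.
            assert comb(u - n, g - 1) + comb(u - n, g) == comb(u - n + 1, g)
            assert brute_force_count(u, n, g) == theorem2_count(u, n, g)
```

For instance, with \(u=6\), \(n=3\) and \(g=2\), both functions return 12: of the 20 three-element subsets, 4 form a single run and 4 consist of three isolated elements.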
A.2 Proof of Theorem 3
To prove Theorem 3, we need the following result:
Lemma 4
There are \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions of an integer \(n\) into g parts \(\langle m_1,\ldots , m_g \rangle \), such that \(m_i > 0\) for all i, \(m_1+\cdots +m_g = n\), and exactly \(r\le g\) of these \(m_i\) are \(\ge 2\).
Proof
Consider g distinct, initially empty bins \(G_1, \ldots , G_g\), and \(n\) identical balls. For \(i=1,\ldots , g\), let \(m_i\) be the number of balls in \(G_i\) (initially \(m_i=0\)). Since \(m_i>0\) must hold, we first assign a single ball to each bin. From the \(n-g\) remaining balls, we assign one more ball to each of the first r bins \(G_1,\ldots , G_r\) (since r parts must be \(\ge 2\)). The remaining \(n-g-r\) balls can be distributed among these r bins in \({n-g-r+r-1 \atopwithdelims ()r-1} = {n-g-1 \atopwithdelims ()r-1}\) distinct ways. Now, the r bins of size \(\ge 2\) are not necessarily \(G_1,\ldots ,G_r\): they can be any r of the g bins. There are \({g \atopwithdelims ()r}\) ways of choosing r bins out of g, hence the lemma follows. \(\square \)
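The lemma can also be checked empirically for small values. The sketch below, in Python, enumerates every g-tuple of positive integers summing to \(n\) and counts those with exactly r parts \(\ge 2\); the function names are hypothetical, and the check is restricted to \(1 \le r \le g < n\), where all binomial arguments are non-negative.

```python
from itertools import product
from math import comb

def brute_force_compositions(n, g, r):
    """Count tuples (m_1..m_g) with m_i >= 1, sum n, and exactly r parts >= 2."""
    return sum(
        1
        for parts in product(range(1, n + 1), repeat=g)
        if sum(parts) == n and sum(m >= 2 for m in parts) == r
    )

def lemma4_count(n, g, r):
    """Closed form from Lemma 4 (assumes 1 <= r <= g < n)."""
    return comb(n - g - 1, r - 1) * comb(g, r)

# Exhaustive comparison over small parameters.
for n in range(2, 9):
    for g in range(1, min(n, 5)):
        for r in range(1, g + 1):
            assert brute_force_compositions(n, g, r) == lemma4_count(n, g, r)
```

For example, \(n=5\), \(g=3\), \(r=2\) gives 3 compositions, namely the three orderings of \(\langle 2,2,1 \rangle \). Summing the closed form over all r recovers \({n-1 \atopwithdelims ()g-1}\), the total number of compositions of \(n\) into g parts when \(n > g\).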
Proof of Theorem 3
By Lemma 4, an integer \(n\) has \({n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\) distinct compositions into g parts such that exactly r of these parts are \(\ge 2\). This is the number of ways of distributing the \(\mathbf {1}\)s in the characteristic bit vector \(C_{\!S}[1{..}u]\) while satisfying the imposed restrictions. It must be combined with the number of ways to distribute the \(u- n\) \(\mathbf {0}\)s, which is \({u-n+1 \atopwithdelims ()g}\) according to the proof of Theorem 2. This completes the proof. \(\square \)
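The resulting count, \({u-n+1 \atopwithdelims ()g}{n-g-1 \atopwithdelims ()r-1}{g \atopwithdelims ()r}\), can be cross-checked by enumeration for small universes. The following Python sketch does so; its function names are assumptions introduced for illustration, and it only covers \(1 \le r \le g < n\).

```python
from itertools import combinations
from math import comb

def run_lengths(elems):
    """Lengths of the maximal runs of consecutive integers in a sorted tuple."""
    lengths = []
    for i, x in enumerate(elems):
        if i > 0 and x == elems[i - 1] + 1:
            lengths[-1] += 1      # x extends the current run
        else:
            lengths.append(1)     # x starts a new run
    return lengths

def brute_force_count(u, n, g, r):
    """Count n-subsets of [1..u] with g runs, exactly r of them of length >= 2."""
    total = 0
    for s in combinations(range(1, u + 1), n):
        lengths = run_lengths(s)
        if len(lengths) == g and sum(l >= 2 for l in lengths) == r:
            total += 1
    return total

def theorem3_count(u, n, g, r):
    """Closed form: 0-placements (Theorem 2) times compositions (Lemma 4)."""
    return comb(u - n + 1, g) * comb(n - g - 1, r - 1) * comb(g, r)

# Exhaustive comparison over small parameters.
for u in range(2, 10):
    for n in range(2, u + 1):
        for g in range(1, n):
            for r in range(1, g + 1):
                assert brute_force_count(u, n, g, r) == theorem3_count(u, n, g, r)
```

With \(u=6\), \(n=4\), \(g=2\) and \(r=1\), both counts equal 6: the run lengths are either (3, 1) or (1, 3), and each pattern can be placed in 3 ways.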
Cite this article
Arroyuelo, D., Raman, R. Adaptive Succinctness. Algorithmica 84, 694–718 (2022). https://doi.org/10.1007/s00453-021-00872-1
DOI: https://doi.org/10.1007/s00453-021-00872-1