Compressed data structures: Dictionaries and data-aware measures

Published: 20 November 2007

Abstract

In this paper, we propose measures for compressed data structures, in which space usage is measured in a data-aware manner. In particular, we consider the fundamental dictionary problem on set data, where the task is to construct a data structure for representing a set S of n items out of a universe U = {0, ..., u-1} and supporting various queries on S. We use a well-known data-aware measure for set data called gap to bound the space of our data structures. We describe a novel dictionary structure that requires gap + O(n log(u/n) / log n) + O(n log log(u/n)) bits. Under the RAM model, our dictionary supports membership, rank, and predecessor queries in nearly optimal time, matching the time bound of Andersson and Thorup's predecessor structure [A. Andersson, M. Thorup, Tight(er) worst-case bounds on dynamic searching and priority queues, in: ACM Symposium on Theory of Computing, STOC, 2000], while simultaneously improving upon their space usage. We support select queries even faster in O(log log n) time. Our dictionary structure uses exactly gap bits in the leading term (i.e., the constant factor is 1) and answers queries in near-optimal time. When seen from the worst-case perspective, we present the first O(n log(u/n))-bit dictionary structure that supports these queries in near-optimal time under the RAM model. We also build a dictionary which requires the same space and supports membership, select, and partial rank queries even more quickly in O(log log n) time. We go on to show that for many (real-world) datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. To the best of our knowledge, these are the first results that achieve data-aware space usage and retain near-optimal time.
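
To make the contrast between a data-aware measure and a combinatorial one concrete, here is a minimal Python sketch. It assumes one common formulation of gap: the sum, over consecutive items, of ceil(log2(g + 1)) bits for each gap g. The paper's exact definition may differ in details such as how the first item is treated. The function names, the virtual item at -1, and the sample set are all assumptions made for illustration. The query helpers only pin down the semantics of membership, rank, select, and predecessor on a plain sorted list; they are not the dictionary structure described in the abstract and do not attempt its space or time bounds.

```python
import math
from bisect import bisect_left, bisect_right


def gap_measure(sorted_items):
    """Sum of ceil(log2(g + 1)) over gaps g between consecutive items.

    Assumed formulation: a virtual item at -1 precedes the first item,
    so every gap is at least 1. For a positive integer g,
    g.bit_length() equals ceil(log2(g + 1)), which avoids floats.
    """
    total = 0
    prev = -1
    for item in sorted_items:
        total += (item - prev).bit_length()
        prev = item
    return total


def combinatorial_bound(u, n):
    """ceil(log2(C(u, n))): bits needed to name any n-subset of a u-universe."""
    return (math.comb(u, n) - 1).bit_length()


def member(sorted_items, x):
    """True if x is in the set."""
    i = bisect_left(sorted_items, x)
    return i < len(sorted_items) and sorted_items[i] == x


def rank(sorted_items, x):
    """Number of items <= x (conventions vary; this one is an assumption)."""
    return bisect_right(sorted_items, x)


def select(sorted_items, i):
    """The i-th smallest item, counting from 1."""
    return sorted_items[i - 1]


def predecessor(sorted_items, x):
    """Largest item <= x, or None if no such item exists."""
    i = bisect_right(sorted_items, x)
    return sorted_items[i - 1] if i > 0 else None


if __name__ == "__main__":
    sample = [3, 4, 5, 6, 1000]  # hypothetical clustered set, u = 1024
    print("gap bits:", gap_measure(sample))
    print("combinatorial bits:", combinatorial_bound(1024, len(sample)))
    print("rank(5):", rank(sample, 5))
    print("select(2):", select(sample, 2))
    print("predecessor(999):", predecessor(sample, 999))
```

On this clustered sample the gap total is 16 bits against a 44-bit combinatorial bound, which is the kind of saving the abstract attributes to data-aware methods. The gap total ignores the extra bits needed to make the individual codes self-delimiting, so it is a lower-order-free target rather than the size of a concrete encoding.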

References

[1] A. Andersson, M. Thorup, Tight(er) worst-case bounds on dynamic searching and priority queues, in: ACM Symposium on Theory of Computing, STOC, 2000
[2] D. Blandford, G. Blelloch, Compact representations of ordered sets, in: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, January 2004
[3] D. Blandford, G. Blelloch, Dictionaries using variable-length keys and data, with applications, in: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, January 2005
[4] P. Beame, F. Fich, Optimal bounds for the predecessor problem, in: ACM Symposium on Theory of Computing, STOC, 1999, pp. 295-304
[5] Brodnik, A. and Munro, I., Membership in constant time and almost-minimum space. SIAM Journal on Computing. v28 i5. 1627-1640.
[6] Bell, T.C., Moffat, A., Nevill-Manning, C.G., Witten, I.H. and Zobel, J., Data compression in full-text retrieval systems. Journal of the American Society for Information Science. v44 i9. 508-531.
[7] P. Crescenzi, L. Dardini, R. Grossi, IP address lookup made fast and simple, in: European Symposium on Algorithms, ESA, 1999, pp. 65-76
[8] Elias, P., Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory. vIT-21. 194-203.
[9] Fredman, M.L. and Willard, D.E., Surpassing the information theoretic bound with fusion trees. Journal of Computer and System Sciences. v47 i3. 424-436.
[10] R. Grossi, A. Gupta, J.S. Vitter, High-order entropy-compressed text indexes, in: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, January 2003
[11] Greene, D.H. and Knuth, D.E., Mathematics for the Analysis of Algorithms. 1981. Birkhäuser, Boston.
[12] Grossi, R. and Vitter, J.S., Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proceedings of the ACM Symposium on Theory of Computing, vol. 32.
[13] G. Jacobson, Succinct static data structures, Technical Report CMU-CS-89-112, Dept. of Computer Science, Carnegie-Mellon University, January 1989
[14] S.T. Klein, D. Shapira, Searching in compressed dictionaries, in: Data Compression Conference, DCC, 2002
[15] Mäkinen, V. and Navarro, G., Rank and select revisited and extended. Theoretical Computer Science.
[16] Munro, J.I., Tables. Foundations of Software Technology and Theoretical Computer Science. v16. 37-42.
[17] Pagh, R., Low redundancy in static dictionaries with O(1) worst case lookup time. In: Lecture Notes in Computer Science, vol. 1644. Springer-Verlag. pp. 595-604.
[18] M. Pătraşcu, M. Thorup, Time-space trade-offs for predecessor search, in: Proceedings of the ACM Symposium on Theory of Computing, 2006, pp. 232-240
[19] R. Raman, V. Raman, S.S. Rao, Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, in: ACM-SIAM Symposium on Discrete Algorithms, 2002, pp. 233-242
[20] K. Sadakane, R. Grossi, Squeezing succinct data structures into entropy bounds, in: ACM-SIAM Symposium on Discrete Algorithms, SODA, 2006, pp. 1230-1239
[21] van Emde Boas, P., Kaas, R. and Zijlstra, E., Design and implementation of an efficient priority queue. Mathematical Systems Theory. v10. 99-127.
[22] Willard, D.E., New trie data structures which support very fast search operations. Journal of Computer and System Sciences. v28 i3. 379-394.
[23] Witten, I.H., Moffat, A. and Bell, T.C., Managing Gigabytes: Compressing and Indexing Documents and Images. 1999. second ed. Morgan Kaufmann Publishers, Los Altos, CA.

Publisher

Elsevier Science Publishers Ltd.

United Kingdom

Author Tags

  1. BSGAP
  2. Compressed
  3. Dictionary problem
  4. Gap encoding
  5. Predecessor
  6. Rank
  7. Select

Cited By

  • (2022) A Learned Approach to Design Compressed Rank/Select Data Structures. ACM Transactions on Algorithms 18:3 (1-28). DOI: 10.1145/3524060. Online publication date: 11-Oct-2022
  • (2022) Adaptive Succinctness. Algorithmica 84:3 (694-718). DOI: 10.1007/s00453-021-00872-1. Online publication date: 1-Mar-2022
  • (2019) Adaptive Succinctness. String Processing and Information Retrieval (467-481). DOI: 10.1007/978-3-030-32686-9_33. Online publication date: 7-Oct-2019
  • (2017) Range selection and predecessor queries in data aware space and time. Journal of Discrete Algorithms 43:C (18-25). DOI: 10.1016/j.jda.2017.01.002. Online publication date: 1-Mar-2017
  • (2016) Efficient dynamic range minimum query. Theoretical Computer Science 656:PB (108-117). DOI: 10.1016/j.tcs.2016.07.002. Online publication date: 20-Dec-2016
  • (2015) Optimal Lower and Upper Bounds for Representing Sequences. ACM Transactions on Algorithms 11:4 (1-21). DOI: 10.1145/2629339. Online publication date: 13-Apr-2015
  • (2014) Optimal Indexes for Sparse Bit Vectors. Algorithmica 69:4 (906-924). DOI: 10.1007/s00453-013-9767-2. Online publication date: 1-Aug-2014
  • (2012) Improved address-calculation coding of integer arrays. Proceedings of the 19th international conference on String Processing and Information Retrieval (205-216). DOI: 10.1007/978-3-642-34109-0_22. Online publication date: 21-Oct-2012
  • (2011) Inverted indexes for phrases and strings. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (555-564). DOI: 10.1145/2009916.2009992. Online publication date: 24-Jul-2011
  • (2011) Layered label propagation. Proceedings of the 20th international conference on World wide web (587-596). DOI: 10.1145/1963405.1963488. Online publication date: 28-Mar-2011
