DOI: 10.1145/2213556.2213586
research-article

The wavelet trie: maintaining an indexed sequence of strings in compressed space

Published: 21 May 2012

Abstract

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory.
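The operations listed above can be pinned down with an uncompressed baseline. The following is a minimal sketch for illustration only: it stores the sequence in a plain Python list and answers every query by linear scan, so it shows the intended semantics, not the compressed structure the paper proposes. The class name and method names (`access`, `rank`, `select`, `rank_prefix`, `select_prefix`, `insert`, `delete`) are assumptions chosen for this example.

```python
class NaiveIndexedSequence:
    """Uncompressed reference semantics; every query is a linear scan."""

    def __init__(self, strings=()):
        self.items = list(strings)

    def access(self, pos):
        """Random access: the string stored at position pos."""
        return self.items[pos]

    def rank(self, s, pos):
        """Number of occurrences of s among the first pos strings."""
        return sum(1 for x in self.items[:pos] if x == s)

    def rank_prefix(self, p, pos):
        """Number of strings with prefix p among the first pos strings."""
        return sum(1 for x in self.items[:pos] if x.startswith(p))

    def select(self, s, idx):
        """Position of the idx-th (0-based) occurrence of s, or None."""
        return self._nth(lambda x: x == s, idx)

    def select_prefix(self, p, idx):
        """Position of the idx-th (0-based) string with prefix p, or None."""
        return self._nth(lambda x: x.startswith(p), idx)

    def insert(self, s, pos):
        """Dynamic setting: insert s so that it ends up at position pos."""
        self.items.insert(pos, s)

    def delete(self, pos):
        """Dynamic setting: remove the string at position pos."""
        del self.items[pos]

    def _nth(self, matches, idx):
        # Scan left to right, counting down the matches still to skip.
        for i, x in enumerate(self.items):
            if matches(x):
                if idx == 0:
                    return i
                idx -= 1
        return None


log = NaiveIndexedSequence(
    ["/home", "/about", "/home", "/blog/a", "/blog/b", "/home"]
)
# Range counting over positions [l, r) is a difference of two ranks.
home_in_range = log.rank("/home", 6) - log.rank("/home", 1)
```

Because order and repetitions both matter, the same string can be counted and located by position; range counting over `[l, r)` reduces to two `rank` calls, as in the last line.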
We introduce and study the problem of compressed indexed sequences of strings: representing indexed sequences of strings in nearly optimal compressed space, in both the static and dynamic settings, while preserving provably good performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e., the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
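One way to picture the combination of a Patricia trie with a wavelet tree is the static sketch below. It is an illustration under stated assumptions, not the paper's implementation: strings are mapped to bit strings by a hypothetical prefix-free encoding (`encode`), each node keeps the common prefix `alpha` of its subsequence plus one branching bit per string, and those bits are held in plain Python lists with linear-time rank instead of succinct, compressed bitvectors. Updates and `select` are omitted.

```python
import os


def encode(s, terminate=True):
    """Assumed encoding: each UTF-8 byte becomes '1' + 8 bits; a final '0'
    ends the string, so no full encoding is a prefix of another."""
    bits = "".join("1" + format(b, "08b") for b in s.encode("utf-8"))
    return bits + "0" if terminate else bits


def decode(bits):
    """Inverse of encode() for terminated bit strings."""
    data = bytes(int(bits[i + 1:i + 9], 2) for i in range(0, len(bits) - 1, 9))
    return data.decode("utf-8")


def _rank_bit(bits, b, pos):
    """Occurrences of bit b in bits[:pos] (linear scan, not succinct)."""
    ones = sum(bits[:pos])
    return ones if b else pos - ones


class _Node:
    def __init__(self, seqs):
        self.size = len(seqs)
        self.alpha = os.path.commonprefix(seqs)  # Patricia-style label
        k = len(self.alpha)
        if all(len(s) == k for s in seqs):
            # Leaf: every string in this subsequence is identical.
            self.bits = self.child = None
        else:
            # Wavelet-tree step: one branching bit per string, in sequence
            # order; each child receives its strings' remaining suffixes.
            self.bits = [int(s[k]) for s in seqs]
            self.child = [
                _Node([s[k + 1:] for s in seqs if s[k] == c]) for c in "01"
            ]


class WaveletTrieSketch:
    def __init__(self, strings):
        self.root = _Node([encode(s) for s in strings])

    def access(self, pos):
        """Rebuild the string at position pos by walking root to leaf."""
        if not 0 <= pos < self.root.size:
            raise IndexError(pos)
        node, out = self.root, []
        while True:
            out.append(node.alpha)
            if node.bits is None:
                return decode("".join(out))
            b = node.bits[pos]
            pos = _rank_bit(node.bits, b, pos)  # position inside the child
            out.append(str(b))
            node = node.child[b]

    def _rank_bits(self, bits, pos):
        """Count strings among the first pos whose encoding starts with bits."""
        node = self.root
        while True:
            if len(bits) <= len(node.alpha):
                return pos if node.alpha.startswith(bits) else 0
            if node.bits is None or not bits.startswith(node.alpha):
                return 0
            b = int(bits[len(node.alpha)])
            pos = _rank_bit(node.bits, b, pos)
            bits = bits[len(node.alpha) + 1:]
            node = node.child[b]

    def rank(self, s, pos):
        """Exact match: the terminated encoding can only prefix itself."""
        return self._rank_bits(encode(s), pos)

    def rank_prefix(self, p, pos):
        """Prefix match: the same walk with the unterminated encoding."""
        return self._rank_bits(encode(p, terminate=False), pos)


trie = WaveletTrieSketch(
    ["/home", "/about", "/home", "/blog/a", "/blog/b", "/home"]
)
```

In this sketch an exact-match count and a prefix count share a single traversal; only the encoding of the query differs, which is why prefix queries come naturally once the trie shape is stored alongside the per-node bits.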


Published In

PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
May 2012
332 pages
ISBN:9781450312486
DOI:10.1145/2213556
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compressed sequences
  2. indexing
  3. wavelet tree
  4. wavelet trie

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '12
Acceptance Rates

PODS '12 Paper Acceptance Rate 26 of 101 submissions, 26%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%
