Optimized succinct data structures for massive data

Published: 01 November 2014

Abstract

Succinct data structures provide the same functionality as their corresponding traditional data structures in compact space. We improve the rank and select functions, which are the basic building blocks of FM-indexes and other succinct data structures. First, we present a cache-optimal, uncompressed bitvector representation that outperforms all existing approaches. Next, we improve, in both space and time, on a recent result by Navarro and Providel on compressed bitvectors. Last, we show techniques to perform rank and select on 64-bit words that are up to three times faster than existing methods. In our experimental evaluation, we first show how our improvements affect cache and runtime performance of both operations on data sets larger than those commonly used in the evaluation of succinct data structures. Our experiments show that our improvements to these basic operations significantly improve the runtime performance and compression effectiveness of FM-indexes on small and large data sets. To our knowledge, our improvements result in FM-indexes that are either smaller or faster than all current state-of-the-art implementations. Copyright © 2013 John Wiley & Sons, Ltd.
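For readers unfamiliar with the two operations: under one common convention, rank1(B, i) counts the 1-bits in positions [0, i) of a bitvector B, and select1(B, k) returns the position of the k-th 1-bit. The Python sketch below is only an illustration of these definitions and is not the implementation described in the abstract. The names `rank1_word`, `select1_word`, and `RankBitvector` are hypothetical, and the layout (one cumulative count per 64-bit word, binary search for select) is an assumption chosen for brevity rather than the cache-optimal representation the authors present.

```python
# Illustrative sketch only: a plain rank/select bitvector, NOT the paper's
# optimized structure. All names here are hypothetical.
WORD_BITS = 64


def rank1_word(word: int, i: int) -> int:
    """Count the 1-bits among the lowest i bits (positions 0..i-1) of a word."""
    mask = (1 << i) - 1
    return bin(word & mask).count("1")


def select1_word(word: int, k: int) -> int:
    """Return the position of the k-th 1-bit (k >= 1), or -1 if there is none."""
    if k < 1:
        return -1
    for _ in range(k - 1):
        word &= word - 1  # clear the lowest set bit
    if word == 0:
        return -1
    return (word & -word).bit_length() - 1  # index of the lowest set bit


class RankBitvector:
    """Uncompressed bitvector with one cumulative 1-count per 64-bit word."""

    def __init__(self, bits):
        self.n = len(bits)
        self.words = []
        for start in range(0, self.n, WORD_BITS):
            w = 0
            for offset, b in enumerate(bits[start:start + WORD_BITS]):
                if b:
                    w |= 1 << offset
            self.words.append(w)
        # block_rank[j] = number of 1-bits in words[0..j-1]
        self.block_rank = [0]
        for w in self.words:
            self.block_rank.append(self.block_rank[-1] + bin(w).count("1"))

    def rank1(self, i: int) -> int:
        """Number of 1-bits in positions [0, i), for 0 <= i <= n."""
        j, r = divmod(i, WORD_BITS)
        if r == 0:
            return self.block_rank[j]
        return self.block_rank[j] + rank1_word(self.words[j], r)

    def select1(self, k: int) -> int:
        """Position of the k-th 1-bit (k >= 1), or -1 if fewer than k exist."""
        if k < 1 or k > self.block_rank[-1]:
            return -1
        # Binary search for the first word whose cumulative count reaches k.
        lo, hi = 0, len(self.words) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.block_rank[mid + 1] >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo * WORD_BITS + select1_word(self.words[lo], k - self.block_rank[lo])
```

With this convention the two operations are inverse in the sense that `rank1(select1(k)) == k - 1` for every valid k. A production implementation would typically replace the per-word loops with hardware popcount and broadword or SIMD tricks, which is the kind of word-level speedup the abstract refers to.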

References

[1]
Hon W-K, Shah R, Vitter JS. Compression, indexing, and retrieval for massive string data. In Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching (CPM), New York, NY, USA, 2010; pp. 260-274.
[2]
Culpepper JS, Petri M, Scholer F. Efficient in-memory top-k document retrieval. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Portland, OR, USA, 2012; pp. 225-234.
[3]
Mäkinen V, Navarro G, Sirén J, Välimäki N. Storage and retrieval of individual genomes. In Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), Tucson, AZ, USA, 2009; pp. 121-137.
[4]
Gog S. Compressed suffix trees: Design, construction, and applications. Ph.D. Thesis, Ulm University, Ulm, Germany, 2011.
[5]
Ohlebusch E, Fischer J, Gog S. CST++. In Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE), Los Cabos, Mexico, 2010; pp. 322-333.
[6]
Ferragina P, Manzini G. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), Redondo Beach, California, USA, 2000; pp. 390-398.
[7]
Ferragina P, Manzini G, Mäkinen V, Navarro G. An alphabet-friendly FM-index. In Proceedings of the 11th International Conference on String Processing and Information Retrieval (SPIRE), Padova, Italy, 2004; pp. 150-160.
[8]
Haque IS, Pande VS, Walters WP. Anatomy of high-performance 2D similarity calculations. Journal of Chemical Information and Modeling 2011; Volume 51 Issue 9: pp. 2345-2351.
[9]
Suciu A, Cobarzan P, Marton K. The never ending problem of counting bits efficiently. In Proceedings of the 10th Roedunet International Conference (ROEDUNET), Iasi, Romania, 2011; pp. 1-4.
[10]
Knuth D. The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1. Addison-Wesley: Reading, Massachusetts, 2011.
[11]
Vigna S. Broadword implementation of rank/select queries. In Proceedings of the 7th Workshop on Experimental Algorithms (WEA), Provincetown, MA, USA, 2008; pp. 154-168.
[12]
Munro I. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), Hyderabad, India, 1996; pp. 37-42.
[13]
Clark DR. Compact Pat Trees. Ph.D. Thesis, University of Waterloo, 1996.
[14]
Navarro G, Providel E. Fast, small, simple rank/select on bitmaps. In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA), Bordeaux, France, 2012; pp. 295-306.
[15]
Ferragina P, González R, Navarro G, Venturini R. Compressed text indexes: from theory to practice. ACM Journal of Experimental Algorithmics 2008; Volume 13: pp. 1-31.
[16]
González R, Grabowski S, Mäkinen V, Navarro G. Practical implementation of rank and select queries. In Proceedings of the 4th Workshop on Experimental and Efficient Algorithms (WEA), Santorini Island, Greece, 2005; pp. 27-38.
[17]
Grossi R, Gupta A, Vitter JS. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), Baltimore, Maryland, USA, 2003; pp. 841-850.
[18]
Navarro G. Wavelet trees for all. In Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), Helsinki, Finland, 2012; pp. 2-26.
[19]
Jacobson GJ. Succinct static data structures. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056.
[20]
Raman R, Raman V, Rao SS. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, CA, USA, 2002; pp. 233-242.
[21]
Claude F, Navarro G. Practical rank/select queries over arbitrary sequences. In Proceedings of the 15th International Conference on String Processing and Information Retrieval (SPIRE), Melbourne, Australia, 2008; pp. 176-187.
[22]
Pagh R. Low redundancy in static dictionaries with O(1) worst case lookup time. Technical Report RS-98-28, BRICS, Department of Computer Science, University of Aarhus, Midtbyen, Aarhus, Denmark, 1998.
[23]
Okanohara D, Sadakane K. Practical entropy-compressed rank/select dictionary. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, Louisiana, USA, 2007.
[24]
Elias P. Efficient storage and retrieval by content and address of static files. Journal of the ACM 1974; Volume 21 Issue 2: pp. 246-260.
[25]
Manber U, Myers EW. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 1993; Volume 22 Issue 5: pp. 935-948.
[26]
Navarro G, Mäkinen V. Compressed full-text indexes. ACM Computing Surveys 2007; Volume 39 Issue 1: pp. 1-31.
[27]
Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
[28]
Mäkinen V, Navarro G. Succinct suffix arrays based on run-length encoding. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), Jeju Island, Korea, 2005; pp. 45-56.
[29]
Kärkkäinen J, Puglisi SJ. Fixed block compression boosting in FM-indexes. In Proceedings of the 18th International Conference on String Processing and Information Retrieval (SPIRE), Pisa, Italy, 2011; pp. 174-184.
[30]
Fog A. Instruction tables, 2012. Available from: "http://www.agner.org/optimize/instruction_tables.pdf" accessed March 13, 2012.
[31]
Sadakane K. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 2003; Volume 48 Issue 2: pp. 294-313.

Published In

Software—Practice & Experience, Volume 44, Issue 11
November 2014
128 pages

Publisher

John Wiley & Sons, Inc.

United States

Author Tags

  1. FM-index
  2. SSE
  3. algorithm engineering
  4. binary sequences
  5. hugepages
  6. massive data sets
  7. rank
  8. select
  9. succinct data structures
