
When indexing equals compression: Experiments with compressing suffix arrays and applications

Published: 01 October 2006

Abstract

We report on a new experimental analysis of high-order entropy-compressed suffix arrays, which retains the theoretical performance of previous work and represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, the index requires roughly 20% of the original text size and does not need a separate copy of the text. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a compression ratio comparable to that of bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
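
For readers unfamiliar with the objects named in the abstract, the following is a minimal Python sketch of an uncompressed suffix array and of the Burrows-Wheeler transform derived from it. It is not the compressed representation studied in the article. The function names, the naive sorting-based construction, and the `$` sentinel (assumed to be unique and to terminate the text) are illustrative assumptions only.

```python
def suffix_array(text: str) -> list[int]:
    """Naive construction (assumption: fine for small inputs only).

    Returns the start positions of all suffixes of `text`,
    ordered by the lexicographic order of the suffixes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """BWT[i] is the character that cyclically precedes the i-th smallest suffix."""
    return "".join(text[i - 1] for i in sa)


def inverse_bwt(bwt: str, sentinel: str = "$") -> str:
    """Recover the text from its BWT, assuming `sentinel` is unique and ends the text."""
    # A stable sort of positions by character pairs each character of the
    # sorted first column with the same occurrence in the last column (bwt).
    order = sorted(range(len(bwt)), key=lambda i: bwt[i])
    # The row whose last character is the sentinel is the original text itself.
    row = bwt.index(sentinel)
    out = []
    for _ in range(len(bwt)):
        row = order[row]      # move to the row rotated left by one character
        out.append(bwt[row])  # its last character is the next text character
    return "".join(out)


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    bwt = bwt_from_suffix_array(text, sa)
    print(sa)                # [6, 5, 3, 1, 0, 4, 2]
    print(bwt)               # annb$aa
    print(inverse_bwt(bwt))  # banana$
```

The sketch stores one integer per text position, which is the space cost that compressed suffix arrays aim to reduce; the transform is invertible from the last column alone, which is why a block-sorting compressor needs no separate copy of the text.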


Published In

ACM Transactions on Algorithms, Volume 2, Issue 4
October 2006
233 pages
ISSN: 1549-6325
EISSN: 1549-6333
DOI: 10.1145/1198513

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Burrows-Wheeler Transform
  2. Entropy
  3. suffix array
  4. text indexing

Qualifiers

  • Article

Cited By

  • (2024) An Average-Case Efficient Two-Stage Algorithm for Enumerating All Longest Common Substrings of Minimum Length $k$ Between Genome Pairs. 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI) (93-102). DOI: 10.1109/ICHI61247.2024.00020. Online publication date: 3-Jun-2024
  • (2022) A Learned Approach to Design Compressed Rank/Select Data Structures. ACM Transactions on Algorithms 18:3 (1-28). DOI: 10.1145/3524060. Online publication date: 11-Oct-2022
  • (2022) A scalable approach for index compression using wavelet tree and LZW. International Journal of Information Technology 14:4 (2191-2204). DOI: 10.1007/s41870-022-00915-y. Online publication date: 12-Apr-2022
  • (2022) Adaptive Succinctness. Algorithmica 84:3 (694-718). DOI: 10.1007/s00453-021-00872-1. Online publication date: 1-Mar-2022
  • (2021) Practical High-order Entropy-compressed Text Self-indexing. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2021.3114401. Online publication date: 2021
  • (2021) CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 38:2 (335-343). DOI: 10.1093/bioinformatics/btab655. Online publication date: 15-Sep-2021
  • (2021) A comprehensive analysis of wavelet tree based indexing schemes in GIR systems. International Journal of Information Technology. DOI: 10.1007/s41870-021-00683-1. Online publication date: 30-Apr-2021
  • (2020) A Hybrid Compressed Data Structure Supporting Rank and Select on Bit Sequences. 2020 39th International Conference of the Chilean Computer Science Society (SCCC) (1-8). DOI: 10.1109/SCCC51225.2020.9281244. Online publication date: 16-Nov-2020
  • (2020) Improved parallel construction of wavelet trees and rank/select structures. Information and Computation 273 (104516). DOI: 10.1016/j.ic.2020.104516. Online publication date: Aug-2020
  • (2020) Comparison between text compression algorithms in biological sequences. Information and Computation 270:C. DOI: 10.1016/j.ic.2019.104466. Online publication date: 1-Feb-2020
