Compressed representations of sequences and full-text indexes

Published: 01 May 2007

Abstract

Given a sequence $S = s_1 s_2 \ldots s_n$ of integers smaller than $r = O(\mathrm{polylog}(n))$, we show how $S$ can be represented using $nH_0(S) + o(n)$ bits, so that we can know any $s_q$, as well as answer rank and select queries on $S$, in constant time. $H_0(S)$ is the zero-order empirical entropy of $S$, and $nH_0(S)$ provides an information-theoretic lower bound to the bit storage of any sequence $S$ via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in $O(\log r)$ time. For larger $r$, we can still represent $S$ in $nH_0(S) + o(n \log r)$ bits and answer queries in $O(\log r / \log\log n)$ time.
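To make these quantities concrete, the following is a minimal Python sketch of the zero-order empirical entropy and of what rank and select queries return. The function names (`h0`, `rank`, `select`) and the 1-based position convention are assumptions made for illustration. These are plain linear-time reference definitions, not the constant-time compressed representation the article describes.

```python
import math
from collections import Counter

def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    counts = Counter(seq)
    # Sum over distinct symbols of (n_c / n) * log2(n / n_c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def rank(seq, c, i):
    """Number of occurrences of c in the prefix seq[1..i] (1-based, inclusive)."""
    return sum(1 for x in seq[:i] if x == c)

def select(seq, c, j):
    """1-based position of the j-th occurrence of c, or None if it does not exist."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x == c:
            seen += 1
            if seen == j:
                return pos
    return None
```

Under these definitions, `len(S) * h0(S)` is the $nH_0(S)$ term of the space bound, and `select(S, c, rank(S, c, i))` returns `i` whenever the symbol at position `i` is `c`.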
Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet $\Sigma$. Specifically, we design a variant of the FM-index that indexes a string $T[1, n]$ within $nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$th-order empirical entropy of $T$. This space bound holds simultaneously for all $k \le \alpha \log_{|\Sigma|} n$, constant $0 < \alpha < 1$, and $|\Sigma| = O(\mathrm{polylog}(n))$. This index counts the occurrences of an arbitrary pattern $P[1, p]$ as a substring of $T$ in $O(p)$ time; it locates each pattern occurrence in $O(\log^{1+\varepsilon} n)$ time for any constant $0 < \varepsilon < 1$; and reports a text substring of length $\ell$ in $O(\ell + \log^{1+\varepsilon} n)$ time.
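The counting query can be illustrated with the standard backward search over the Burrows-Wheeler transform, which performs two rank queries per pattern symbol. The sketch below is a simplified illustration under stated assumptions, not the article's index: the names `bwt`, `count_occurrences`, and `occ` are hypothetical, the terminator `$` is assumed not to occur in the text and to sort before every text symbol, the pattern is assumed non-empty, and rank is computed naively in linear time where the article's representation answers it in constant time.

```python
from collections import Counter

def bwt(text):
    """BWT of text + '$' via sorted rotations (quadratic space, illustration only)."""
    t = text + "$"  # assumption: '$' is absent from text and smaller than all its symbols
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text by backward search."""
    last = bwt(text)
    # C[c] = number of symbols in the BWT that are strictly smaller than c.
    counts = Counter(last)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # rank of c in last[0:i]; naive scan here, constant time in a real index.
        return last[:i].count(c)

    lo, hi = 0, len(last)  # half-open range of sorted-suffix rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

With constant-time rank, the loop above runs in time proportional to the pattern length, which is the shape of the $O(p)$ counting bound.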
Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the $k$th-order entropy of the text $T$, which is the best space obtained in previous work. We can also handle larger alphabets of size $|\Sigma| = O(n^\beta)$, for any $0 < \beta < 1$, by paying $o(n \log|\Sigma|)$ extra space and multiplying all query times by $O(\log|\Sigma| / \log\log n)$.
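For reference, the $k$th-order empirical entropy that governs the space bound can be computed directly from its usual definition: group the symbols of the text by the $k$ symbols that precede them, and add up the zero-order entropy of each group. The sketch below assumes that following-context convention (other presentations use the symbols preceding each context instead), and the function name `hk` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    # followers[w] counts the symbols that appear right after context w in text.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    total_bits = 0.0
    for ctx_counts in followers.values():
        m = sum(ctx_counts.values())
        # m * H0 of the symbols that follow this context.
        total_bits += sum(c * math.log2(m / c) for c in ctx_counts.values())
    return total_bits / n
```

For `k = 0` the only context is the empty string, so the function reduces to the zero-order entropy. On a highly repetitive text such as `"abababab"`, `hk(text, 1)` is 0 while `hk(text, 0)` is 1 bit per symbol, which shows why a bound in terms of $H_k$ can be much smaller than one in terms of $H_0$.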

Published In

ACM Transactions on Algorithms  Volume 3, Issue 2
May 2007
338 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/1240233

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Burrows-Wheeler transform
  2. Text indexing
  3. compression boosting
  4. entropy
  5. rank and select
  6. text compression
  7. wavelet tree

Qualifiers

  • Article