Abstract
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only \( O(n) \) bits for a text of length n; however, it also requires the text itself, which occupies \( O(n\log |\Sigma|) \) bits for an alphabet Σ. In contrast, our data structure does not use the text itself and supports three operations that are important for text databases: inverse, search, and decompress. Our algorithms find the occ occurrences of any substring P of the text in \( O(|P|\log n + occ\log^\varepsilon n) \) time and decompress a part of the text of length l in \( O(l + \log^\varepsilon n) \) time, for any given ε with 0 < ε ≤ 1. Our data structure occupies only \( n\left(\frac{2}{\varepsilon}\left(\frac{3}{2} + H_0 + 2\log H_0\right) + 2 + \frac{4\log^\varepsilon n}{\log^\varepsilon n - 1}\right) + o(n) + O(|\Sigma|\log|\Sigma|) \) bits, where \( H_0 \le \log|\Sigma| \) is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
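To make the inverse and search operations concrete, here is a minimal Python sketch on a plain, uncompressed suffix array. It is not the compressed data structure of the paper: it keeps the text in memory, which the proposed structure avoids, and it builds the array by naive sorting. The function names `build_suffix_array`, `inverse_suffix_array`, and `search` are hypothetical names chosen for this illustration.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    # Fine for a small illustration, not for large texts.
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    # inv[pos] is the lexicographic rank of the suffix starting at pos.
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv


def search(text, sa, pattern):
    # Two binary searches locate the range of suffixes that have
    # `pattern` as a prefix; each step compares at most len(pattern) characters.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Report the occurrence positions in text order.
    return sorted(sa[first:lo])


text = "abracadabra"
sa = build_suffix_array(text)
inv = inverse_suffix_array(sa)
occurrences = search(text, sa, "abra")
```

In this sketch the binary search performs O(log n) prefix comparisons of at most |P| characters each, which corresponds to the \( |P|\log n \) term of the search bound; the additional \( occ\log^\varepsilon n \) term in the paper accounts for reporting positions from the compressed representation, which this uncompressed sketch reads directly from the array.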
References
J. L. Bentley, D. D. Sleator, R. E. Tarjan, and V. K. Wei. A Locally Adaptive Data Compression Scheme. Communications of the ACM, 29(4):320–330, April 1986.
M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994.
P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. Inform. Theory, IT-21(2):194–203, March 1975.
M. Farach and T. Thorup. String-matching in Lempel-Ziv Compressed Strings. In 27th ACM Symposium on Theory of Computing, pages 703–713, 1995.
P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. Technical Report TR00-03, Dipartimento di Informatica, Università di Pisa, March 2000.
R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000. http://www.cs.duke.edu/~jsv/Papers/catalog/node68.html.
D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.
G. Jacobson. Space-efficient Static Trees and Graphs. In 30th IEEE Symp. on Foundations of Computer Science, pages 549–554, 1989.
P. Jokinen and E. Ukkonen. Two Algorithms for Approximate String Matching in Static Texts. In A. Tarlecki, editor, Proceedings of Mathematical Foundations of Computer Science, LNCS 520, pages 240–248, 1991.
J. Kärkkäinen and E. Sutinen. Lempel-Ziv Index for q-Grams. Algorithmica, 21(1):137–154, 1998.
T. Kasai, H. Arimura, R. Fujino, and S. Arikawa. Text data mining based on optimal pattern discovery — towards a scalable data mining system for large text databases—. In Summer DB Workshop, SIGDBS-116-20, pages 151–156. IPSJ, July 1998. (in Japanese).
T. Kida, Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. A Unifying Framework for Compressed Pattern Matching. In Proc. IEEE String Processing and Information Retrieval Symposium (SPIRE’99), pages 89–96, September 1999.
S. Kurtz. Reducing the Space Requirement of Suffix Trees. Technical Report 98–03, Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik, 1998.
U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, October 1993.
E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
E. Moura, G. Navarro, and N. Ziviani. Indexing compressed text. In Proc. of WSP’97, pages 95–111. Carleton University Press, 1997.
J. I. Munro. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Computer Science (FSTTCS’ 96), LNCS 1180, pages 37–42, 1996.
J. I. Munro. Personal communication, July 2000.
K. Sadakane. A Modified Burrows-Wheeler Transformation for Case-insensitive Search with Application to Suffix Array Compression. In Proceedings of IEEE Data Compression Conference (DCC’99), page 548, 1999. poster session.
K. Sadakane and H. Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Advances in Database Technologies, number 1552 in LNCS, pages 434–445, 1999.
K. Sadakane and H. Imai. Text Retrieval by using k-word Proximity Search. In Proceedings of International Symposium on Database Applications in Non-Traditional Environments (DANTE’99), pages 23–28. Research Project on Advanced Databases, 1999.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sadakane, K. (2000). Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array. In: Goos, G., Hartmanis, J., van Leeuwen, J., Lee, D.T., Teng, SH. (eds) Algorithms and Computation. ISAAC 2000. Lecture Notes in Computer Science, vol 1969. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-40996-3_35
DOI: https://doi.org/10.1007/3-540-40996-3_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41255-7
Online ISBN: 978-3-540-40996-0
eBook Packages: Springer Book Archive