Abstract
We present techniques for processing text files “as is,” that is, without modifying the given files. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file in such a language is a mixture of single-byte and multi-byte characters. A naive solution would be either (1) to convert the given text into a fixed-length encoding and then apply any string matching routine to it, or (2) to search the text file directly, byte by byte, for (the encoding of) the pattern, which requires extra work for synchronization to avoid false detection. Both solutions, however, sacrifice search speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code, such as the Huffman code and variants of Unicode. We also generalize the technique to handle structured texts such as XML documents. Using this technique, we can avoid false detection of a keyword even when it is a substring of a tag name or of an attribute description, without any sacrifice of search speed.
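The false-detection problem behind naive solution (2) can be illustrated with a minimal sketch. It is not the algorithm of this paper: it only contrasts a plain byte search with a search that resynchronizes on character boundaries. The toy encoding (bytes below 0x80 are single-byte characters, any other byte is a lead byte followed by exactly one trail byte) and the names `char_length`, `naive_find`, and `synchronized_find` are assumptions made for illustration.

```python
# Toy prefix code (an assumption for illustration, not a real encoding):
# a byte below 0x80 is a single-byte character; a byte of 0x80 or above
# is a lead byte that is always followed by exactly one trail byte.
def char_length(lead: int) -> int:
    return 1 if lead < 0x80 else 2


def naive_find(text: bytes, pattern: bytes) -> list:
    """Plain byte search; may report matches that start inside a character."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def synchronized_find(text: bytes, pattern: bytes) -> list:
    """Report only matches that begin on a character boundary."""
    hits = []
    pos = 0
    while pos < len(text):
        if text.startswith(pattern, pos):
            hits.append(pos)
        pos += char_length(text[pos])  # jump to the next character boundary
    return hits


# One two-byte character (0x81 0x41) followed by the single-byte "B".
text = bytes([0x81, 0x41, 0x42])
pattern = b"A"  # 0x41, which also occurs above as a trail byte

print(naive_find(text, pattern))         # [1], a false detection
print(synchronized_find(text, pattern))  # []
```

Because the code is a prefix code, a pattern made of whole characters that matches from a character boundary also ends on one, so checking only the start position suffices in this sketch. The boundary tracking is the extra synchronization work the abstract refers to.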
References
A.V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. ACM, 18(6):333–340, 1975.
A. Amir and G. Benson. Efficient two-dimensional compressed matching. In Proc. Data Compression Conference, page 279, 1992.
S. Arikawa and T. Shinohara. A run-time efficient realization of Aho-Corasick pattern matching machines. New Generation Computing, 2(2):171–186, 1984.
S. Arikawa et al. The text database management system SIGMA: An improvement of the main engine. In Proc. of Berliner Informatik-Tage, pages 72–81, 1989.
J. Jaakkola and P. Kilpeläinen. A tool to search structured text. University of Helsinki. (In preparation).
S. T. Klein and D. Shapira. Pattern matching in Huffman encoded texts. In Proc. Data Compression Conference 2001, pages 449–458. IEEE Computer Society, 2001.
D. E. Knuth. The Art of Computer Programming, Sorting and Searching, volume 3. Addison-Wesley, 1973.
N. J. Larsson and A. Moffat. Offline dictionary-based compression. In Proc. Data Compression Conference '99, pages 296–305. IEEE Computer Society, 1999.
M. Miyazaki, S. Fukamachi, M. Takeda, and T. Shinohara. Speeding up the pattern matching machine for compressed texts. Transactions of Information Processing Society of Japan, 39(9):2638–2648, 1998. (in Japanese).
D. Revuz. Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science, 92(1):181–189, 1992.
Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern matching. In Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, volume 1848 of Lecture Notes in Computer Science, pages 181–194. Springer-Verlag, 2000.
N. Uratani and M. Takeda. A fast string-searching algorithm for multiple patterns. Information Processing & Management, 29(6):775–791, 1993.
M. Yoshikawa and T. Amagasa. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Transactions on Internet Technology, 1(1):110–141, August 2001.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takeda, M. et al. (2002). Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts. In: Laender, A.H.F., Oliveira, A.L. (eds) String Processing and Information Retrieval. SPIRE 2002. Lecture Notes in Computer Science, vol 2476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45735-6_16
DOI: https://doi.org/10.1007/3-540-45735-6_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44158-8
Online ISBN: 978-3-540-45735-0
eBook Packages: Springer Book Archive