Abstract
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, such as natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced to overcome the prohibitive space requirements of index construction, on the one hand, and to drastically reduce the searching time of online solutions, on the other. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search the sampled data online for any occurrence of the sampled pattern, before verifying the candidate occurrences in the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view, our solution shows sub-linear behaviour and speeds up online searching by a factor of up to 9, using limited additional space ranging from 11% down to 2.8% of the text size, with a gain of up to 50% compared with previous solutions.
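The main idea stated above can be illustrated with a minimal Python sketch. It is not the algorithm analysed in the paper: the function names are hypothetical, the sampled sequences are scanned naively, a single fixed pivot is assumed, and patterns containing fewer than two pivot occurrences simply fall back to a plain online search. In a practical setting the sampled text would presumably be computed once and stored alongside the text, rather than rebuilt at every query as done here.

```python
def pivot_positions(s, pivot):
    """Return the indices at which `pivot` occurs in `s`."""
    return [i for i, ch in enumerate(s) if ch == pivot]


def distances(positions):
    """Distances between consecutive pivot occurrences (the sampled data)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def naive_find_all(text, pattern):
    """Plain online search, used as a fallback and as a reference."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1) if text[s:s + m] == pattern]


def sampled_search(text, pattern, pivot):
    """Report all start positions of `pattern` in `text` (illustrative only)."""
    text_pos = pivot_positions(text, pivot)
    pat_pos = pivot_positions(pattern, pivot)
    pat_dist = distances(pat_pos)
    if not pat_dist:
        # Assumption of this sketch: with fewer than two pivots in the
        # pattern there is nothing to sample, so search the text directly.
        return naive_find_all(text, pattern)

    text_dist = distances(text_pos)
    k = len(pat_dist)
    hits = []
    # Search the sampled text for the sampled pattern.
    for i in range(len(text_dist) - k + 1):
        if text_dist[i:i + k] != pat_dist:
            continue
        # Align the first pivot of the pattern with the i-th pivot of the
        # text, then verify the candidate against the original text.
        start = text_pos[i] - pat_pos[0]
        if start >= 0 and text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits
```

For example, `sampled_search("abracadabra abracadabra", "abra", "a")` compares the single sampled distance of the pattern against the nine sampled distances of the text and verifies only the four candidates whose distance matches. The verification step is what keeps the result exact, since equal distances alone do not imply an occurrence.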
Notes
The search speed of an online string matching algorithm may depend on the length of the pattern. The typical search speed of a fast solution on a modern laptop computer ranges from 1 GB/s (for short patterns) to 5 GB/s (for very long patterns) [5].
The search speed of a fast offline solution does not depend on the length of the text and is typically under 1 ms per query.
According to their theoretical evaluation and their experimental results, when searching on an English text the best performance is obtained when the least 13 characters are removed from the original alphabet.
In practical cases we can implement our solution with a block size \(k=256\), which allows the elements of the sequence \(\dot{y}\) to be represented using a single byte. In such a case the assumption \(k\ge \sigma\) is plausible for any practical application.
The Smart tool is available online for download at http://www.dmi.unict.it/~faro/smart/ or at https://github.com/smart-tool/smart.
Specifically, the text buffer is the concatenation of two different texts: the King James Version of the Bible (3.9 MB) and the CIA World Factbook (2.4 MB). The first 5 MB of the resulting text buffer were used in our experiments.
References
Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, London (1974)
Apostolico, A.: The myriad virtues of suffix trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO Advanced Science Institutes, Series F, vol. 12, pp. 85–96. Springer, Berlin (1985)
Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)
Cantone, D., Faro, S., Giaquinta, E.: Adapting Boyer-Moore-like algorithms for searching Huffman encoded texts. Int. J. Found. Comput. Sci. 23(2), 343–356 (2012)
Cantone, D., Faro, S., Pavone, A.: Speeding up string matching by weak factor recognition. Stringology 2017, 42–50 (2017)
Claude, F., Navarro, G., Peltola, H., Salmela, L., Tarhio, J.: String matching with alphabet sampling. J. Discrete Algorithms 11, 37–50 (2012)
Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12(4), 247–267 (1994)
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)
Faro, S., Lecroq, T., Borzì, S., Di Mauro, S., Maggio, A.: The String Matching Algorithms Research Tool. In: Proceedings of Stringology, pp. 99–111 (2016)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Fredriksson, K., Grabowski, S.: A general compression algorithm that supports fast searching. Inf. Process. Lett. 100(6), 226–232 (2006)
Grabowski, S., Raniszewski, M.: Sampling the suffix array with minimizers. In: Proceedings of String Processing and Information Retrieval (SPIRE 2015), Lecture Notes in Computer Science, vol. 9309, Springer, pp. 287–298 (2015)
Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)
Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Proceedings of 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS 1090, pp. 219–230 (1996)
Klein, S.T., Shapira, D.: A new compression method for compressed matching. In: Data Compression Conference, IEEE, pp. 400–409 (2000)
Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. ACM Trans. Inf. Syst. 15(2), 124–136 (1997)
Manber, U., Myers, G.: Suffix arrays: a new method for online string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. 18(2), 113–139 (2000)
Navarro, G., Tarhio, J.: LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exp. 35, 1107–1130 (2005)
Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: CIAC, pp. 306–315 (2000)
Yao, A.C.: The complexity of pattern matching for a random string. SIAM J. Comput. 8(3), 368–387 (1979)
Cite this article
Faro, S., Marino, F.P. & Pavone, A. Efficient Online String Matching Based on Characters Distance Text Sampling. Algorithmica 82, 3390–3412 (2020). https://doi.org/10.1007/s00453-020-00732-4
DOI: https://doi.org/10.1007/s00453-020-00732-4