Efficient Online String Matching Based on Characters Distance Text Sampling

Published in Algorithmica

Abstract

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, such as natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced to overcome the prohibitive space requirements of index construction, on the one hand, and to drastically reduce the search time of online solutions, on the other. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search the sampled data online for any occurrence of the sampled pattern, before verifying each candidate against the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view, our solution shows sub-linear behaviour and speeds up online searching by a factor of up to 9, using limited additional space ranging from 11% down to 2.8% of the text size, a gain of up to 50% compared with previous solutions.
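
The main idea described in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the function names (`pivot_positions`, `distances`, `naive_search`, `sampled_search`) are hypothetical, the sampled data is scanned with a plain window comparison rather than a fast online matcher, and the fallback for patterns containing fewer than two pivot occurrences is an assumption made only to keep the example complete.

```python
def pivot_positions(s, pivot):
    """Positions of the pivot character in s."""
    return [i for i, ch in enumerate(s) if ch == pivot]


def distances(positions):
    """Distances between consecutive pivot occurrences."""
    return [b - a for a, b in zip(positions, positions[1:])]


def naive_search(text, pattern):
    """All (possibly overlapping) occurrences, found by repeated str.find."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def sampled_search(text, pattern, pivot):
    """Filter candidates on the distance sequences, then verify in the text."""
    text_pos = pivot_positions(text, pivot)
    pat_pos = pivot_positions(pattern, pivot)
    if len(pat_pos) < 2:
        # Assumption of this sketch: with fewer than two pivot occurrences
        # the sampled pattern is empty, so search the original text directly.
        return naive_search(text, pattern)

    text_dist = distances(text_pos)
    pat_dist = distances(pat_pos)
    hits = []
    for j in range(len(text_dist) - len(pat_dist) + 1):
        # Candidate: the sampled pattern occurs in the sampled text at j.
        if text_dist[j:j + len(pat_dist)] == pat_dist:
            # Align the first pivot of the pattern with the j-th pivot of the text.
            start = text_pos[j] - pat_pos[0]
            # Verification step against the original text.
            if start >= 0 and text.startswith(pattern, start):
                hits.append(start)
    return hits


print(sampled_search("abracadabra abracadabra", "abracadabra", "a"))  # [0, 12]
```

In this sketch only the pivot positions and their distances are inspected during filtering, and the original text is read only to verify candidates, which mirrors the filter-then-verify structure described above.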

Notes

  1. The search speed of an online string matching algorithm may depend on the length of the pattern. The typical search speed of a fast solution, on a modern laptop computer, ranges from 1 GB/s (for short patterns) to 5 GB/s (for very long patterns) [5].

  2. The search speed of a fast offline solution does not depend on the length of the text and is typically under 1 ms per query.

  3. According to their theoretical evaluation and their experimental results, when searching an English text the best performance is obtained when the least 13 characters are removed from the original alphabet.

  4. In practical cases we can implement our solution with a block size \(k=256\), which allows each element of the sequence \(\dot{y}\) to be represented using a single byte. In such a case the assumption \(k\ge \sigma\) is plausible for any practical application.

  5. The Smart tool is available online for download at http://www.dmi.unict.it/~faro/smart/ or at https://github.com/smart-tool/smart.

  6. Specifically, the text buffer is the concatenation of two different texts: the King James version of the Bible (3.9 MB) and the CIA World Factbook (2.4 MB). The first 5 MB of the resulting text buffer were used in our experiments.

References

  1. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms. Addison-Wesley, London (1974)

  2. Apostolico, A.: The myriad virtues of suffix trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO Advanced Science Institutes, Series F, vol. 12, pp. 85–96. Springer, Berlin (1985)

  3. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)

  4. Cantone, D., Faro, S., Giaquinta, E.: Adapting Boyer-Moore-like algorithms for searching Huffman encoded texts. Int. J. Found. Comput. Sci. 23(2), 343–356 (2012)

  5. Cantone, D., Faro, S., Pavone, A.: Speeding up string matching by weak factor recognition. Stringology 2017, 42–50 (2017)

  6. Claude, F., Navarro, G., Peltola, H., Salmela, L., Tarhio, J.: String matching with alphabet sampling. J. Discrete Algorithms 11, 37–50 (2012)

  7. Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W.: Speeding up two string-matching algorithms. Algorithmica 12(4), 247–267 (1994)

  8. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)

  9. Faro, S., Lecroq, T., Borzì, S., Di Mauro, S., Maggio, A.: The String Matching Algorithms Research Tool. In: Proceedings of Stringology, pp. 99–111 (2016)

  10. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  11. Fredriksson, K., Grabowski, S.: A general compression algorithm that supports fast searching. Inf. Process. Lett. 100(6), 226–232 (2006)

  12. Grabowski, S., Raniszewski, M.: Sampling the suffix array with minimizers. In: Proceedings of String Processing and Information Retrieval (SPIRE 2015), Lecture Notes in Computer Science, vol. 9309, pp. 287–298. Springer (2015)

  13. Horspool, R.N.: Practical fast searching in strings. Softw. Pract. Exp. 10(6), 501–506 (1980)

  14. Karkkainen, J., Ukkonen, E.: Sparse suffix trees. In: Proceedings of 2nd Annual International Conference on Computing and Combinatorics (COCOON), LNCS 1090, pp. 219–230 (1996)

  15. Klein, S.T., Shapira, D.: A new compression method for compressed matching. In: Data Compression Conference, IEEE. pp. 400–409 (2000)

  16. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)

  17. Manber, U.: A text compression scheme that allows fast searching directly in the compressed file. ACM Trans. Inf. Syst. 15(2), 124–136 (1997)

  18. Manber, U., Myers, G.: Suffix arrays: a new method for online string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  19. Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM Trans. Inf. Syst. 18(2), 113–139 (2000)

  20. Navarro, G., Tarhio, J.: LZgrep: a Boyer-Moore string matching tool for Ziv-Lempel compressed text. Softw. Pract. Exp. 35, 1107–1130 (2005)

  21. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: CIAC, pp. 306–315 (2000)

  22. Yao, A.C.: The complexity of pattern matching for a random string. SIAM J. Comput. 8(3), 368–387 (1979)

Author information

Corresponding author

Correspondence to Simone Faro.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Faro, S., Marino, F.P. & Pavone, A. Efficient Online String Matching Based on Characters Distance Text Sampling. Algorithmica 82, 3390–3412 (2020). https://doi.org/10.1007/s00453-020-00732-4

  • DOI: https://doi.org/10.1007/s00453-020-00732-4

Keywords