Efficient String Matching Algorithm for Searching Large DNA and Binary Texts

Published: 01 October 2017

    Abstract

    Exact string matching is essential in application areas such as bioinformatics and intrusion detection systems. Speeding up string matching therefore accelerates searching in DNA and binary data. Two types of fast algorithms already exist: bit-parallel algorithms and hashing algorithms. Bit-parallel algorithms are efficient on short patterns (length less than 64) but slow on long patterns. Hashing algorithms, on the other hand, have optimal sublinear average-case behavior on large alphabets and long patterns, but are less efficient on small alphabets such as DNA and binary texts. In this paper, the authors present a hybrid algorithm that overcomes the shortcomings of these earlier algorithms. The proposed algorithm is based on q-gram hashing and guarantees the maximal shift in advance. Experimental results on random data and a complete human genome confirm that the proposed algorithm is efficient across various pattern lengths on small alphabets.
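
    To make the idea of q-gram hashing concrete, the following is a minimal sketch of a generic hash-based shift search of the kind the abstract refers to as "hashing algorithms". It is not the authors' hybrid algorithm, whose details the abstract does not give. The function names (qgram_hash, build_shift_table, search), the hash function, and the table_bits parameter are all assumptions chosen for illustration.

```python
def qgram_hash(s, i, q, mask):
    """Hash the q characters s[i:i+q] into a table index (illustrative hash)."""
    h = 0
    for c in s[i:i + q]:
        h = ((h << 2) + ord(c)) & mask
    return h


def build_shift_table(pattern, q, table_bits=12):
    """Map each q-gram hash to a safe shift for the search window."""
    m = len(pattern)
    mask = (1 << table_bits) - 1
    # A q-gram that never occurs in the pattern allows the largest shift.
    shift = [m - q + 1] * (1 << table_bits)
    for i in range(m - q + 1):
        # Later (rightmost) occurrences overwrite earlier ones, so each
        # entry holds the smallest, and therefore safe, shift.
        shift[qgram_hash(pattern, i, q, mask)] = m - q - i
    return shift, mask


def search(text, pattern, q=3):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m < q:
        raise ValueError("pattern must contain at least q characters")
    shift, mask = build_shift_table(pattern, q)
    matches = []
    pos = m - q  # start of the q-gram aligned with the pattern's last q chars
    while pos <= n - q:
        s = shift[qgram_hash(text, pos, q, mask)]
        if s == 0:
            # The hash agrees with the pattern's last q-gram: verify fully.
            start = pos - (m - q)
            if text[start:start + m] == pattern:
                matches.append(start)
            s = 1  # conservative shift after a verification
        pos += s
    return matches


if __name__ == "__main__":
    print(search("ACGTACGTTACGACGT", "ACGT"))  # [0, 4, 12]
```

    On a four-letter alphabet, a single character carries little information, so shifts based on one character are short. Hashing q consecutive characters makes an unseen q-gram more likely and so allows longer shifts; the best value of q depends on the pattern length and alphabet size, and the value 3 used here is only a placeholder.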

    Published In

    International Journal on Semantic Web & Information Systems  Volume 13, Issue 4
    October 2017
    220 pages
    ISSN:1552-6283
    EISSN:1552-6291

    Publisher

    IGI Global

    United States

    Author Tags

    1. DNA Searching
    2. Hashing
    3. String Matching
    4. q-Gram Practical Algorithms

    Qualifiers

    • Article
