DOI: 10.5555/545381.545472

A sub-quadratic sequence alignment algorithm for unrestricted cost matrices

Published: 06 January 2002

Abstract

The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global alignment computations. The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an O(n^2/log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(hn^2/log n), where h ≤ 1 is the entropy of the text.
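For orientation, the two ingredients named in the abstract can be sketched in Python. This is not the paper's sub-quadratic algorithm; it shows only the classical O(n^2) dynamic program for a global alignment score under an arbitrary scoring function, and a Lempel-Ziv parse of the kind that could induce the variable-sized blocks. The function names, the linear gap penalty, and the choice of the LZ78 variant are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Callable, List

# Hypothetical type: any function giving a weight to a pair of symbols.
Score = Callable[[str, str], float]


def global_alignment_score(a: str, b: str, sub: Score, gap: float) -> float:
    """Classical quadratic DP for the global alignment score (maximised).

    `sub` may return arbitrary weights; `gap` is a linear gap penalty
    (an assumption of this sketch). Only two rows are kept in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = max(
                prev[j - 1] + sub(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                          # gap in b
                cur[j - 1] + gap,                       # gap in a
            )
        prev = cur
    return prev[len(b)]


def lz78_phrases(s: str) -> List[str]:
    """LZ78-style parse: each phrase is a previously seen phrase plus one symbol.

    The final phrase may repeat an earlier one if the input ends mid-phrase.
    """
    seen = set()
    phrases: List[str] = []
    cur = ""
    for ch in s:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)
    return phrases


if __name__ == "__main__":
    unit = lambda x, y: 1 if x == y else -1
    print(global_alignment_score("ACGT", "AGT", unit, -1))
    print(lz78_phrases("aaaaaa"))
```

A repetitive string parses into few phrases, which is the property a block decomposition along phrase boundaries would rely on: fewer, larger blocks mean less work than filling every cell of the matrix.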

References

[1] Aggarwal, A., M. Klawe, S. Moran, P. Shor, and R. Wilber, Geometric Applications of a Matrix-Searching Algorithm, Algorithmica, 2, 195-208 (1987).
[2] Aggarwal, A., and J. Park, Notes on Searching in Multidimensional Monotone Arrays, Proc. 29th IEEE Symp. on Foundations of Computer Science, 497-512 (1988).
[3] Amir, A., G. Benson, and M. Farach, Let sleeping files lie: Pattern matching in Z-compressed files, J. of Comp. and Sys. Sciences, 52(2), 299-307 (1996).
[4] Apostolico, A., M. Atallah, L. Larmore, and S. McFaddin, Efficient parallel algorithms for string editing problems, SIAM J. Comput., 19, 968-998 (1990).
[5] Bell, T. C., J. C. Cleary, and I. H. Witten, Text Compression, Prentice Hall, (1990).
[6] Benson, G., A space efficient algorithm for finding the best nonoverlapping alignment score, Theoretical Computer Science, 145, 357-369 (1995).
[7] Crochemore, M., and W. Rytter, Text Algorithms, Oxford University Press, (1994).
[8] Eppstein, D., Sequence Comparison with Mixed Convex and Concave Costs, Journal of Algorithms, 11, 85-101 (1990).
[9] Eppstein, D., Z. Galil, and R. Giancarlo, Speeding Up Dynamic Programming, Proc. 29th IEEE Symp. on Foundations of Computer Science, 488-496 (1988).
[10] Eppstein, D., Z. Galil, R. Giancarlo, and G. F. Italiano, Sparse Dynamic Programming I: Linear Cost Functions, JACM, 39, 546-567 (1992).
[11] Eppstein, D., Z. Galil, R. Giancarlo, and G. F. Italiano, Sparse Dynamic Programming II: Convex and Concave Cost Functions, JACM, 39, 568-599 (1992).
[12] Erickson, B. W., and P. H. Sellers, Recognition of patterns in genetic sequences, in Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, D. Sankoff and J. B. Kruskal, eds., Addison-Wesley, Reading, MA, 55-91 (1983).
[13] Farach, M., and M. Thorup, String matching in Lempel-Ziv compressed strings, Algorithmica, 20, 388-404 (1998).
[14] Galil, Z., and R. Giancarlo, Speeding Up Dynamic Programming with Applications to Molecular Biology, Theoretical Computer Science, 64, 107-118 (1989).
[15] Galil, Z., and K. Park, A linear-time algorithm for concave one-dimensional dynamic programming, Info. Processing Letters, 33, 309-311 (1990).
[16] Gasieniec, L., M. Karpinski, W. Plandowski, and W. Rytter, Randomised efficient algorithms for compressed strings: the finger-print approach, Proc. 7th Annual Symposium On Combinatorial Pattern Matching, LNCS 1075, 39-49 (1996).
[17] Gasieniec, L., and W. Rytter, Almost optimal fully LZW compressed pattern matching, Data Compression Conference, J. Storer, ed., (1999).
[18] Giancarlo, R., Dynamic Programming: Special Cases, in Pattern Matching Algorithms, A. Apostolico and Z. Galil, eds., Oxford University Press, 201-232 (1997).
[19] Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, (1997).
[20] Kannan, S. K., and E. W. Myers, An Algorithm For Locating Non-Overlapping Regions of Maximum Alignment Score, SIAM J. Comput., 25(3), 648-662 (1996).
[21] Karkkainen, J., G. Navarro, and E. Ukkonen, Approximate String Matching over Ziv-Lempel Compressed Text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 195-209 (2000).
[22] Karkkainen, J., and E. Ukkonen, Lempel-Ziv parsing and sublinear-size index structures for string matching, Proc. Third South American Workshop on String Processing (WSP '96), 141-155 (1996).
[23] Kida, T., M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa, Shift-And approach to pattern matching in LZW compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 1-13 (1999).
[24] Klawe, M., and D. Kleitman, An Almost Linear Algorithm for Generalized Matrix Searching, SIAM Jour. Discrete Math., 3, 81-97 (1990).
[25] Landau, G. M., E. W. Myers, and J. P. Schmidt, Incremental String Comparison, SIAM J. Comput., 27(2), 557-582 (1998).
[26] Landau, G. M., and M. Ziv-Ukelson, On the Shared Substring Alignment Problem, Proc. Symposium On Discrete Algorithms, 804-814 (2000).
[27] Landau, G. M., and M. Ziv-Ukelson, On the Common Substring Alignment Problem, Journal of Algorithms.
[28] Lempel, A., and J. Ziv, On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75-81 (1976).
[29] Levenshtein, V. I., Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Phys. Dokl., 10, 707-710 (1966).
[30] Manber, U., A text compression scheme that allows fast searching directly in the compressed file, Proc. 5th Annual Symposium On Combinatorial Pattern Matching, LNCS 807, 113-124 (1994).
[31] Masek, W. J., and M. S. Paterson, A faster algorithm for computing string edit distances, J. Comput. Syst. Sci., 20, 18-31 (1980).
[32] Monge, G., Déblai et Remblai, Mémoires de l'Académie des Sciences, Paris (1781).
[33] Navarro, G., T. Kida, M. Takeda, A. Shinohara, and S. Arikawa, Faster Approximate String Matching Over Compressed Text, Proc. Data Compression Conference (DCC2001), IEEE Computer Society, 459-468 (2001).
[34] Navarro, G., and M. Raffinot, A general practical approach to pattern matching over Ziv-Lempel compressed text, Proc. 10th Annual Symposium On Combinatorial Pattern Matching, LNCS 1645, 14-36 (1999).
[35] Navarro, G., and M. Raffinot, Boyer-Moore string matching over Ziv-Lempel compressed text, Proc. 11th Annual Symposium On Combinatorial Pattern Matching, LNCS 1848, 166-180 (2000).
[36] Sankoff, D., and J. B. Kruskal (editors), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA, (1983).
[37] Schmidt, J. P., All Highest Scoring Paths In Weighted Grid Graphs and Their Application To Finding All Approximate Repeats In Strings, SIAM J. Comput., 27(4), 972-992 (1998).
[38] Shibata, Y., T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara, and S. Arikawa, Speeding up pattern matching by text compression, CIAC 2000, LNCS 1767, 306-315 (2000).
[39] Smith, T. F., and M. S. Waterman, Identification of common molecular subsequences, J. Molecular Biol., 147, 195-197 (1981).
[40] Szpankowski, W., and P. Jacquet, Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees, Theoretical Computer Science, 144, 161-197 (1995).
[41] Takeda, M., Y. Shibata, T. Matsumoto, T. Kida, A. Shinohara, S. Fukamachi, T. Shinohara, and S. Arikawa, Speeding up string pattern matching by text compression: The dawn of a new era, 42(3), pp. 370-384 (2001).
[42] Waterman, M. S., and M. Eggert, A new algorithm for best subsequence alignment with application to tRNA-rRNA comparisons, J. Molecular Biol., 197, 723-728 (1987).
[43] Welch, T. A., A Technique for High Performance Data Compression, IEEE Trans. on Computers, 17(6), 8-19 (1984).
[44] Ziv, J., and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, IT-23(3), 337-343 (1977).
[45] Ziv, J., and A. Lempel, Compression of individual sequences via variable rate coding, IEEE Trans. Inform. Th., 24, 530-536 (1978).


Published In

SODA '02: Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
January 2002
1018 pages
ISBN: 089871513X

Publisher

Society for Industrial and Applied Mathematics, United States


Acceptance Rates

Overall Acceptance Rate: 411 of 1,322 submissions, 31%

Cited By

  • (2012) A comparison of index-based Lempel-Ziv LZ77 factorization algorithms. ACM Computing Surveys 45:1 (1-17). DOI: 10.1145/2379776.2379781. Online publication date: 7-Dec-2012
  • (2009) Privacy-preserving genomic computation through program specialization. Proceedings of the 16th ACM conference on Computer and communications security (338-347). DOI: 10.1145/1653662.1653703. Online publication date: 9-Nov-2009
  • (2008) Algorithms for computing variants of the longest common subsequence problem. Theoretical Computer Science 395:2-3 (255-267). DOI: 10.1016/j.tcs.2008.01.009. Online publication date: 20-Apr-2008
  • (2007) Algorithms for computing the longest parameterized common subsequence. Proceedings of the 18th annual conference on Combinatorial Pattern Matching (265-273). DOI: 10.5555/2394373.2394409. Online publication date: 9-Jul-2007
  • (2007) Speeding up HMM decoding and training by exploiting sequence repetitions. Proceedings of the 18th annual conference on Combinatorial Pattern Matching (4-15). DOI: 10.5555/2394373.2394379. Online publication date: 9-Jul-2007
  • (2007) Protein similarity search with subset seeds on a dedicated reconfigurable hardware. Proceedings of the 7th international conference on Parallel processing and applied mathematics (1240-1248). DOI: 10.5555/1786194.1786340. Online publication date: 9-Sep-2007
  • (2007) Comparing Compressed Sequences for Faster Nucleotide BLAST Searches. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4:3 (349-364). DOI: 10.1109/TCBB.2007.1029. Online publication date: 1-Jul-2007
  • (2007) A New Efficient Algorithm for Computing the Longest Common Subsequence. Proceedings of the 3rd international conference on Algorithmic Aspects in Information and Management (82-90). DOI: 10.1007/978-3-540-72870-2_8. Online publication date: 6-Jun-2007
  • (2006) Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3:4 (408-422). DOI: 10.1109/TCBB.2006.53. Online publication date: 1-Oct-2006
  • (2005) Rapid Homology Search with Two-Stage Extension and Daughter Seeds. Proceedings of the 11th Annual International Conference on Computing and Combinatorics - Volume 3595 (104-114). DOI: 10.5555/2958119.2958197. Online publication date: 16-Aug-2005
