Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons

Published: 01 October 2006

Abstract

Multiple Sequence Alignment (MSA) is one of the most fundamental problems in computational molecular biology. The running time of the best known scheme for finding an optimal alignment, based on dynamic programming, increases exponentially with the number of input sequences. Hence, many heuristics were suggested for the problem. We consider a version of the MSA problem where the goal is to find an optimal alignment in which matches are restricted to positions in predefined matching segments. We present several techniques for making the dynamic programming algorithm more efficient, while still finding an optimal solution under these restrictions. We prove that it suffices to find an optimal alignment of the predefined sequence segments, rather than single letters, thereby reducing the input size and thus improving the running time. We also identify "shortcuts" that expedite the dynamic programming scheme. Empirical study shows that, taken together, these observations lead to an improved running time over the basic dynamic programming algorithm by 4 to 12 orders of magnitude, while still obtaining an optimal solution. Under the additional assumption that matches between segments are transitive, we further improve the running time for finding the optimal solution by restricting the search space of the dynamic programming algorithm.
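
To make the exponential cost of the basic dynamic programming scheme concrete, the following is a minimal sketch of that baseline only; it is not the segment-based algorithm or the "shortcuts" described in the abstract. It assumes a sum-of-pairs objective with illustrative scores (match +1, mismatch -1, gap -1, gap against gap 0), and the names `column_score` and `msa_optimal_score` are hypothetical. Each cell of the k-dimensional table has up to 2^k - 1 predecessors, and the table itself has one cell per combination of prefix lengths, which is why the running time grows exponentially with the number of sequences.

```python
from functools import lru_cache
from itertools import combinations, product


def column_score(column, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one alignment column (None marks a gap).

    The score values are illustrative assumptions, not taken from the paper.
    """
    total = 0
    for a, b in combinations(column, 2):
        if a is None and b is None:
            continue  # gap against gap contributes nothing
        if a is None or b is None:
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def msa_optimal_score(seqs):
    """Optimal sum-of-pairs score via the basic k-dimensional DP."""
    seqs = tuple(seqs)
    k = len(seqs)
    # Each non-zero 0/1 vector says which sequences contribute a letter
    # to the last column; there are 2^k - 1 such vectors.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        # pos[i] is the length of the prefix of seqs[i] aligned so far.
        if all(p == 0 for p in pos):
            return 0
        candidates = []
        for move in moves:
            if any(step > p for step, p in zip(move, pos)):
                continue  # cannot consume a letter from an empty prefix
            prev = tuple(p - step for p, step in zip(pos, move))
            column = tuple(
                seq[p - 1] if step else None
                for seq, p, step in zip(seqs, pos, move)
            )
            candidates.append(best(prev) + column_score(column))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))
```

For example, `msa_optimal_score(["AC", "AC", "AC"])` evaluates a 3 x 3 x 3 table with up to 7 predecessors per cell. The recursion depth grows with the total sequence length, so this sketch is only suitable for very small inputs.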

Cited By

  • (2012) A parallel algorithm for multiple biological sequence alignment. Proceedings of the 9th international conference on Information Processing in Cells and Tissues, pp. 264-276. DOI: 10.1007/978-3-642-28792-3_31. Online publication date: 31-Mar-2012
  • (2010) Nonlinear synchronization for automatic learning of 3D pose variability in human motion sequences. EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 4-4. DOI: 10.1155/2010/507247. Online publication date: 1-Jan-2010

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 3, Issue 4
October 2006
117 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Multiple Sequence Alignment
  2. algorithms
  3. dynamic programming
  4. shortest path

Qualifiers

  • Article
