DOI: 10.1145/2382936.2382940
Research article

Improved multiple sequence alignments using coupled pattern mining

Published: 07 October 2012

Abstract

We present ARMiCoRe, a novel approach to a classical bioinformatics problem: multiple sequence alignment (MSA) of gene and protein sequences. Aligning multiple biological sequences is a key step in elucidating evolutionary relationships, annotating newly sequenced segments, and understanding the relationship between biological sequences and functions. Classical MSA algorithms are designed primarily to capture conservation in sequences, whereas couplings, or correlated mutations, are well known as an additional important aspect of sequence evolution. (Two sequence positions are coupled when mutations in one are accompanied by compensatory mutations in the other.) As a result, better exposition of couplings is sometimes one of the reasons practitioners hand-tweak MSAs. ARMiCoRe introduces a distinctly pattern-mining approach to improving MSAs: using frequent episode mining as a foundation, we define the notion of a coupled pattern and demonstrate how discovering and tiling coupled patterns with a max-flow approach can yield MSAs that are better than conservation-based alignments. Although we were motivated to improve MSAs for the sake of better exposing couplings, we demonstrate that our MSAs also improve on traditional assessment metrics. We demonstrate the effectiveness of ARMiCoRe on a large collection of datasets.
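
As a rough illustration of what a coupling looks like in an alignment, the sketch below scores a pair of columns by their mutual information. This is a generic covariation measure chosen for brevity, not the coupled-pattern mining or max-flow tiling that ARMiCoRe uses; the function names and the four-sequence toy alignment are assumptions made for this example.

```python
from collections import Counter
from math import log2


def column(msa, j):
    """Return column j of an alignment given as equal-length strings."""
    return [row[j] for row in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 2 are fully conserved,
# while columns 1 and 3 change together (K with D, E with K).
toy_msa = [
    "AKLD",
    "AKLD",
    "AELK",
    "AELK",
]

conserved_vs_variable = mutual_information(toy_msa, 0, 1)
covarying_pair = mutual_information(toy_msa, 1, 3)
```

In this toy case a conserved column carries no information about any other column, whereas the two covarying columns share a full bit. A purely conservation-driven score would reward columns 0 and 2 and say nothing about the relationship between columns 1 and 3, which is the kind of signal the abstract refers to as a coupling.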

      Published In

      BCB '12: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
      October 2012
      725 pages
      ISBN:9781450316705
      DOI:10.1145/2382936
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. bioinformatics
      2. coupled patterns
      3. coupled residues
      4. max-flow problems
      5. multiple sequence alignment
      6. pattern set mining

      Qualifiers

      • Research-article

      Conference

      BCB '12

      Acceptance Rates

      BCB '12 Paper Acceptance Rate 33 of 159 submissions, 21%;
      Overall Acceptance Rate 254 of 885 submissions, 29%
