Discovering almost any hidden motif from multiple sequences

Published: 31 March 2011

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \cdots g_m$ is a string of $m$ characters. Each background sequence is implanted with a probabilistically generated approximate copy of $G$. For a probabilistically generated approximate copy $b_1 b_2 \cdots b_m$ of $G$, every character is probabilistically generated such that the probability for $b_i \neq g_i$ is at most $\alpha$.
In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \ge 2$ and is applicable to DNA motif discovery. We prove that for $\alpha < \frac{1}{8}\left(1 - \frac{1}{|\Sigma|}\right)$, there exist positive constants $c_0$, $\epsilon$, and $\delta_2$ such that if there are at least $c_0 \log n$ input sequences, then in $O\left(\frac{n^2}{h}(\log n)^{O(1)}\right)$ time this algorithm finds the motif with probability at least $3/4$ for every $G \in \Sigma^{\rho} \setminus \Psi_{\rho,h}(\Sigma)$, where $n$ is the length of the longest sequences, $\rho$ is the length of the motif, $h$ is a parameter with $\rho \ge 4h \ge \delta_2 \log n$, and $\Psi_{\rho,h}(\Sigma)$ is a small subset of at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^{\rho}$.
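
The generative model described in the abstract can be illustrated with a minimal Python sketch. This is not the article's algorithm: it only samples instances of the model and, as a sanity check, takes a column-wise majority vote when the implant positions are already known (the real problem hides them). All function names are hypothetical, the mutation probability is set to exactly alpha (the abstract only bounds it from above), a mutated character is assumed to be drawn uniformly from the other characters, and `alpha_threshold` assumes the bound reads as (1/8)(1 - 1/|Σ|).

```python
import random
from collections import Counter


def implant_noisy_copy(motif, alphabet, alpha, rng):
    """Approximate copy of motif: each character is replaced, independently
    with probability alpha, by a different character of the alphabet.
    (Assumption: the replacement is uniform over the other characters.)"""
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def generate_instance(motif, alphabet, k, n, alpha, seed=0):
    """Sample k random background sequences of length n and implant one
    noisy copy of motif into each at a uniformly random position."""
    rng = random.Random(seed)
    sequences, positions = [], []
    for _ in range(k):
        background = [rng.choice(alphabet) for _ in range(n)]
        pos = rng.randrange(n - len(motif) + 1)
        background[pos:pos + len(motif)] = implant_noisy_copy(
            motif, alphabet, alpha, rng)
        sequences.append("".join(background))
        positions.append(pos)
    return sequences, positions


def column_majority(sequences, positions, m):
    """Majority vote per motif column, given the (normally hidden) implant
    positions. Illustrates why many noisy copies pin down the motif."""
    consensus = []
    for i in range(m):
        counts = Counter(s[p + i] for s, p in zip(sequences, positions))
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def alpha_threshold(alphabet_size):
    """Noise bound from the abstract, read as (1/8) * (1 - 1/|Sigma|)."""
    return (1 - 1 / alphabet_size) / 8


if __name__ == "__main__":
    seqs, pos = generate_instance("ACGTTGCAAC", "ACGT", k=30, n=60, alpha=0.05)
    print(column_majority(seqs, pos, 10))
    print(alpha_threshold(4))
```

With the implant positions known, the vote is trivial; the difficulty the article addresses is locating the implanted copies inside the random background in the first place.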

Cited By

  • (2013) Sublinear Time Motif Discovery from Multiple Sequences. Algorithms 6:4 (636-677). DOI: 10.3390/a6040636. Online publication date: 14-Oct-2013

      Published In

      ACM Transactions on Algorithms  Volume 7, Issue 2
      March 2011
      284 pages
      ISSN:1549-6325
      EISSN:1549-6333
      DOI:10.1145/1921659

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 31 March 2011
      Accepted: 01 February 2010
      Revised: 01 February 2010
      Received: 01 April 2008
      Published in TALG Volume 7, Issue 2

      Author Tags

      1. Probability
      2. complexity
      3. motif detection
      4. probabilistic analysis

      Qualifiers

      • Research-article
      • Research
      • Refereed
