Discovering almost any hidden motif from multiple sequences

Authors:

Lusheng Wang

ACM Transactions on Algorithms (TALG), Volume 7, Issue 2

Article No.: 26, Pages 1 - 18

https://doi.org/10.1145/1921659.1921672

Published: 31 March 2011

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif G = g₁g₂…g_m is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. In such a copy b₁b₂…b_m of G, each character b_i is generated at random so that the probability that b_i ≠ g_i is at most α.
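The model above can be simulated in a few lines. The following is a minimal sketch, not the authors' procedure: the function name `implant_motif`, the uniform choice of implant position, the equal sequence length `n`, and the choice to mutate each character with probability exactly `alpha` to a uniformly chosen different character are all assumptions made for illustration. The abstract only requires the mismatch probability to be at most α.

```python
import random


def implant_motif(motif, alphabet, k, n, alpha, seed=0):
    """Generate k random sequences of length n, each with a noisy copy of motif.

    Assumptions (not from the source): every sequence has the same length n,
    the copy overwrites a uniformly random window, and each motif character
    is replaced with probability exactly alpha by a different character.
    """
    if len(alphabet) < 2:
        raise ValueError("alphabet must contain at least 2 characters")
    if len(motif) > n:
        raise ValueError("motif must not be longer than the sequences")

    rng = random.Random(seed)
    m = len(motif)
    sequences, positions = [], []
    for _ in range(k):
        # Background: each character drawn uniformly from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]

        # Approximate copy b_1...b_m of the motif g_1...g_m.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)

        # Implant the copy at a random start position.
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


if __name__ == "__main__":
    seqs, starts = implant_motif("ACGTTGCA", "ACGT", k=4, n=40, alpha=0.05)
    for s, p in zip(seqs, starts):
        print(p, s)
```

A motif discovery algorithm receives only the sequences; the returned start positions are kept here so that a recovered motif can be checked against the planted copies.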

In this article, we develop an efficient algorithm that discovers a hidden motif from a set of sequences for any alphabet Σ with |Σ| ≥ 2 and that is applicable to DNA motif discovery. We prove that for α < 1/8(1 − 1/|Σ|), there exist positive constants c₀, ϵ, and δ₂ such that if there are at least c₀ log n input sequences, then in O(n²/h(log n)^O(1)) time this algorithm finds the motif with probability at least 3/4 for every G ∈ Σ^ρ − Ψ_{ρ,h,ϵ}(Σ), where n is the length of the longest sequence, ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ₂ log n, and Ψ_{ρ,h,ϵ}(Σ) is a small subset containing at most a 2^{−Θ(ϵ² h)} fraction of the sequences in Σ^ρ.

Cited By

Fu B, Fu Y, Xue Y (2013). Sublinear Time Motif Discovery from Multiple Sequences. Algorithms 6:4 (636-677). Online publication date: 14-Oct-2013.
https://doi.org/10.3390/a6040636

Index Terms

Discovering almost any hidden motif from multiple sequences
1. Applied computing
  1. Life and medical sciences
2. Mathematics of computing
  1. Probability and statistics

Published In

ACM Transactions on Algorithms Volume 7, Issue 2

March 2011

284 pages

ISSN:1549-6325

EISSN:1549-6333

DOI:10.1145/1921659

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 March 2011

Accepted: 01 February 2010

Revised: 01 February 2010

Received: 01 April 2008

Published in TALG Volume 7, Issue 2

Qualifiers

Research-article
Research
Refereed
