DOI: 10.1145/2093698.2093709
research-article

Computational inference of difficult word boundaries in DNA languages

Published: 26 October 2011

Abstract

Many applications in molecular and systems biology exploit similarities between DNA and languages to make predictions about cell function. This approach provides structure to an otherwise monotonous sequence of nucleotides. However, one of the major differences between DNA sequences and text is in how semantic units (e.g. words) are distinguished within them. Whereas words and sentences are separated by spaces and punctuation in natural languages, no such markers exist in DNA. Some semantic units in DNA (e.g. genes) can be identified relatively easily and with relatively high accuracy. Other units may have less well understood molecular mechanisms and are therefore harder to identify accurately. In this paper we discuss three machine learning methods to elucidate the boundaries of such difficult units: heuristic approaches use hypothesised models of the mechanism to identify word boundaries; supervised machine learning methods generalise labelled examples of word boundaries to a model that can be used to detect these boundaries; and unsupervised machine learning methods infer a model from unlabelled data. As an example, we use a bacterial transposable element called ISEcp1 that moves DNA segments of variable length. We assess the accuracy of each of the above methods using rediscovery experiments. We demonstrate the power of the methods by examining 9 instances of DNA segments associated with ISEcp1 that lack known boundaries. We identified 6 units that include genes that confer resistance to clinically important antibiotics.
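The abstract does not say how the rediscovery experiments were scored, so the following is only a minimal Python sketch of one plausible reading: hide a known boundary, ask a method to predict it from the sequence alone, and count the prediction as a rediscovery if it lands within a chosen tolerance. The function names, the `tolerance` parameter, the toy motif `GAATTC`, and the example sequences are all illustrative assumptions, not taken from the paper, and the toy predictor is not a model of ISEcp1 biology.

```python
from typing import Callable, Optional, Sequence, Tuple


def rediscovery_accuracy(
    examples: Sequence[Tuple[str, int]],
    predict: Callable[[str], Optional[int]],
    tolerance: int = 0,
) -> float:
    """Fraction of sequences whose known boundary is recovered.

    Each example pairs a DNA string with the index of its known boundary.
    A prediction counts as a hit if it lies within `tolerance` bases.
    """
    if not examples:
        raise ValueError("need at least one labelled example")
    hits = 0
    for sequence, true_boundary in examples:
        predicted = predict(sequence)
        if predicted is not None and abs(predicted - true_boundary) <= tolerance:
            hits += 1
    return hits / len(examples)


def toy_predictor(sequence: str, motif: str = "GAATTC") -> Optional[int]:
    """Hypothetical heuristic: put the boundary just after the first motif hit."""
    start = sequence.find(motif)
    return None if start < 0 else start + len(motif)


# Made-up sequences with made-up "known" boundaries, for illustration only.
examples = [
    ("AAAGAATTCTTT", 9),  # motif at index 3, so the predicted boundary is 9
    ("GAATTCAAAA", 6),    # motif at index 0, so the predicted boundary is 6
    ("CCCCCCCCCC", 5),    # no motif, so no prediction is made
    ("TTGAATTCGG", 4),    # predicted boundary is 8, four bases from the truth
]
accuracy = rediscovery_accuracy(examples, toy_predictor)
```

Any of the three kinds of method could be passed in as `predict`, which keeps the scoring identical across them. Loosening `tolerance` trades exact boundary recovery for approximate localisation.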


Cited By

  • (2020) Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements. Processes 8:6 (638). DOI: 10.3390/pr8060638. Online publication date: 27-May-2020
  • (2019) A systematic review of the application of machine learning in the detection and classification of transposable elements. PeerJ 7 (e8311). DOI: 10.7717/peerj.8311. Online publication date: 18-Dec-2019
  • (2019) Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning. International Journal of Molecular Sciences 20:15 (3837). DOI: 10.3390/ijms20153837. Online publication date: 6-Aug-2019

Published In

ISABEL '11: Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies
October 2011
949 pages
ISBN: 9781450309134
DOI: 10.1145/2093698
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • Universitat Pompeu Fabra
  • IEEE
  • Technical University of Catalonia (UPC), Spain
  • River Publishers
  • CTTC: Technological Center for Telecommunications of Catalonia
  • CTIF: Kyranova Ltd, Center for TeleInFrastruktur

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DNA languages
  2. antibiotic resistance
  3. machine learning
  4. translational bioinformatics

Qualifiers

  • Research-article

Conference

ISABEL '11
Sponsor:
  • Technical University of Catalonia Spain
  • River Publishers
  • CTTC
  • CTIF

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0
Reflects downloads up to 09 Nov 2024
