DOI: 10.1145/2506583.2506602

MarkovBin: An Algorithm to Cluster Metagenomic Reads Using a Mixture Modeling of Hierarchical Distributions

Published: 22 September 2013

Abstract

Metagenomics is the study of the genomic content of microorganisms taken from environmental samples without isolation and cultivation. Recently developed next-generation sequencing (NGS) technologies efficiently generate vast amounts of metagenomic DNA sequences. However, the ultra-high throughput and short read lengths make it harder to separate reads that originate from different species. Existing computational tools for NGS data fall into two groups: supervised methods that use reference databases to classify reads, and unsupervised methods that use oligonucleotide patterns to cluster reads. The former may leave a large fraction of reads unclassified when no closely related reference exists. The latter often rely on long oligonucleotide frequencies and are sensitive to species abundance levels. In this work, we present MarkovBin, a new unsupervised method that can accurately cluster metagenomic reads across various species abundance ratios. We first model the nucleotide sequences as a fixed-order Markov chain. We then propose a hierarchical distribution to model the dependency between paired-end reads. Finally, we employ the mixture model framework to separate reads from different genomes in a metagenomic dataset. Using extensive simulated data, we demonstrate high accuracy and precision in comparison with selected unsupervised read clustering tools. The software is freely available at http://orleans.cs.wayne.edu/MarkovBin.
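The three modeling steps named in the abstract can be illustrated with a minimal sketch. This is not the MarkovBin implementation: it is a generic mixture of fixed-order Markov chains fitted with EM, written under stated assumptions. The function names (`transition_counts`, `cluster_read_pairs`) and the parameters (`order`, `pseudo`, `n_iter`) are hypothetical. The two mates of a pair are assumed to share one cluster label and their transition counts are pooled, which is only a simple stand-in for the hierarchical distribution, whose exact form the abstract does not give. The initial-state term of the chain is omitted, and reads are assumed to contain only A, C, G, and T.

```python
import math
import random
from itertools import product

ALPHABET = "ACGT"


def transition_counts(read, order):
    """Count (context, next base) transitions of one read for a fixed-order chain."""
    counts = {}
    for i in range(order, len(read)):
        key = (read[i - order:i], read[i])
        counts[key] = counts.get(key, 0) + 1
    return counts


def cluster_read_pairs(pairs, n_clusters, order=2, n_iter=30, seed=0, pseudo=1.0):
    """Cluster read pairs with EM on a mixture of fixed-order Markov chains.

    Assumption (not from the paper): both mates of a pair belong to the same
    cluster, so their transition counts are pooled into one observation.
    """
    rng = random.Random(seed)
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]

    # Pool the transition counts of the mates in each pair.
    data = []
    for mates in pairs:
        pooled = {}
        for read in mates:
            for key, n in transition_counts(read, order).items():
                pooled[key] = pooled.get(key, 0) + n
        data.append(pooled)

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in data:
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(w)
        resp.append([x / total for x in w])

    for _ in range(n_iter):
        # M-step: mixing weights (slightly smoothed so that log() stays finite).
        weights = [
            (sum(r[k] for r in resp) + 1e-9) / (len(data) + n_clusters * 1e-9)
            for k in range(n_clusters)
        ]
        # M-step: transition probabilities per cluster, with a pseudocount.
        log_trans = []
        for k in range(n_clusters):
            table = {(c, b): pseudo for c in contexts for b in ALPHABET}
            for r, counts in zip(resp, data):
                for key, n in counts.items():
                    table[key] += r[k] * n
            logp = {}
            for c in contexts:
                row_total = sum(table[(c, b)] for b in ALPHABET)
                for b in ALPHABET:
                    logp[(c, b)] = math.log(table[(c, b)] / row_total)
            log_trans.append(logp)

        # E-step: posterior cluster probabilities, computed in log space.
        for i, counts in enumerate(data):
            scores = [
                math.log(weights[k])
                + sum(n * log_trans[k][key] for key, n in counts.items())
                for k in range(n_clusters)
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            resp[i] = [e / z for e in exps]

    # Hard assignment: the most probable cluster for each pair.
    return [max(range(n_clusters), key=lambda k: r[k]) for r in resp]


if __name__ == "__main__":
    toy_pairs = [("ATATTAATAT", "TTATAATATA"), ("GCGGCCGCGC", "CCGCGGCGCG")]
    print(cluster_read_pairs(toy_pairs, n_clusters=2, order=1))
```

The cluster indices returned by such a sketch are arbitrary, so only the grouping is meaningful, not the label values. Like any EM fit, the result depends on the random initialisation, which is why a `seed` parameter is exposed here.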

Cited By

  • (2016) Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads. Bioinformatics Research and Applications, pages 15-26. DOI: 10.1007/978-3-319-38782-6_2. Online publication date: 27-May-2016
  • (2014) A novel semi-supervised learning approach to analyzing metagenomic reads. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 629-630. DOI: 10.1145/2649387.2660808. Online publication date: 20-Sep-2014

      Published In

      BCB'13: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
      September 2013
      987 pages
      ISBN: 9781450324342
      DOI: 10.1145/2506583

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Hierarchical Distribution
      2. Metagenomics
      3. Mixture Model

      Qualifiers

      • Tutorial
      • Research
      • Refereed limited

      Conference

      BCB'13
      BCB'13: ACM-BCB2013
      September 22 - 25, 2013
      Washington DC, USA

      Acceptance Rates

      BCB'13 Paper Acceptance Rate 43 of 148 submissions, 29%;
      Overall Acceptance Rate 254 of 885 submissions, 29%
