MarkovBin: An Algorithm to Cluster Metagenomic Reads Using a Mixture Modeling of Hierarchical Distributions

Authors:

Tin Chi Nguyen,

Dongxiao Zhu

BCB'13: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

Pages 115 - 123

https://doi.org/10.1145/2506583.2506602

Published: 22 September 2013

Abstract

Metagenomics is the study of the genomic content of microorganisms from environmental samples without isolation and cultivation. Recently developed next-generation sequencing (NGS) technologies efficiently generate vast amounts of metagenomic DNA sequences. However, the ultra-high throughput and short read lengths make the separation of reads from different species more challenging. Among the existing computational tools for NGS data, there are supervised methods that use reference databases to classify reads and unsupervised methods that use oligonucleotide patterns to cluster reads. The former may leave a large fraction of reads unclassified due to the absence of closely related references. The latter often rely on long oligonucleotide frequencies and are sensitive to species abundance levels. In this work, we present MarkovBin, a new unsupervised method that can accurately cluster metagenomic reads across various species abundance ratios. We first model the nucleotide sequences as a fixed-order Markov chain. We then propose a hierarchical distribution to model the dependency between paired-end reads. Finally, we employ the mixture model framework to separate reads from different genomes in a metagenomic dataset. Using extensive simulation data, we demonstrate high accuracy and precision in comparison with selected unsupervised read-clustering tools. The software is freely available at http://orleans.cs.wayne.edu/MarkovBin.
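The three modeling steps named in the abstract can be illustrated with a small, generic sketch. This is not the MarkovBin implementation: it is a plain mixture of fixed-order Markov chains fitted with EM, and every name in it (`transition_counts`, `pair_counts`, `cluster_pairs`, the pseudocount, the iteration count) is a hypothetical choice made for illustration. In particular, the abstract does not give the form of the hierarchical distribution for paired-end reads, so the sketch substitutes a much simpler assumption: the two mates of a pair share one latent cluster label and are otherwise independent.

```python
import math
import random
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"
BASES = set(ALPHABET)

def transition_counts(read, order):
    """Count (context, next base) occurrences in one read, skipping non-ACGT windows."""
    counts = defaultdict(int)
    for i in range(order, len(read)):
        window = read[i - order:i + 1]
        if set(window) <= BASES:
            counts[(window[:order], window[order])] += 1
    return counts

def pair_counts(pair, order):
    """Pool the counts of both mates (assumption: mates share one cluster label)."""
    merged = defaultdict(int)
    for mate in pair:
        for key, n in transition_counts(mate, order).items():
            merged[key] += n
    return merged

def m_step(all_counts, resp, k, order, pseudo=1.0):
    """Re-estimate mixing weights and per-cluster transition log-probabilities."""
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    weights, models = [], []
    for c in range(k):
        totals = {(ctx, b): pseudo for ctx in contexts for b in ALPHABET}
        for counts, r in zip(all_counts, resp):
            for key, n in counts.items():
                totals[key] += r[c] * n
        log_probs = {}
        for ctx in contexts:
            row = sum(totals[(ctx, b)] for b in ALPHABET)
            for b in ALPHABET:
                log_probs[(ctx, b)] = math.log(totals[(ctx, b)] / row)
        models.append(log_probs)
        # A tiny floor keeps every weight strictly positive before taking logs.
        weights.append((sum(r[c] for r in resp) + 1e-9) / (len(resp) + k * 1e-9))
    return weights, models

def e_step(all_counts, weights, models):
    """Compute posterior cluster responsibilities for every read pair."""
    resp = []
    for counts in all_counts:
        logs = [math.log(w) + sum(n * m[key] for key, n in counts.items())
                for w, m in zip(weights, models)]
        top = max(logs)  # subtract the maximum for numerical stability
        exps = [math.exp(v - top) for v in logs]
        z = sum(exps)
        resp.append([e / z for e in exps])
    return resp

def cluster_pairs(pairs, k=2, order=1, iters=30, seed=0):
    """Cluster paired-end reads with EM; returns one hard label per pair."""
    rng = random.Random(seed)
    all_counts = [pair_counts(p, order) for p in pairs]
    resp = []
    for _ in pairs:  # random soft initial assignment
        raw = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(raw)
        resp.append([v / total for v in raw])
    for _ in range(iters):
        weights, models = m_step(all_counts, resp, k, order)
        resp = e_step(all_counts, weights, models)
    return [max(range(k), key=lambda c: r[c]) for r in resp]
```

Calling `cluster_pairs(pairs, k=2, order=1)` on a list of `(mate1, mate2)` string tuples returns one integer label per pair. Pooling the mates' transition counts is the cheapest way to let paired-end information influence the assignment. Because the chain has no initial-state distribution, clusters are distinguished only by how the next base depends on the preceding context.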

Cited By

Wang L, Zhu D, Li Y, Dong M (2016). Poisson-Markov Mixture Model and Parallel Algorithm for Binning Massive and Heterogenous DNA Sequencing Reads. Bioinformatics Research and Applications, pp. 15-26. Online publication date: 27-May-2016.
https://doi.org/10.1007/978-3-319-38782-6_2

Geng Y, Chen X, Zhu D, Baldi P, Wang W (2014). A novel semi-supervised learning approach to analyzing metagenomic reads. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 629-630. Online publication date: 20-Sep-2014.
https://dl.acm.org/doi/10.1145/2649387.2660808

Index Terms

MarkovBin: An Algorithm to Cluster Metagenomic Reads Using a Mixture Modeling of Hierarchical Distributions
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Mathematics of computing
  1. Probability and statistics

Information

Published In

BCB'13: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

September 2013

987 pages

ISBN:9781450324342

DOI:10.1145/2506583

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Tutorial
Research
Refereed limited

Conference

BCB'13

Sponsor:

SIGBio

BCB'13: ACM-BCB2013

September 22 - 25, 2013

Washington DC, USA

Acceptance Rates

BCB'13 Paper Acceptance Rate 43 of 148 submissions, 29%;

Overall Acceptance Rate 254 of 885 submissions, 29%

Bibliometrics & Citations

Total Citations: 2
Total Downloads: 190
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Jan 2025
