A Two-Way Bayesian Mixture Model for Clustering in Metagenomics

Prabhakara, Shruthi; Acharya, Raj

doi:10.1007/978-3-642-24855-9_3

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7036)

Included in the following conference series:

IAPR International Conference on Pattern Recognition in Bioinformatics

Abstract

We present a new and efficient Bayesian mixture model, based on Poisson and Multinomial distributions, for clustering metagenomic reads by their species of origin. We use the relative abundance of different words along a genome to distinguish reads from different species. The distribution of word counts within a genome is accurately represented by a Poisson distribution. The Multinomial mixture model is derived as a standardized Poisson mixture model. The Bayesian network efficiently encodes the conditional dependencies between word counts in a DNA sequence that arise from overlapping words, and hence is most consistent with the data. We present a two-way mixture model that captures the high dimensionality and sparsity associated with the data. Our method can cluster reads as short as 50 bp with accuracy over 80%. The Bayesian mixture models clearly outperform their Naive Bayes counterparts on datasets of varying abundances, divergences and read lengths. Our method attains accuracy comparable to that of the state-of-the-art Scimm and converges at least 5 times faster than Scimm in all the cases tested. This reduction in the time needed to obtain accurate results justifies the use of our proposed method to evaluate large metagenome datasets.
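To make the idea of clustering reads by word counts concrete, the following is a minimal sketch, not the authors' implementation. It fits a plain multinomial mixture with EM, treating word counts as independent given the cluster. That is the Naive Bayes style baseline the abstract contrasts with, so it omits both the Bayesian network over overlapping words and the two-way grouping of words. The function names `kmer_counts` and `multinomial_mixture_em`, the word length `k`, and the smoothing constant `alpha` are illustrative assumptions.

```python
import math
import random
from itertools import product


def kmer_counts(read, k=2):
    """Count overlapping words (k-mers) in a read, in a fixed lexicographic order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip words containing ambiguous bases such as N
            counts[j] += 1
    return counts


def multinomial_mixture_em(X, n_clusters=2, n_iter=50, seed=0, alpha=1.0):
    """EM for a mixture of multinomials over word-count vectors X.

    Assumption (hypothetical baseline): counts are independent given the cluster.
    Returns (hard labels, mixing weights, per-cluster word probabilities).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])
    weights, theta = [], []
    for _ in range(n_iter):
        # M-step: mixing weights and smoothed word probabilities per cluster.
        weights = [sum(resp[i][c] for i in range(n)) / n for c in range(n_clusters)]
        theta = []
        for c in range(n_clusters):
            raw = [alpha + sum(resp[i][c] * X[i][j] for i in range(n)) for j in range(d)]
            total = sum(raw)
            theta.append([v / total for v in raw])
        # E-step: responsibilities from log-likelihoods (max subtracted for stability).
        for i in range(n):
            logp = [
                math.log(weights[c] + 1e-300)
                + sum(x * math.log(t) for x, t in zip(X[i], theta[c]))
                for c in range(n_clusters)
            ]
            m = max(logp)
            expd = [math.exp(v - m) for v in logp]
            total = sum(expd)
            resp[i] = [v / total for v in expd]
    labels = [max(range(n_clusters), key=lambda c: r[c]) for r in resp]
    return labels, weights, theta
```

A typical call builds one count vector per read, for example `X = [kmer_counts(r, k=2) for r in reads]`, and then runs `multinomial_mixture_em(X, n_clusters=2)`. The multinomial form connects to the Poisson model through a standard fact: independent Poisson counts, conditioned on their total, follow a multinomial distribution whose probabilities are the normalized Poisson rates.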

Author information

Authors and Affiliations

Pennsylvania State University, University Park, PA, 16801, USA
Shruthi Prabhakara & Raj Acharya

Editor information

Editors and Affiliations

Pattern Recognition Laboratory, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Marco Loog, Marcel J. T. Reinders & Dick de Ridder
Netherlands Cancer Institute, Bioinformatics and Statistics, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
Lodewyk Wessels

Cite this paper

Prabhakara, S., Acharya, R. (2011). A Two-Way Bayesian Mixture Model for Clustering in Metagenomics. In: Loog, M., Wessels, L., Reinders, M.J.T., de Ridder, D. (eds) Pattern Recognition in Bioinformatics. PRIB 2011. Lecture Notes in Computer Science, vol 7036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24855-9_3

DOI: https://doi.org/10.1007/978-3-642-24855-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24854-2
Online ISBN: 978-3-642-24855-9
eBook Packages: Computer Science, Computer Science (R0)
