Abstract
We present an efficient Bayesian mixture model, based on Poisson and Multinomial distributions, for clustering metagenomic reads by their species of origin. We use the relative abundance of different words along a genome to distinguish reads from different species. The distribution of word counts within a genome is accurately represented by a Poisson distribution, and the Multinomial mixture model is derived as a standardized Poisson mixture model. The Bayesian network efficiently encodes the conditional dependencies between word counts in a DNA sequence that arise from overlapping words, and hence is most consistent with the data. We also present a two-way mixture model that captures the high dimensionality and sparsity of the data. Our method can cluster reads as short as 50 bp with accuracy over 80%. The Bayesian mixture models clearly outperform their Naive Bayes counterparts on datasets of varying abundances, divergences and read lengths. Our method attains accuracy comparable to that of the state-of-the-art Scimm and converges at least 5 times faster than Scimm in all cases tested. This reduction in the time needed to obtain accurate results is significant and justifies the use of our method for evaluating large metagenomic datasets.
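To make the idea of clustering reads by Poisson-distributed word counts concrete, here is a minimal sketch in Python. It is not the method of the paper: it treats word counts as independent given the cluster (the Naive Bayes baseline the abstract compares against) and omits both the Bayesian network over overlapping words and the two-way structure. The function names `kmer_counts` and `poisson_mixture_em`, the word length `k=2`, the smoothing constants and the synthetic reads are all assumptions made for illustration.

```python
import math
import random
from itertools import product


def kmer_counts(read, k=2):
    """Count overlapping words of length k in a read, in a fixed word order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip words containing ambiguous bases such as N
            counts[j] += 1
    return counts


def poisson_mixture_em(X, n_clusters=2, n_iter=50, seed=0):
    """EM for a mixture of independent Poissons (illustrative assumption only)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Start from random soft assignments of reads to clusters.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])
    for _ in range(n_iter):
        # M-step: mixing weights and per-word Poisson rates (lightly smoothed).
        nk = [sum(resp[i][c] for i in range(n)) for c in range(n_clusters)]
        weights = [nk[c] / n for c in range(n_clusters)]
        rates = [
            [(sum(resp[i][c] * X[i][w] for i in range(n)) + 1e-2) / (nk[c] + 1e-2)
             for w in range(d)]
            for c in range(n_clusters)
        ]
        # E-step: posterior cluster probabilities, computed in log space.
        # The log(x!) term is the same for every cluster, so it is dropped.
        for i in range(n):
            logp = [
                math.log(weights[c] + 1e-12)
                + sum(X[i][w] * math.log(rates[c][w]) - rates[c][w] for w in range(d))
                for c in range(n_clusters)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp[i] = [v / total for v in p]
    labels = [max(range(n_clusters), key=lambda c: resp[i][c]) for i in range(n)]
    return weights, rates, labels


# Usage on synthetic 50 bp reads from two artificial "species" (AT-only, GC-only).
random.seed(1)
at_reads = ["".join(random.choice("AT") for _ in range(50)) for _ in range(20)]
gc_reads = ["".join(random.choice("GC") for _ in range(50)) for _ in range(20)]
X = [kmer_counts(r) for r in at_reads + gc_reads]
weights, rates, labels = poisson_mixture_em(X)
```

Normalizing each cluster's rates by their sum gives word probabilities, which is one way to read the abstract's statement that the Multinomial model is a standardized Poisson model: independent Poisson counts conditioned on their total follow a Multinomial distribution.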
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Prabhakara, S., Acharya, R. (2011). A Two-Way Bayesian Mixture Model for Clustering in Metagenomics. In: Loog, M., Wessels, L., Reinders, M.J.T., de Ridder, D. (eds) Pattern Recognition in Bioinformatics. PRIB 2011. Lecture Notes in Computer Science, vol 7036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24855-9_3
DOI: https://doi.org/10.1007/978-3-642-24855-9_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24854-2
Online ISBN: 978-3-642-24855-9
eBook Packages: Computer Science, Computer Science (R0)