Abstract
Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
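The tree-based partitioning step described in the abstract can be illustrated with a small sketch. This is not Deepurify's implementation: it assumes each contig already carries a predicted lineage (a list of taxon names from the highest rank down), and the function names `build_lineage_tree`, `collect_contigs` and `partition_at_depth` are hypothetical. It only shows how contigs placed in a lineage tree might be split into candidate sub-MAGs at a chosen depth; the quality scoring used to choose among candidate partitions is left out.

```python
def build_lineage_tree(contig_lineages):
    """Nest contigs under their predicted lineage, one tree level per rank."""
    tree = {"children": {}, "contigs": []}
    for contig, lineage in contig_lineages.items():
        node = tree
        for taxon in lineage:
            node = node["children"].setdefault(
                taxon, {"children": {}, "contigs": []}
            )
        node["contigs"].append(contig)
    return tree


def collect_contigs(node):
    """Gather every contig stored at or below a node (depth-first)."""
    contigs = list(node["contigs"])
    for child in node["children"].values():
        contigs.extend(collect_contigs(child))
    return contigs


def partition_at_depth(tree, depth):
    """Split a MAG into candidate sub-MAGs, one per subtree at `depth`."""
    sub_mags = {}

    def walk(node, path):
        # Stop at the requested depth, or earlier if the branch ends.
        if len(path) == depth or not node["children"]:
            sub_mags[tuple(path)] = sorted(collect_contigs(node))
            return
        # Contigs whose lineage ends above `depth` form their own group.
        if node["contigs"]:
            sub_mags[tuple(path)] = sorted(node["contigs"])
        for taxon, child in node["children"].items():
            walk(child, path + [taxon])

    walk(tree, [])
    return sub_mags


# Hypothetical lineages for four contigs binned into one MAG.
contig_lineages = {
    "contig_1": ["Proteobacteria", "Gammaproteobacteria"],
    "contig_2": ["Proteobacteria", "Gammaproteobacteria"],
    "contig_3": ["Proteobacteria", "Alphaproteobacteria"],
    "contig_4": ["Firmicutes", "Bacilli"],
}
tree = build_lineage_tree(contig_lineages)
by_phylum = partition_at_depth(tree, 1)
by_class = partition_at_depth(tree, 2)
```

In this toy input, splitting at depth 1 separates the single Firmicutes contig from the three Proteobacteria contigs, and depth 2 further separates the two proteobacterial classes. A real decontamination tool would additionally need to score each candidate sub-MAG for completeness and contamination before deciding which partition to keep.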
Data availability
The microbial representative genomes and their associated taxonomic lineages were downloaded from the proGenomes v.2.1 database. The GTDB r202 was used to annotate the reference genomes. The SIM1 test set is available via Zenodo at https://zenodo.org/record/8343498 (ref. 51). The SIM2 test set is available via Zenodo at https://zenodo.org/records/11608439 (ref. 52). The CAMI I short reads were downloaded from the 1st CAMI Challenge Dataset 1 CAMI_low, 1st CAMI Challenge Dataset 2 CAMI_medium and 1st CAMI Challenge Dataset 3 CAMI_high from https://data.cami-challenge.org/participate. The NCBI SRA accessions of seven soil samples are SRR25158210, SRR25158221, SRR25158244, SRR25158253, SRR25158281, SRR25158363 and SRR25158536; those of the three freshwater samples are ERR4195020, ERR9631077 and SRR26420192; those of the three plant samples are SRR10968246, SRR14308228 and SRR14308230. The 11 ocean samples are from ref. 30. The human faecal metagenomic sequencing reads of the IBS-D cohort were downloaded from China National GeneBank with accession number CNP0000334.
Code availability
The source code is freely available at https://github.com/ericcombiolab/Deepurify/ (ref. 53) under an MIT licence. The versions of the software used in the study are provided in Supplementary Note 17.
References
Bernard, G., Pathmanathan, J. S., Lannes, R., Lopez, P. & Bapteste, E. Microbial dark matter investigations: how microbial studies transform biological knowledge and empirically sketch a logic of scientific discovery. Genome Biol. Evol. 10, 707–715 (2018).
Dam, H. T., Vollmers, J., Sobol, M. S., Cabezas, A. & Kaster, A.-K. Targeted cell sorting combined with single cell genomics captures low abundant microbial dark matter with higher sensitivity than metagenomics. Front. Microbiol. 11, 1377 (2020).
Kaster, A.-K. & Sobol, M. S. Microbial single-cell omics: the crux of the matter. Appl. Microbiol. Biotechnol. 104, 8209–8220 (2020).
Pratscher, J., Vollmers, J., Wiegand, S., Dumont, M. G. & Kaster, A.-K. Unravelling the identity, metabolic potential and global biogeography of the atmospheric methane-oxidizing upland soil cluster α. Environ. Microbiol. 20, 1016–1029 (2018).
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Liang, K.-C. & Sakakibara, Y. MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly. BMC Bioinforma. 22, 427 (2021).
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
Wu, Y.-W., Tang, Y.-H., Tringe, S. G., Simmons, B. A. & Singer, S. W. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2, 26 (2014).
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Vollmers, J., Wiegand, S. & Kaster, A.-K. Comparing and evaluating metagenome assembly tools from a microbiologist's perspective - not only size matters! PLoS ONE 12, e0169662 (2017).
Nayfach, S. et al. A genomic catalog of Earth's microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019).
Mattock, J. & Watson, M. A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination. Nat. Methods 20, 1170–1173 (2023).
Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Nayfach, S., Shi, Z. J., Seshadri, R., Pollard, K. S. & Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019).
Vollmers, J., Wiegand, S., Lenk, F. & Kaster, A.-K. How clear is our current view on microbial dark matter? (Re-) Assessing public MAG & SAG datasets with MDMcleaner. Nucleic Acids Res. 50, e76 (2022).
Drillon, G., Champeimont, R., Oteri, F., Fischer, G. & Carbone, A. Phylogenetic reconstruction based on synteny block and gene adjacencies. Mol. Biol. Evol. 37, 2747–2762 (2020).
Periwal, V. & Scaria, V. Insights into structural variations and genome rearrangements in prokaryotic genomes. Bioinformatics 31, 1–9 (2015).
Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation–a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 178 (2021).
Pan, S., Zhao, X.-M. & Coelho, L. P. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 39, i21–i29 (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 139, 8748–8763 (PMLR, 2021).
Wagstaff, K. et al. Constrained k-means clustering with background knowledge. In Proc. 18th International Conference on Machine Learning 1, 577–584 (Morgan Kaufmann, 2001).
Chklovski, A., Parks, D. H., Woodcroft, B. J. & Tyson, G. W. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).
Ma, B. et al. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat. Commun. 14, 7318 (2023).
Duncan, A. et al. Metagenome-assembled genomes of phytoplankton microbiomes from the Arctic and Atlantic oceans. Microbiome 10, 67 (2022).
Faist, H. et al. Potato root-associated microbiomes adapt to combined water and nutrient limitation and have a plant genotype-specific role for plant stress mitigation. Environ. Microbiome 18, 18 (2023).
Tláskal, V. et al. Metagenomes, metatranscriptomes and microbiomes of naturally decomposing deadwood. Sci. Data 8, 198 (2021).
Buck, M. et al. Comprehensive dataset of shotgun metagenomes from oxygen stratified freshwater lakes and ponds. Sci. Data 8, 131 (2021).
Kavagutti, V. S. et al. High-resolution metagenomic reconstruction of the freshwater spring bloom. Microbiome 11, 15 (2023).
Maestre-Carballa, L., Navarro-López, V. & Martinez-Garcia, M. City-scale monitoring of antibiotic resistance genes by digital PCR and metagenomics. Environ. Microbiome 19, 16 (2024).
Zhao, L. et al. A clostridia-rich microbiota enhances bile acid excretion in diarrhea-predominant irritable bowel syndrome. J. Clin. Invest. 130, 438–450 (2020).
Rodriguez-R, L. M. & Konstantinidis, K. T. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30, 629–635 (2014).
Lai, S. et al. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol. 23, 242 (2022).
Derakhshani, H., Bernier, S. P., Marko, V. A. & Surette, M. G. Completion of draft bacterial genomes by long-read sequencing of synthetic genomic pools. BMC Genomics 21, 519 (2020).
Mende, D. R. et al. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics 38, 5315–5316 (2022).
Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
Li, K. et al. UniFormer: unified transformer for efficient spatiotemporal representation learning. Preprint at https://doi.org/10.48550/arXiv.2201.04676 (2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning 97, 6105–6114 (PMLR, 2019).
Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution. Preprint at https://doi.org/10.48550/arXiv.2209.07947 (2022).
Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M. & Hu, S.-M. Visual attention network. Comput. Vis. Media 9, 733–752 (2023).
Wang, H. et al. DeepNet: scaling transformers to 1,000 layers. IEEE Trans. Pattern Anal. Mach. Intell. 46, 6761–6774 (2024).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Preprint at https://doi.org/10.48550/arXiv.1708.02002 (2018).
Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 1, v.1. Zenodo https://doi.org/10.5281/zenodo.8343497 (2023).
Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 2, v.2. Zenodo https://doi.org/10.5281/zenodo.8343505 (2024).
Zou, B. A deep multi-modal deep language model for contaminant removal from metagenome-assembled genomes (code). Zenodo https://doi.org/10.5281/zenodo.11919065 (2024).
Acknowledgements
The design of the study and the collection, analysis and interpretation of the data were partially supported by the Young Collaborative Research grant (no. C2004-23Y), HMRF (grant no. 11221026), the open project of BGI-Shenzhen, Shenzhen 518000, China (grant no. BGIRSZ20220014) and HKBU Start-up Grant Tier 2 (grant no. RC-SGT2/19-20/SCI/007). We also thank the BGI Research-Shenzhen, the Research Committee of Hong Kong Baptist University, and the Interdisciplinary Research Clusters Matching Scheme for their kind support of this project.
Author information
Authors and Affiliations
Contributions
L.Z. conceived the study. B.Z. designed and implemented the Deepurify algorithms. L.Z. and B.Z. conceived the experiments. B.Z., Y.D. and Z.Z. conducted the experiments. B.Z. and J.W. analysed the results. B.Z. and L.Z. wrote the paper. Y.H. and X.F. revised the paper and supported the project. K.C.C. and S.S. contributed computational resources. All authors reviewed the paper.
Corresponding author
Ethics declarations
Competing interests
K.C.C. and S.S. are employees of Nvidia Corporation (NVIDIA). The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Antonio Pedro Camargo and Luis Coelho for their contribution to the peer review of this work.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–10, Notes 1–18 and Tables 1–4.
Supplementary Table
Supplementary Tables 5–9.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zou, B., Wang, J., Ding, Y. et al. A multi-modal deep language model for contaminant removal from metagenome-assembled genomes. Nat Mach Intell 6, 1245–1255 (2024). https://doi.org/10.1038/s42256-024-00908-5
DOI: https://doi.org/10.1038/s42256-024-00908-5