A multi-modal deep language model for contaminant removal from metagenome-assembled genomes

Zou, Bohao; Wang, Jingjing; Ding, Yi; Zhang, Zhenmiao; Huang, Yufen; Fang, Xiaodong; Cheung, Ka Chun; See, Simon; Zhang, Lu

doi:10.1038/s42256-024-00908-5

Article
Published: 07 October 2024


Bohao Zou¹,
Jingjing Wang¹,
Yi Ding¹,
Zhenmiao Zhang (ORCID: orcid.org/0000-0003-3748-1664)¹,
Yufen Huang²,
Xiaodong Fang²,³,
Ka Chun Cheung (ORCID: orcid.org/0000-0002-2939-4686)⁴,
Simon See (ORCID: orcid.org/0000-0002-4958-9237)⁴ &
Lu Zhang (ORCID: orcid.org/0000-0002-2794-7371)¹,⁵

Nature Machine Intelligence volume 6, pages 1245–1255 (2024)


Abstract

Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
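The abstract describes contrastive learning that matches genomic sequences with their taxonomic lineages. As a rough illustration only, the sketch below computes a symmetric, CLIP-style contrastive loss over paired embeddings. The function names, the temperature value and the toy two-dimensional embeddings are assumptions made for this example; the encoders and loss actually used by Deepurify are defined in the paper and its repository and may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(seq_emb, tax_emb, temperature=0.07):
    """Hypothetical helper: seq_emb[i] and tax_emb[i] form a matched pair.

    The temperature of 0.07 is an illustrative assumption, not a value
    taken from the paper.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    tax = [l2_normalize(v) for v in tax_emb]
    # Similarity of every sequence embedding to every lineage embedding.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in tax]
        for s in seq
    ]
    # Transpose to score lineage -> sequence as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy embeddings (made up): two sequences and two lineages.
sequences = [[1.0, 0.0], [0.0, 1.0]]
matched_lineages = [[1.0, 0.0], [0.0, 1.0]]
swapped_lineages = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(sequences, matched_lineages)
loss_swapped = symmetric_contrastive_loss(sequences, swapped_lineages)
```

In this toy setting the loss is close to zero when each sequence embedding points in the same direction as its own lineage embedding and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.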


**Fig. 1: Deepurify training procedure.**

**Fig. 2: The workflow of Deepurify for MAG decontamination.**

**Fig. 3: The averaged balanced macro F1 score across various contamination rates.**

**Fig. 4: The averaged balanced macro F1 score across various contamination rates.**

**Fig. 5: Distribution of the completeness and contamination of MAGs before and after decontamination on 227 human faecal samples by CheckM2.**
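Completeness and contamination estimates such as those reported by CheckM2 are commonly turned into quality tiers. The sketch below uses the widely cited simplification of the MIMAG criteria (high quality: completeness above 90% and contamination below 5%; medium quality: completeness of at least 50% and contamination below 10%) and ignores the rRNA and tRNA requirements of the full standard. Whether the paper applies exactly these cut-offs is an assumption here, and the function names and example values are hypothetical.

```python
from collections import Counter


def mag_quality(completeness, contamination):
    """Assign a simplified MIMAG-style tier from percentages in [0, 100].

    Assumed thresholds (simplified MIMAG, without rRNA/tRNA checks):
    high   -> completeness > 90 and contamination < 5
    medium -> completeness >= 50 and contamination < 10
    low    -> everything else (a simplification for this sketch)
    """
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"


def count_tiers(mags):
    """Count tiers for an iterable of (completeness, contamination) pairs."""
    return Counter(mag_quality(comp, cont) for comp, cont in mags)


# Made-up values for illustration: one contaminated MAG and two hypothetical
# sub-MAGs that a decontamination step might produce from it.
before = [(95.0, 12.0)]
after = [(93.0, 2.0), (55.0, 1.5)]

tiers_before = count_tiers(before)
tiers_after = count_tiers(after)
```

Comparing `tiers_before` with `tiers_after` gives the kind of before-and-after tally that the figure summarizes as a distribution.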


Data availability

The microbial representative genomes and their associated taxonomic lineages were downloaded from the proGenomes v.2.1 database. The GTDB r202 was used to annotate the reference genomes. The SIM₁ test set is available via Zenodo at https://zenodo.org/record/8343498 (ref. ⁵¹). The SIM₂ test set is available via Zenodo at https://zenodo.org/records/11608439 (ref. ⁵²). The CAMI I short reads were downloaded from the 1st CAMI Challenge Dataset 1 CAMI_low, 1st CAMI Challenge Dataset 2 CAMI_medium and 1st CAMI Challenge Dataset 3 CAMI_high from https://data.cami-challenge.org/participate. The NCBI SRA accessions of the seven soil samples are SRR25158210, SRR25158221, SRR25158244, SRR25158253, SRR25158281, SRR25158363 and SRR25158536; those of the three freshwater samples are ERR4195020, ERR9631077 and SRR26420192; those of the three plant samples are SRR10968246, SRR14308228 and SRR14308230. The 11 ocean samples are from ref. ³⁰. The human faecal metagenomic sequencing reads of the IBS-D cohort were downloaded from China National GeneBank with accession number CNP0000334.

Code availability

The source code is freely available at https://github.com/ericcombiolab/Deepurify/ (ref. ⁵³) under an MIT licence. The versions of the software used in the study are provided in Supplementary Note 17.

References

1. Bernard, G., Pathmanathan, J. S., Lannes, R., Lopez, P. & Bapteste, E. Microbial dark matter investigations: how microbial studies transform biological knowledge and empirically sketch a logic of scientific discovery. Genome Biol. Evol. 10, 707–715 (2018).
2. Dam, H. T., Vollmers, J., Sobol, M. S., Cabezas, A. & Kaster, A.-K. Targeted cell sorting combined with single cell genomics captures low abundant microbial dark matter with higher sensitivity than metagenomics. Front. Microbiol. 11, 1377 (2020).
3. Kaster, A.-K. & Sobol, M. S. Microbial single-cell omics: the crux of the matter. Appl. Microbiol. Biotechnol. 104, 8209–8220 (2020).
4. Pratscher, J., Vollmers, J., Wiegand, S., Dumont, M. G. & Kaster, A.-K. Unravelling the identity, metabolic potential and global biogeography of the atmospheric methane-oxidizing upland soil cluster α. Environ. Microbiol. 20, 1016–1029 (2018).
5. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaspades: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
6. Liang, K.-C. & Sakakibara, Y. Metavelvet-dl: a metavelvet deep learning extension for de novo metagenome assembly. BMC Bioinforma. 22, 427 (2021).
7. Kolmogorov, M. et al. metaflye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
8. Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
9. Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
10. Wu, Y.-W., Tang, Y.-H., Tringe, S. G., Simmons, B. A. & Singer, S. W. Maxbin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2, 26 (2014).
11. Kang, D. D. et al. Metabat 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
12. Vollmers, J., Wiegand, S. & Kaster, A.-K. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective-not only size matters! PLoS ONE 12, e0169662 (2017).
13. Nayfach, S. et al. A genomic catalog of earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
14. Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019).
15. Mattock, J. & Watson, M. A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination. Nat. Methods 20, 1170–1173 (2023).
16. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
17. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
18. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
19. Nayfach, S., Shi, Z. J., Seshadri, R., Pollard, K. S. & Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019).
20. Vollmers, J., Wiegand, S., Lenk, F. & Kaster, A.-K. How clear is our current view on microbial dark matter? (Re-) Assessing public MAG & SAG datasets with MDMcleaner. Nucleic Acids Res. 50, e76 (2022).
21. Drillon, G., Champeimont, R., Oteri, F., Fischer, G. & Carbone, A. Phylogenetic reconstruction based on synteny block and gene adjacencies. Mol. Biol. Evol. 37, 2747–2762 (2020).
22. Periwal, V. & Scaria, V. Insights into structural variations and genome rearrangements in prokaryotic genomes. Bioinformatics 31, 1–9 (2015).
23. Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation – a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
24. Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 178 (2021).
25. Pan, S., Zhao, X.-M. & Coelho, L. P. Semibin2: self-supervised contrastive learning leads to better mags for short-and long-read sequencing. Bioinformatics 39, i21–i29 (2023).
26. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 139, 8748–8763 (PMLR, 2021).
27. Wagstaff, K. et al. Constrained k-means clustering with background knowledge. In Proc. 18th International Conference on Machine Learning 1, 577–584 (Morgan Kaufmann, 2001).
28. Chklovski, A., Parks, D. H., Woodcroft, B. J. & Tyson, G. W. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).
29. Ma, B. et al. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat. Commun. 14, 7318 (2023).
30. Duncan, A. et al. Metagenome-assembled genomes of phytoplankton microbiomes from the arctic and atlantic oceans. Microbiome 10, 67 (2022).
31. Faist, H. et al. Potato root-associated microbiomes adapt to combined water and nutrient limitation and have a plant genotype-specific role for plant stress mitigation. Environ. Microbiome 18, 18 (2023).
32. Tláskal, V. et al. Metagenomes, metatranscriptomes and microbiomes of naturally decomposing deadwood. Sci. Data 8, 198 (2021).
33. Buck, M. et al. Comprehensive dataset of shotgun metagenomes from oxygen stratified freshwater lakes and ponds. Sci. Data 8, 131 (2021).
34. Kavagutti, V. S. et al. High-resolution metagenomic reconstruction of the freshwater spring bloom. Microbiome 11, 15 (2023).
35. Maestre-Carballa, L., Navarro-López, V. & Martinez-Garcia, M. City-scale monitoring of antibiotic resistance genes by digital pcr and metagenomics. Environ. Microbiome 19, 16 (2024).
36. Zhao, L. et al. A clostridia-rich microbiota enhances bile acid excretion in diarrhea-predominant irritable bowel syndrome. J. Clin. Invest. 130, 438–450 (2020).
37. Rodriguez-R, L. M. & Konstantinidis, K. T. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30, 629–635 (2014).
38. Lai, S. et al. metamic: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol. 23, 242 (2022).
39. Derakhshani, H., Bernier, S. P., Marko, V. A. & Surette, M. G. Completion of draft bacterial genomes by long-read sequencing of synthetic genomic pools. BMC Genomics 21, 519 (2020).
40. Mende, D. R. et al. progenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).
41. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
42. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics 31, 1674–1676 (2015).
43. Li, K. et al. Uniformer: unified transformer for efficient spatiotemporal representation learning. Preprint at https://doi.org/10.48550/arXiv.2201.04676 (2022).
44. Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature 596, 583–589 (2021).
45. Tan, M. & Le, Q. Efficientnet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning 97, 6105–6114 (PMLR, 2019).
46. Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution. Preprint at https://doi.org/10.48550/arXiv.2209.07947 (2022).
47. Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M. & Hu, S.-M. Visual attention network. Comput. Vis. Media 9, 733–752 (2023).
48. Wang, H. et al. Deepnet: scaling transformers to 1,000 layers. IEEE Trans. Pattern Anal. Mach. Intell. 46, 6761–6774 (2024).
49. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Preprint at https://doi.org/10.48550/arXiv.1708.02002 (2018).
50. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
51. Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 1, v.1. Zenodo https://doi.org/10.5281/zenodo.8343497 (2023).
52. Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 2, v.2. Zenodo https://doi.org/10.5281/zenodo.8343505 (2024).
53. Zou, B. A deep multi-modal deep language model for contaminant removal from metagenome-assembled genomes (code). Zenodo https://doi.org/10.5281/zenodo.11919065 (2024).


Acknowledgements

The design of the study and the collection, analysis and interpretation of the data were partially supported by the Young Collaborative Research grant (no. C2004-23Y), HMRF (grant no. 11221026), the open project of BGI-Shenzhen, Shenzhen 518000, China (grant no. BGIRSZ20220014) and HKBU Start-up Grant Tier 2 (grant no. RC-SGT2/19-20/SCI/007). We also thank the BGI Research-Shenzhen, the Research Committee of Hong Kong Baptist University, and the Interdisciplinary Research Clusters Matching Scheme for their kind support of this project.

Author information

Authors and Affiliations

Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
Bohao Zou, Jingjing Wang, Yi Ding, Zhenmiao Zhang & Lu Zhang
BGI Research, Shenzhen, China
Yufen Huang & Xiaodong Fang
BGI Research, Sanya, China
Xiaodong Fang
NVIDIA AI Technology Center, NVIDIA, Hong Kong, China
Ka Chun Cheung & Simon See
Institute for Research and Continuing Education, Hong Kong Baptist University, Hong Kong, China
Lu Zhang

Contributions

L.Z. conceived the study. B.Z. designed and implemented the Deepurify algorithms. L.Z. and B.Z. conceived the experiments. B.Z., Y.D. and Z.Z. conducted the experiments. B.Z. and J.W. analysed the results. B.Z. and L.Z. wrote the paper. Y.H. and X.F. revised the paper and supported the project. K.C.C. and S.S. contributed computational resources. All authors reviewed the paper.

Corresponding author

Correspondence to Lu Zhang.

Ethics declarations

Competing interests

K.C.C. and S.S. are employees of Nvidia Corporation (NVIDIA). The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Antonio Pedro Camargo and Luis Coelho for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Notes 1–18 and Tables 1–4.

Reporting Summary

Supplementary Table

Supplementary Tables 5–9.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zou, B., Wang, J., Ding, Y. et al. A multi-modal deep language model for contaminant removal from metagenome-assembled genomes. Nat Mach Intell 6, 1245–1255 (2024). https://doi.org/10.1038/s42256-024-00908-5


Received: 08 October 2023
Accepted: 05 September 2024
Published: 07 October 2024
Issue Date: October 2024
DOI: https://doi.org/10.1038/s42256-024-00908-5