
A multi-modal deep language model for contaminant removal from metagenome-assembled genomes

Abstract

Metagenome-assembled genomes (MAGs) offer valuable insights into the exploration of microbial dark matter using metagenomic sequencing data. However, there is growing concern that contamination in MAGs may substantially affect the results of downstream analysis. Current MAG decontamination tools primarily rely on marker genes and do not fully use the contextual information of genomic sequences. To overcome this limitation, we introduce Deepurify for MAG decontamination. Deepurify uses a multi-modal deep language model with contrastive learning to match microbial genomic sequences with their taxonomic lineages. It allocates contigs within a MAG to a MAG-separated tree and applies a tree traversal algorithm to partition MAGs into sub-MAGs, with the goal of maximizing the number of high- and medium-quality sub-MAGs. Here we show that Deepurify outperformed MDMcleaner and MAGpurify on simulated data, CAMI datasets and real-world datasets with varying complexities. Deepurify increased the number of high-quality MAGs by 20.0% in soil, 45.1% in ocean, 45.5% in plants, 33.8% in freshwater and 28.5% in human faecal metagenomic sequencing datasets.
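
The partitioning idea described in the abstract can be illustrated with a minimal Python sketch. This is not the Deepurify implementation: the function names, the two-rank lineages, the toy marker-gene scorer and the tie-breaking rule are all assumptions made for illustration, and the quality thresholds follow the MIMAG convention (ref. 16) rather than anything stated in this preview. A real pipeline would replace `toy_score` with an estimator such as CheckM2 (ref. 28).

```python
from collections import defaultdict

N_MARKERS = 4  # toy assumption: a complete genome carries 4 single-copy markers


def quality_tier(completeness, contamination):
    """Classify a bin with MIMAG-style thresholds (percent values, assumed)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def toy_score(contigs, markers):
    """Hypothetical stand-in for a real quality estimator."""
    found = [m for c in contigs for m in markers[c]]
    unique = len(set(found))
    completeness = 100.0 * unique / N_MARKERS
    contamination = 100.0 * (len(found) - unique) / N_MARKERS
    return completeness, contamination


def count_good(bins, markers):
    """Count bins that reach high or medium quality."""
    return sum(quality_tier(*toy_score(b, markers)) != "low" for b in bins)


def partition(contigs, lineages, markers, depth=0):
    """Split a bin along predicted lineages when that yields more good bins."""
    keep = [list(contigs)]
    if depth >= len(lineages[contigs[0]]):
        return keep  # deepest rank reached, nothing left to split on
    groups = defaultdict(list)
    for c in contigs:
        groups[lineages[c][depth]].append(c)
    split = []
    for members in groups.values():
        split.extend(partition(members, lineages, markers, depth + 1))
    # Ties favour the unsplit bin, so clean MAGs are left alone.
    if count_good(split, markers) > count_good(keep, markers):
        return split
    return keep


# Hypothetical contaminated MAG: two genomes from different phyla in one bin.
lineages = {
    "c1": ("P1", "A"), "c2": ("P1", "A"),
    "c3": ("P2", "B"), "c4": ("P2", "B"),
}
markers = {"c1": [1, 2], "c2": [3, 4], "c3": [1, 2], "c4": [3, 4]}

sub_mags = partition(["c1", "c2", "c3", "c4"], lineages, markers)
print(sub_mags)
```

In this toy case the unsplit bin scores 100% completeness but 100% contamination, so it counts as low quality, whereas splitting at the phylum rank yields two clean sub-bins, `['c1', 'c2']` and `['c3', 'c4']`. Keeping the unsplit bin on ties is one possible design choice; it prevents an already clean MAG from being fragmented.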

Fig. 1: Deepurify training procedure.
Fig. 2: The workflow of Deepurify for MAG decontamination.
Fig. 3: The averaged balanced macro F1 score across various contamination rates.
Fig. 4: The averaged balanced macro F1 score across various contamination rates.
Fig. 5: Distribution of the completeness and contamination of MAGs before and after decontamination on 227 human faecal samples by CheckM2.

Data availability

The microbial representative genomes and their associated taxonomic lineages were downloaded from the proGenomes v.2.1 database. The GTDB r202 was used to annotate the reference genomes. The SIM1 test set is available via Zenodo at https://zenodo.org/record/8343498 (ref. 51). The SIM2 test set is available via Zenodo at https://zenodo.org/records/11608439 (ref. 52). The CAMI I short reads were downloaded from the 1st CAMI Challenge Dataset 1 CAMI_low, 1st CAMI Challenge Dataset 2 CAMI_medium and 1st CAMI Challenge Dataset 3 CAMI_high from https://data.cami-challenge.org/participate. The NCBI SRA accessions of seven soil samples are SRR25158210, SRR25158221, SRR25158244, SRR25158253, SRR25158281, SRR25158363 and SRR25158536; those of the three freshwater samples are ERR4195020, ERR9631077 and SRR26420192; those of the three plant samples are SRR10968246, SRR14308228 and SRR14308230. The 11 ocean samples are from ref. 30. The human faecal metagenomic sequencing reads of the IBS-D cohort were downloaded from China National GeneBank with accession number CNP0000334.

Code availability

The source code is freely available at https://github.com/ericcombiolab/Deepurify/ (ref. 53) under an MIT licence. The versions of the software used in the study are provided in Supplementary Note 17.

References

  1. Bernard, G., Pathmanathan, J. S., Lannes, R., Lopez, P. & Bapteste, E. Microbial dark matter investigations: how microbial studies transform biological knowledge and empirically sketch a logic of scientific discovery. Genome Biol. Evol. 10, 707–715 (2018).

  2. Dam, H. T., Vollmers, J., Sobol, M. S., Cabezas, A. & Kaster, A.-K. Targeted cell sorting combined with single cell genomics captures low abundant microbial dark matter with higher sensitivity than metagenomics. Front. Microbiol. 11, 1377 (2020).

  3. Kaster, A.-K. & Sobol, M. S. Microbial single-cell omics: the crux of the matter. Appl. Microbiol. Biotechnol. 104, 8209–8220 (2020).

  4. Pratscher, J., Vollmers, J., Wiegand, S., Dumont, M. G. & Kaster, A.-K. Unravelling the identity, metabolic potential and global biogeography of the atmospheric methane-oxidizing upland soil cluster α. Environ. Microbiol. 20, 1016–1029 (2018).

  5. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).

  6. Liang, K.-C. & Sakakibara, Y. MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly. BMC Bioinforma. 22, 427 (2021).

  7. Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).

  8. Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).

  9. Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).

  10. Wu, Y.-W., Tang, Y.-H., Tringe, S. G., Simmons, B. A. & Singer, S. W. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2, 26 (2014).

  11. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).

  12. Vollmers, J., Wiegand, S. & Kaster, A.-K. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective - not only size matters! PLoS ONE 12, e0169662 (2017).

  13. Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).

  14. Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019).

  15. Mattock, J. & Watson, M. A comparison of single-coverage and multi-coverage metagenomic binning reveals extensive hidden contamination. Nat. Methods 20, 1170–1173 (2023).

  16. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).

  17. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).

  18. Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).

  19. Nayfach, S., Shi, Z. J., Seshadri, R., Pollard, K. S. & Kyrpides, N. C. New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019).

  20. Vollmers, J., Wiegand, S., Lenk, F. & Kaster, A.-K. How clear is our current view on microbial dark matter? (Re-) Assessing public MAG & SAG datasets with MDMcleaner. Nucleic Acids Res. 50, e76–e76 (2022).

  21. Drillon, G., Champeimont, R., Oteri, F., Fischer, G. & Carbone, A. Phylogenetic reconstruction based on synteny block and gene adjacencies. Mol. Biol. Evol. 37, 2747–2762 (2020).

  22. Periwal, V. & Scaria, V. Insights into structural variations and genome rearrangements in prokaryotic genomes. Bioinformatics 31, 1–9 (2015).

  23. Sczyrba, A. et al. Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).

  24. Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 178 (2021).

  25. Pan, S., Zhao, X.-M. & Coelho, L. P. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 39, i21–i29 (2023).

  26. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 139, 8748–8763 (PMLR, 2021).

  27. Wagstaff, K. et al. Constrained k-means clustering with background knowledge. In Proc. 18th International Conference on Machine Learning 1, 577–584 (Morgan Kaufmann, 2001).

  28. Chklovski, A., Parks, D. H., Woodcroft, B. J. & Tyson, G. W. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat. Methods 20, 1203–1212 (2023).

  29. Ma, B. et al. A genomic catalogue of soil microbiomes boosts mining of biodiversity and genetic resources. Nat. Commun. 14, 7318 (2023).

  30. Duncan, A. et al. Metagenome-assembled genomes of phytoplankton microbiomes from the Arctic and Atlantic oceans. Microbiome 10, 67 (2022).

  31. Faist, H. et al. Potato root-associated microbiomes adapt to combined water and nutrient limitation and have a plant genotype-specific role for plant stress mitigation. Environ. Microbiome 18, 18 (2023).

  32. Tláskal, V. et al. Metagenomes, metatranscriptomes and microbiomes of naturally decomposing deadwood. Sci. Data 8, 198 (2021).

  33. Buck, M. et al. Comprehensive dataset of shotgun metagenomes from oxygen stratified freshwater lakes and ponds. Sci. Data 8, 131 (2021).

  34. Kavagutti, V. S. et al. High-resolution metagenomic reconstruction of the freshwater spring bloom. Microbiome 11, 15 (2023).

  35. Maestre-Carballa, L., Navarro-López, V. & Martinez-Garcia, M. City-scale monitoring of antibiotic resistance genes by digital PCR and metagenomics. Environ. Microbiome 19, 16 (2024).

  36. Zhao, L. et al. A Clostridia-rich microbiota enhances bile acid excretion in diarrhea-predominant irritable bowel syndrome. J. Clin. Invest. 130, 438–450 (2020).

  37. Rodriguez-R, L. M. & Konstantinidis, K. T. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics 30, 629–635 (2014).

  38. Lai, S. et al. metaMIC: reference-free misassembly identification and correction of de novo metagenomic assemblies. Genome Biol. 23, 242 (2022).

  39. Derakhshani, H., Bernier, S. P., Marko, V. A. & Surette, M. G. Completion of draft bacterial genomes by long-read sequencing of synthetic genomic pools. BMC Genomics 21, 519 (2020).

  40. Mende, D. R. et al. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. 48, D621–D625 (2020).

  41. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).

  42. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

  43. Li, K. et al. UniFormer: unified transformer for efficient spatiotemporal representation learning. Preprint at https://doi.org/10.48550/arXiv.2201.04676 (2022).

  44. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  45. Tan, M. & Le, Q. EfficientNet: rethinking model scaling for convolutional neural networks. In Proc. 36th International Conference on Machine Learning 97, 6105–6114 (PMLR, 2019).

  46. Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution. Preprint at https://doi.org/10.48550/arXiv.2209.07947 (2022).

  47. Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M. & Hu, S.-M. Visual attention network. Comput. Vis. Media 9, 733–752 (2023).

  48. Wang, H. et al. DeepNet: scaling transformers to 1,000 layers. IEEE Trans. Pattern Anal. Mach. Intell. 46, 6761–6774 (2024).

  49. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. Preprint at https://doi.org/10.48550/arXiv.1708.02002 (2018).

  50. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).

  51. Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 1, v.1. Zenodo https://doi.org/10.5281/zenodo.8343497 (2023).

  52. Zou, B. Deepurify: a multi-modal deep language model to remove contamination from metagenome-assembled genomes. Simulation 2, v.2. Zenodo https://doi.org/10.5281/zenodo.8343505 (2024).

  53. Zou, B. A deep multi-modal deep language model for contaminant removal from metagenome-assembled genomes (code). Zenodo https://doi.org/10.5281/zenodo.11919065 (2024).

Acknowledgements

The design of the study and the collection, analysis and interpretation of the data were partially supported by the Young Collaborative Research grant (no. C2004-23Y), HMRF (grant no. 11221026), the open project of BGI-Shenzhen, Shenzhen 518000, China (grant no. BGIRSZ20220014) and HKBU Start-up Grant Tier 2 (grant no. RC-SGT2/19-20/SCI/007). We also thank the BGI Research-Shenzhen, the Research Committee of Hong Kong Baptist University, and the Interdisciplinary Research Clusters Matching Scheme for their kind support of this project.

Author information

Contributions

L.Z. conceived the study. B.Z. designed and implemented the Deepurify algorithms. L.Z. and B.Z. conceived the experiments. B.Z., Y.D. and Z.Z. conducted the experiments. B.Z. and J.W. analysed the results. B.Z. and L.Z. wrote the paper. Y.H. and X.F. revised the paper and supported the project. K.C.C. and S.S. contributed computational resources. All authors reviewed the paper.

Corresponding author

Correspondence to Lu Zhang.

Ethics declarations

Competing interests

K.C.C. and S.S. are employees of Nvidia Corporation (NVIDIA). The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Antonio Pedro Camargo and Luis Coelho for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Notes 1–18 and Tables 1–4.

Reporting Summary

Supplementary Table

Supplementary Tables 5–9.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zou, B., Wang, J., Ding, Y. et al. A multi-modal deep language model for contaminant removal from metagenome-assembled genomes. Nat Mach Intell 6, 1245–1255 (2024). https://doi.org/10.1038/s42256-024-00908-5

  • DOI: https://doi.org/10.1038/s42256-024-00908-5
