The Genome Taxonomy Database (GTDB) is an online database that maintains information on a proposed nomenclature of prokaryotes, following a phylogenomic approach based on a set of conserved single-copy proteins. In addition to resolving paraphyletic groups, this method also reassigns taxonomic ranks algorithmically, updating names in both cases.[1] Information for archaea was added in 2020,[2] along with a species classification based on average nucleotide identity.[3] Each update incorporates new genomes as well as automated and manual curation of the taxonomy.[4]
Content | |
---|---|
Description | Proposed prokaryotic nomenclature |
Contact | |
Research center | Australian Centre for Ecogenomics, University of Queensland |
Authors | |
Primary citation | PMID 30148503 |
Release date | 2018 |
Access | |
Website | gtdb |
Miscellaneous | |
License | CC BY-SA 4.0 |
Version | R07/RS207 (8 April 2022) |
Curation policy | mixed |
An open-source tool called GTDB-Tk is available to classify draft genomes into the GTDB hierarchy.[5] The GTDB system, via GTDB-Tk, has been used to catalogue not-yet-named bacteria in the human gut microbiome and other metagenomic sources.[6][7]
The GTDB was incorporated into Bergey's Manual of Systematics of Archaea and Bacteria in 2019 as its phylogenomic resource.[8]
Methodology
The genomes used to construct the phylogeny are obtained from NCBI (RefSeq and GenBank), and GTDB releases are indexed to RefSeq releases, starting with release 76. Importantly and increasingly, this dataset includes draft genomes of uncultured microorganisms obtained from metagenomes and single cells, ensuring improved genomic representation of the microbial world. All genomes are independently quality controlled using CheckM before inclusion in GTDB.[9]
Gene calling is first performed on each genome. The taxonomy is based on trees inferred with FastTree from an aligned, concatenated set of 120 single-copy marker proteins for Bacteria under the WAG model, and with IQ-TREE from a concatenated set of 53 marker proteins (since RS207; 122 before) for Archaea under the PMSF model. Additional marker sets, including concatenated ribosomal proteins and ribosomal RNA genes, are also used to cross-validate tree topologies.[9] The relative evolutionary divergence (RED) metric, which determines the taxonomic ranks used, is derived from the two main trees by the PhyloRank program.[1]
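As an illustration of how a RED value can be derived from a rooted tree, the following minimal sketch assumes the linear interpolation described in the 2018 paper:[1] the root has RED 0, every leaf has RED 1, and an internal node n receives p + (d/u) × (1 − p), where p is the RED of its parent, d is the branch length to the parent, and u is the average distance from the parent to the leaves descending from n. The `Node` class and function names are hypothetical; this is not PhyloRank's code, which also handles rooting and outlier detection.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the branch length to the parent."""
    name: str
    length: float = 0.0
    children: list[Node] = field(default_factory=list)
    red: float = 0.0


def leaf_distances(node: Node) -> list[float]:
    """Distances from `node` to each descendant leaf (a leaf yields [0.0])."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in leaf_distances(c)]


def assign_red(node: Node, parent_red: float | None = None) -> None:
    """Assign RED to `node` and its descendants by linear interpolation."""
    if parent_red is None:
        node.red = 0.0  # the root is defined to have RED 0
    else:
        dists = leaf_distances(node)
        # u: average distance from the parent to the leaves below this node
        u = node.length + sum(dists) / len(dists)
        if u > 0:
            node.red = parent_red + (node.length / u) * (1.0 - parent_red)
        else:
            node.red = 1.0  # assumption: zero-length subtrees sit at the tips
    for child in node.children:
        assign_red(child, node.red)


# Toy tree: ((B:0.5, C:0.5)X:0.5, A:1.0)root
x = Node("X", 0.5, [Node("B", 0.5), Node("C", 0.5)])
root = Node("root", 0.0, [x, Node("A", 1.0)])
assign_red(root)
print(x.red)
```

In this toy tree the internal node X lies halfway between the root and its leaves, so its RED is 0.5. Ranks would then be judged by comparing such values against the typical RED of other taxa at the same rank.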
Species are delimited using average nucleotide identity (ANI) and alignment fraction (AF), both calculated by skani. For species existing in a previous release, GTDB compares the quality and position of two genomes and may decide to switch to a new species representative genome.[9]
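The two-criterion assignment can be sketched as follows. The sketch assumes ANI and alignment-fraction values have already been computed by an external tool and are passed in as plain numbers. The `Comparison` record, the function name, and the default thresholds (95% ANI, 0.5 alignment fraction) are illustrative assumptions, not values stated here; the METHODS file of each release gives the criteria actually used.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Comparison:
    """Hypothetical record of one query-versus-representative comparison."""
    representative: str
    ani: float  # average nucleotide identity, in percent
    af: float   # alignment fraction, between 0 and 1


def assign_species(
    comparisons: Iterable[Comparison],
    min_ani: float = 95.0,  # assumed threshold, for illustration only
    min_af: float = 0.5,    # assumed threshold, for illustration only
) -> Optional[str]:
    """Return the best representative passing both thresholds, or None."""
    passing = [c for c in comparisons if c.ani >= min_ani and c.af >= min_af]
    if not passing:
        return None  # no match: the genome would seed a new species cluster
    # Prefer the highest ANI, breaking ties by alignment fraction.
    return max(passing, key=lambda c: (c.ani, c.af)).representative


hits = [
    Comparison("rep1", 96.2, 0.80),
    Comparison("rep2", 97.5, 0.30),  # high ANI but too little aligned
    Comparison("rep3", 94.0, 0.90),  # well aligned but ANI too low
]
print(assign_species(hits))
```

Requiring both criteria keeps a short, highly similar region (such as a shared mobile element) from pulling two otherwise divergent genomes into the same species.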
Taxonomy comes from the following sources:
- A previous release, if available for the neighborhood of genomic similarity.
- National Center for Biotechnology Information (NCBI) taxonomy was initially used to decorate the genome tree via tax2tree.[1]
- The 16S rRNA-based Greengenes taxonomy is used to supplement the taxonomy particularly in regions of the tree with no cultured representatives.[1]
- List of Prokaryotic names with Standing in Nomenclature (LPSN) is used as the primary taxonomic authority for establishing naming priorities.[1]
GTDB personnel curate the taxonomy from the aforementioned sources by checking them against the results of PhyloRank and the tree.
- The tree node corresponding to a taxon name may have a RED inappropriate for its rank. The name may either be moved onto another node or (by changing the Latin suffix) into a different rank.[1]
- Splitting may happen at the level of species or genera if the divergence turns out to be too high. Doing so creates new taxa.[3]
- The taxon may turn out to be polyphyletic. The curator first restricts the taxon to the clade containing its type material. A new taxon is created for each of the other clades.[1]
For each new taxon, the curators try to find a name already proposed for it in the literature. If no name has been proposed, the taxon is given a placeholder name by adding a suffix to the original name, e.g. Lactobacillus gasseri_A. After "Z" comes "AA".[1]
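The suffix sequence can be generated with a spreadsheet-column style counter (bijective base 26). The following is a minimal sketch; the function names are hypothetical, and the continuation beyond "AA" (for example "AZ" followed by "BA") is an assumption, since only the step from "Z" to "AA" is stated above.

```python
import string


def placeholder_suffix(index: int) -> str:
    """Map 0 -> 'A', 25 -> 'Z', 26 -> 'AA', 27 -> 'AB', and so on."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = []
    n = index + 1  # bijective base 26 has no zero digit, so shift to 1-based
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters.append(string.ascii_uppercase[remainder])
    return "".join(reversed(letters))


def placeholder_name(original: str, index: int) -> str:
    """Append the index-th suffix to the original taxon name."""
    return f"{original}_{placeholder_suffix(index)}"


print(placeholder_name("Lactobacillus gasseri", 0))
print([placeholder_suffix(i) for i in (24, 25, 26, 27)])
```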
References
- ^ a b c d e f g h Parks, DH; Chuvochina, M; Waite, DW; Rinke, C; Skarshewski, A; Chaumeil, PA; Hugenholtz, P (November 2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life". Nature Biotechnology. 36 (10): 996–1004. bioRxiv 10.1101/256800. doi:10.1038/nbt.4229. PMID 30148503. S2CID 52093100.
- ^ Rinke, Christian; Chuvochina, Maria; Mussig, Aaron J.; Chaumeil, Pierre-Alain; Davín, Adrián A.; Waite, David W.; Whitman, William B.; Parks, Donovan H.; Hugenholtz, Philip (21 June 2021). "A standardized archaeal taxonomy for the Genome Taxonomy Database" (PDF). Nature Microbiology. 6 (7): 946–959. doi:10.1038/s41564-021-00918-8. ISSN 2058-5276. PMID 34155373. S2CID 235595884.
- ^ a b Parks, DH; Chuvochina, M; Chaumeil, PA; Rinke, C; Mussig, AJ; Hugenholtz, P (September 2020). "A complete domain-to-species taxonomy for Bacteria and Archaea". Nature Biotechnology. 38 (9): 1079–1086. bioRxiv 10.1101/771964. doi:10.1038/s41587-020-0501-8. PMID 32341564. S2CID 216560589.
- ^ For information on each update, see relevant change logs. For notable, paper-worthy changes, see "Cite GTDB" section on the About page.
- ^ Chaumeil, PA; Mussig, AJ; Hugenholtz, P; Parks, DH (15 November 2019). "GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database". Bioinformatics. 36 (6): 1925–1927. doi:10.1093/bioinformatics/btz848. PMC 7703759. PMID 31730192.
- ^ Almeida, Alexandre; Nayfach, Stephen; Boland, Miguel; Strozzi, Francesco; Beracochea, Martin; Shi, Zhou Jason; Pollard, Katherine S.; Sakharova, Ekaterina; Parks, Donovan H.; Hugenholtz, Philip; Segata, Nicola; Kyrpides, Nikos C.; Finn, Robert D. (20 July 2020). "A unified catalog of 204,938 reference genomes from the human gut microbiome". Nature Biotechnology. 39 (1): 105–114. doi:10.1038/s41587-020-0603-3. PMC 7801254. PMID 32690973.
- ^ Nayfach, Stephen; et al. (9 November 2020). "A genomic catalog of Earth's microbiomes". Nature Biotechnology. 39 (4): 499–509. doi:10.1038/s41587-020-0718-6. PMC 8041624. PMID 33169036.
- ^ "Incorporation of Phylogenomics into BMSAB". Bergey's Manual Trust.
- ^ a b c "METHODS.txt (GTDB release 220)". data.gtdb.ecogenomic.org. 2024.
Further reading
- Pallen, MJ; Rodriguez-R, LM; Alikhan, NF (September 2022). "Naming the unnamed: over 65,000 Candidatus names for unnamed Archaea and Bacteria in the Genome Taxonomy Database" (PDF). International Journal of Systematic and Evolutionary Microbiology. 72 (9). doi:10.1099/ijsem.0.005482. PMID 36125864. – proposal to name unnamed GTDB taxa using meaningless random Latin syllables. Not affiliated with GTDB.
- Zhu, Q; Mai, U; Pfeiffer, W; et al. (2 December 2019). "Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea". Nature Communications. 10 (1): 5477. Bibcode:2019NatCo..10.5477Z. doi:10.1038/s41467-019-13443-4. PMC 6889312. PMID 31792218. – a similar independent effort that also evaluates GTDB taxonomic quality.