Galperin 2019 The COG Approach
doi: 10.1093/bib/bbx117
Advance Access Publication Date: 14 September 2017
Paper
Abstract
For the past 20 years, the Clusters of Orthologous Genes (COG) database has been a popular tool for microbial genome annotation and comparative genomics. Initially created for the purpose of evolutionary classification of protein families, the COGs have been used, apart from straightforward functional annotation of sequenced genomes, for such tasks as (i) unification of genome annotation in groups of related organisms; (ii) identification of missing and/or undetected genes in complete microbial genomes; (iii) analysis of genomic neighborhoods, in many cases allowing prediction of novel functional systems; (iv) analysis of metabolic pathways and prediction of alternative forms of enzymes; (v) comparison of organisms by COG functional categories; and (vi) prioritization of targets for structural and functional characterization. Here we review the principles of the COG approach and discuss its key advantages and drawbacks in microbial genome analysis.
Key words: comparative genomics; genome annotation; enzyme evolution; orthologs; paralogs
Michael Y. Galperin is a Lead Scientist at the NCBI’s (NIH) Computational Biology Branch. He uses comparative genomics to study evolution of membrane
energetics and bacterial metabolic and signaling pathways.
David M. Kristensen is an Assistant Professor at the University of Iowa’s Department of Biomedical Engineering. He uses tools of comparative genomics,
bioinformatics and systems biology to study evolution of genes in viruses and microbes.
Kira S. Makarova is a Staff Scientist at the NCBI’s Computational Biology Branch. Her area of expertise is comparative genomics and sequence analysis of
microbial genomes.
Yuri I. Wolf is a Lead Scientist at the National Center for Biotechnology Information in Bethesda, Maryland. His research is focused on quantitative aspects
of evolutionary and comparative genomics.
Eugene V. Koonin is a Senior Investigator and Leader of the Evolutionary Genomics Group at the National Center for Biotechnology Information at the
NIH. He studies various aspects of genome evolution.
Submitted: 30 May 2017; Received (in revised form): 1 August 2017
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
Figure 1. Evolution of the COG system. The numbers in parentheses indicate the number of bacterial, archaeal and eukaryotic genomes, respectively, included in the
respective COG release [1–6].
orthologs and paralogs. In practice, however, such methods are computationally expensive and fraught with artifacts at differ-
paralogs among all proteins encoded in the given genome. With incomplete genomes, there always remains the obvious possi-
some ribosomal proteins. The COG approach also allows separation of closely related paralogs, such as, for example, 3-isopropylmalate dehydrogenase (LeuB) and isocitrate dehydrogenase (Icd), members of COG0473 and COG0538, respectively, that in most other databases are assigned to the same family (PF00180 in Pfam, SM01329 in SMART, PS00470 in PROSITE, SSF53659 in SUPERFAMILY).
Protein family granularity in COGs
Flexible similarity cutoffs have the built-in advantage of allowing the COGs to be as wide or as narrow as dictated by the evolutionary history of a given gene family. In the above example, the LeuB/
should not allow accumulation of any intermediate that cannot be further metabolized and represents a dead end: to avoid poisoning the cell, such intermediate would have to be exported into the surrounding milieu. Likewise, an intermediate in the functional metabolic pathway needs to be either imported or synthesized within the cell. Although the possibility of ‘distributed’ pathways cannot be discarded, these simple considerations prove productive when COGs are superimposed on the metabolic map to identify the intermediates that have no known enzymes to produce or metabolize them. Identification of such gaps in pathways often suggests alternative enzymes that can be then identified experimentally [54, 57].
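The search for pathway gaps, i.e. intermediates that have no known enzymes to produce or metabolize them, can be illustrated with a minimal Python sketch. It assumes a hypothetical table that maps each COG to the substrates and products of the reaction it catalyzes. The COG identifiers (COG_A, COG_B, COG_C), the metabolite names and the function find_pathway_gaps are placeholders invented for this illustration and are not part of any COG resource.

```python
def find_pathway_gaps(reactions, cogs_present, imported=(), exported=()):
    """Find intermediates that the encoded enzymes cannot account for.

    reactions    -- dict: COG id -> (set of substrates, set of products)
    cogs_present -- set of COG ids detected in the genome
    imported     -- metabolites assumed to be taken up from the medium
    exported     -- metabolites assumed to be end products or excreted
    Returns (no_source, dead_end): metabolites that are consumed but never
    produced, and metabolites that are produced but never consumed.
    """
    produced, consumed = set(), set()
    for cog, (substrates, products) in reactions.items():
        if cog in cogs_present:  # only reactions with an enzyme in this genome
            consumed.update(substrates)
            produced.update(products)
    no_source = consumed - produced - set(imported)
    dead_end = produced - consumed - set(exported)
    return no_source, dead_end


# Hypothetical linear pathway S -> I1 -> I2 -> P (all names are placeholders).
reactions = {
    "COG_A": ({"S"}, {"I1"}),
    "COG_B": ({"I1"}, {"I2"}),
    "COG_C": ({"I2"}, {"P"}),
}
genome_cogs = {"COG_A", "COG_C"}  # no member of COG_B was detected

no_source, dead_end = find_pathway_gaps(
    reactions, genome_cogs, imported={"S"}, exported={"P"}
)
print("consumed but not produced:", no_source)
print("produced but not consumed:", dead_end)
```

In this toy case the gap lies between I1 and I2, which is the kind of observation that would prompt a search for an alternative enzyme catalyzing the missing step.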
receives a misleading annotation of its best COG hit that has a completely different domain architecture. The recent attempts to identify specific domain architectures and limit annotation transfer to proteins with the same domain combination [36] have the potential to resolve this issue.
COG annotation
Functional annotation of COGs, including assignment of COG names, is based on two key principles. First, reliance on orthologous relationships for the COG construction makes it likely, according to the ‘orthology conjecture’, that members of each COG have equivalent functions [7] (with only rare known excep-
polymerases are represented by several paralogs which form distinct orthologous families (arCOG00328, arCOG00329, arCOG15272 and others) within this archaeal phylum (all these genes are out-paralogs in Crenarchaeota). In contrast, most of those bacteria that possess the polB gene have a single copy, which is co-orthologous to all archaeal polB genes, so archaea and bacteria share only one orthologous family of polB, COG0417 (all these genes are co-orthologs among prokaryotes with several in-paralogs in archaea). Such complex relationships among homologous genes confound COG analysis because the definition of orthology becomes mutually dependent with the phyletic patterns (the definition of orthology depends on the list of organisms where these genes are present, which itself depends
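A phyletic pattern, i.e. the presence or absence of members of an orthologous family across a set of genomes, can be illustrated with a short Python sketch. The genome names, protein identifiers and helper functions below are hypothetical and chosen only to mimic a family with a single copy in some genomes and several in-paralogs in others; they are not taken from the COG database.

```python
from collections import Counter


def phyletic_pattern(members, genomes):
    """Count family members per genome.

    members -- list of (genome, protein_id) pairs assigned to one family
    genomes -- ordered list of all genomes under comparison
    Returns dict: genome -> copy number (0 means the family is absent).
    """
    counts = Counter(genome for genome, _ in members)
    return {g: counts.get(g, 0) for g in genomes}


def pattern_string(pattern, genomes):
    """Encode presence/absence as a string of 1s and 0s in genome order."""
    return "".join("1" if pattern[g] else "0" for g in genomes)


def genomes_with_paralogs(pattern):
    """List genomes that carry more than one member of the family."""
    return sorted(g for g, n in pattern.items() if n > 1)


# Hypothetical family: multiple copies in two archaea, one copy in one bacterium.
genomes = ["archaeon_1", "archaeon_2", "bacterium_1", "bacterium_2"]
members = [
    ("archaeon_1", "p1"), ("archaeon_1", "p2"), ("archaeon_1", "p3"),
    ("archaeon_2", "p4"), ("archaeon_2", "p5"),
    ("bacterium_1", "p6"),
]

pattern = phyletic_pattern(members, genomes)
print(pattern_string(pattern, genomes))
print(genomes_with_paralogs(pattern))
```

The sketch also makes the circularity noted in the text visible: the pattern depends on which proteins were assigned to the family, and that assignment in turn depends on which genomes were compared.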
divergence of the analyzed set of species from their last common ancestor. Particularly severe problems are caused by promiscuous domains, which can attract proteins to spurious COGs through significant but effectively irrelevant sequence similarity to the promiscuous domains. Although this problem can be addressed semi-automatically, e.g. by excluding the hits that cover only a small portion of the protein sequence, precise solutions still require manual intervention. On many occasions, conserved domain architectures allowed construction of consistent COGs that were not substantially affected by the presence of a shared domain (e.g. the widespread helix-turn-helix DNA-binding domain). Conversely, the diversity of domain architectures of proteins involved in microbial signal transduction and containing a number of promiscuous domains (PAS, GAF, CHASE, GGDEF, EAL and others) required splitting some of these proteins into individual domains or domain combinations. As a result, the COGs are a mix of (i) highly specific domain architectures (such as the above-mentioned response regulators), (ii) multiple domain architectures that include a single shared domain and (iii) separate promiscuous domains. To our knowledge, as of this writing, there is no complete, formal solution for optimal dissection of full-length proteins into orthologous domains. At present, for the analysis of multidomain proteins, the best practical approaches are offered by integrated domain identification tools, such as CDD (which includes the COGs) and InterPro.
Scalability of the COG approach and specialized COG collections
The basic COG approach relies first on an exhaustive all-against-all protein comparison that scales as O(n²) with the total number of proteins and then on a search of connected triangles in clusters of reciprocal best hits that scales as O(n³) with the number of proteins in the cluster [38]. Inevitably, the growth of the database outpaces the availability of the computational resources, making regular major updates of the entire COG database impractical. Several divide-and-conquer strategies have been used to circumvent this major difficulty. One approach that has been implemented in several COG updates includes accommodating the new sequences into the existing COGs first, then searching for potential new COGs among the sequences that do not fit the existing ones, and then, moving some sequences from the old COGs to the new ones [10]. The principal direction, however, has involved construction of dedicated COG collections for distinct microbial taxa. In particular, the COGs for archaea (arCOGs) went through several closely curated releases and remain up to date, having become a widely used framework for archaeal genome annotation and analysis [10, 70, 73]. As illustrated in Figure 2, detailed analysis of archaeal protein families increased the coverage of cren-, eury- and thaumarchaeal genomes by 18–20%, so that arCOGs now cover >92% of the proteins encoded in typical genomes of Crenarchaeota and Euryarchaeota. Separate projects have involved construction and analysis of COGs for Cyanobacteria and Gram-positive bacteria of the order Lactobacillales [74, 75]. The COG approach was also implemented in the database of Alignable Tight Genome Clusters (ATGC) that includes closely related bacterial and archaeal genomes [64, 76]. COGs have been constructed separately for each ATGC. These ATGC-COGs largely avoid the problems inherent in the COG analysis at larger evolutionary distances (lineage-specific paralogy, differential gene loss and differences in domain architectures) and have proved an efficient platform for various types of
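The two computational steps named in this section, collecting reciprocal best hits from an all-against-all comparison and searching for connected triangles among them, can be sketched in Python as follows. This is a simplified illustration and not the COG software: it assumes symmetric similarity scores supplied as a dictionary, treats each protein as a hypothetical (genome, name) tuple, and merges triangles that share an edge by a simple union-find. All protein names and scores are invented.

```python
from itertools import combinations


def reciprocal_best_hits(scores):
    """scores: dict (query, subject) -> score; proteins are (genome, name)."""
    best = {}  # (query, subject genome) -> (best subject, its score)
    for (q, s), score in scores.items():
        if q[0] == s[0]:
            continue  # ignore hits within the same genome
        key = (q, s[0])
        if key not in best or score > best[key][1]:
            best[key] = (s, score)
    rbh = set()
    for (q, _), (s, _) in best.items():
        back = best.get((s, q[0]))
        if back is not None and back[0] == q:
            rbh.add(frozenset((q, s)))
    return rbh


def triangles(rbh):
    """Return all triples of proteins that are pairwise reciprocal best hits."""
    neighbors = {}
    for pair in rbh:
        a, b = tuple(pair)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for a, nbrs in neighbors.items():
        for b, c in combinations(sorted(nbrs), 2):
            if c in neighbors[b]:
                found.add(frozenset((a, b, c)))
    return found


def merge_triangles(tris):
    """Merge triangles that share an edge into clusters (union-find)."""
    tris = list(tris)
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for i, tri in enumerate(tris):
        for edge in combinations(sorted(tri), 2):
            j = edge_owner.setdefault(edge, i)  # first triangle with this edge
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return list(clusters.values())


# Invented example: four genomes, plus a paralog a2 in genome A.
a1, a2 = ("A", "a1"), ("A", "a2")
b1, c1, d1 = ("B", "b1"), ("C", "c1"), ("D", "d1")
pairs = {(a1, b1): 200, (a1, c1): 180, (b1, c1): 190,
         (b1, d1): 150, (c1, d1): 160, (a2, b1): 90}
scores = {}
for (x, y), value in pairs.items():  # make the score table symmetric
    scores[(x, y)] = value
    scores[(y, x)] = value

clusters = merge_triangles(triangles(reciprocal_best_hits(scores)))
print(clusters)
```

Here the triangles a1-b1-c1 and b1-c1-d1 share the edge b1-c1 and are merged into a single cluster, whereas the paralog a2 is left out because b1 has a better hit in genome A. The nested enumeration of neighbor pairs is what makes the triangle search expensive for large clusters.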
19. Sonnhammer EL, Ostlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 2015;43:D234–9.
20. Kaduk M, Riegler C, Lemp O, et al. HieranoiDB: a database of orthologs inferred by Hieranoid. Nucleic Acids Res 2017;45:D687–90.
21. Jensen LJ, Julien P, Kuhn M, et al. eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 2008;36:D250–4.
22. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44:D286–93.
40. Dewey CN. Positional orthology: putting genomic evolutionary relationships into context. Brief Bioinform 2011;12:401–12.
41. Marchler-Bauer A, Zheng C, Chitsaz F, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 2013;41:D348–52.
42. Alexeyenko A, Tamas I, Liu G, et al. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics 2006;22:e9–15.
43. Chen F, Mackey AJ, Vermunt JK, et al. Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS One 2007;2:e383.
44. Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS
62. Prunetti L, El Yacoubi B, Schiavon CR, et al. Evidence that COG0325 proteins are involved in PLP homeostasis. Microbiology 2016;162:694–706.
63. Zallot R, Yuan Y, de Crecy-Lagard V. The Escherichia coli COG1738 member YhhQ is involved in 7-cyanodeazaguanine (preQ0) transport. Biomolecules 2017;7:12.
64. Kristensen DM, Wolf YI, Koonin EV. ATGC database and ATGC-COGs: an updated resource for micro- and macro-evolutionary studies of prokaryotic genomes and protein family annotation. Nucleic Acids Res 2017;45:D210–18.
65. Tettelin H, Masignani V, Cieslewicz MJ, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad
77. Novichkov PS, Wolf YI, Dubchak I, et al. Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol 2009;191:65–73.
78. Ran W, Kristensen DM, Koonin EV. Coupling between protein level selection and codon usage optimization in the evolution of bacteria and archaea. MBio 2014;5:e00956-14.
79. Yutin N, Colson P, Raoult D, et al. Mimiviridae: clusters of orthologous genes, reconstruction of gene repertoire evolution and proposed expansion of the giant virus family. Virol J 2013;10:106.
80. Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res 2017;