GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands
* Vincent.daubin@univ-lyon1.fr
Abstract
Introduction
Comparative genomics is a fundamental key to the inner workings of genomes. The identifica-
tion of genes and other functional elements such as regulatory regions, as well as the under-
standing of their influence on the fitness of organisms rely essentially on the detection of
signatures of natural selection within genomes [1]. In that respect, devising a model of se-
quence evolution in the absence of selective constraints (a neutral model) is critical for the de-
tection of functional sequences. Indeed, to explain the features of a given genomic segment,
comparing the fit of a neutral model to that of a model that also invokes selection (either puri-
fying or positive) is the operational way to infer evolutionary constraint and hence function.
The base composition of genomic sequences varies widely, both across species and along
chromosomes [2,3]. For instance, the genomic GC-content of cellular organisms ranges from
13% to about 75% [4,5], with vast intra-genomic heterogeneity. These large-scale variations in
base composition affect all parts of genomes, intergenic regions and genes—including all three
codon positions [6]—and hence cannot be simply explained by selective constraints on the en-
coded proteins. Determining the underlying causes (selective or neutral) of these variations
in GC-content is a major issue in genetics: if they result from selection, it implies that the geno-
mic base composition per se is an important trait that contributes to the fitness of organisms;
conversely, if these “genomic landscapes” are largely shaped by non-adaptive molecular pro-
cesses, then characterizing these processes is essential for the reliable detection of selection (see
e.g. [7]).
In mammals, the analysis of polymorphism data and substitution patterns along genomes
demonstrated that the evolution of GC-content is driven by recombination, which tends to
increase the probability of fixation of AT→GC mutations [8,9]. The impact of recombination
on base composition in these genomes is most probably due to a phenomenon known as GC-
biased gene conversion (gBGC), which favours G/C nucleotides at polymorphic sites in the
conversion of intermediates of recombination (see review in [10]). Although gBGC as a process
is unrelated to natural selection, it affects the probability of fixation of alleles in patterns similar
to selection [11]. It has been shown to be an important confounding factor, which can mimic
some marks of positive selection [7,12] and interfere with selection by actively promoting the
fixation of deleterious alleles [13,14]. The process of gBGC has been observed directly in meio-
sis products from yeast and human [15,16], and there is ample evidence, based on the analysis
of relationships between recombination rate and substitution patterns within genomes, that
this process affects many other eukaryotes [17–19].
In Bacteria and Archaea, several environmental factors potentially affecting genomic GC-
content have been proposed (such as the availability of oxygen or nitrogen in the environment,
growth temperature, or the variety of environments encountered by an organism; see for instance
[20] and references therein). Because these effects are weak and the nature of the selective pressures
remains elusive, the major force driving genomic GC-content has long been considered to
be mutational bias [21]. Recently, however, two independent analyses have shown that in virtually
all Bacteria, independently of their genomic GC-content, there is an excess of G/C→A/T
mutations [22,23]. This suggests that an unknown process, selective or neutral, is opposing this
universal mutational bias by favouring the fixation of G/C alleles. Previously, an analysis of a
large number of E. coli genomes had suggested a possible role of gBGC, based on the link between
GC-content, recombination and the organization of the chromosome in this species
[24]. However, Hildebrand et al. [23] observed that the excess of G/C→A/T mutations was still
present after removing datasets with evidence of recombination. Moreover they found no cor-
relation between GC-content and recombination rate across bacterial species. They therefore
concluded that this force could not be gBGC and hence that selection was driving an increase
of genomic GC in Bacteria. The nature of this selective advantage remains however mysterious,
though various hypotheses have been proposed [25,26].
Here we argue that the analyses performed by Hildebrand et al. [23] are not conclusive re-
garding the gBGC hypothesis, and we present evidence that variations in GC-content observed
in Bacteria are influenced by gBGC. One pervasive signature of gBGC is that genomic regions
undergoing high recombination rates will also acquire a high GC-content [6]. We thus studied
the relationship between recombination and GC-content in 20 groups of Bacteria and one
group of Archaea. This dataset covers a wide range of clades representative of the bacterial di-
versity. To avoid problems inherent to comparisons of recombination rates among species
(such as differences in polymorphism, genome samples, population size, mutation rates, and
other life-history factors), we examined the intragenomic variability for both recombination
and GC-content.
We show that in a wide variety of bacterial species, genes with evidence of recombination
have a higher GC-content. We further show that this bias towards G/C nucleotides in recom-
bining genes cannot be explained by selection on codon usage, and could interfere with the se-
lection for AT-ending optimal codons. These two observations strongly suggest that
homologous recombination, via gBGC, is a crucial factor universally influencing the nucleotide
content of genes and genomes. If confirmed, gBGC can account for several pervasive yet unex-
plained features of bacterial genomes. Finally, we emphasize that because gBGC has the ability
to both mimic and interfere with natural selection, gBGC must be considered by future studies
geared at understanding processes driving bacterial genome evolution.
Results
A universal relationship between recombination and GC% in Bacteria
In Bacteria, recombination occurs in the form of gene conversion (i.e. unidirectional transfer of
genetic material from a donor sequence towards a homologous recipient sequence). To detect
past gene conversion events in bacterial species, it is necessary to compare closely related ge-
nomes. We therefore selected in the database of homologous gene families HOGENOM (re-
lease 6) [27] all groups of closely related species or strains encompassing at least 6 sequenced
genomes. This dataset contains 20 bacterial groups and one archaeal group. For each gene fam-
ily represented in these groups, we computed i) the average GC-content at different positions
of codons and ii) the index of recombination provided by PHI [28] based on alignments of
standardized length (see methods for details). PHI is a rapid method for detecting
recombination in multiple alignments at the scale of the gene, which has been shown to be
more robust than most methods to variations in recombination rates, sequence divergence and
population dynamics [28]. We used this test to determine if homologous gene families had ex-
perienced gene conversion events among members of the taxa of interest. One important fea-
ture of this test is that it measures whether there is sufficient phylogenetic signal in an
alignment to tell if recombination has occurred. Only alignments with sufficient signal, whether
recombinant or non-recombinant, were retained for tests in the remainder of this study. We
also used three other approaches for detecting recombination, and these confirm the robust-
ness of our conclusions (see Methods and Supplementary Material).
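The per-family GC-content measures described above can be illustrated with a minimal sketch. It is not the authors' script: the function names are hypothetical, and it assumes ungapped, in-frame coding sequences in which only unambiguous A/C/G/T bases are counted.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous A/C/G/T bases of seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def gc_by_codon_position(cds):
    """GC fraction over all sites (GC) and at codon positions 1 to 3 (GC1..GC3).

    Assumption: cds is an ungapped coding sequence starting in frame;
    trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    result = {"GC": gc_fraction(cds[:usable])}
    for pos in range(3):
        # Every third base, starting at the given codon position.
        result["GC%d" % (pos + 1)] = gc_fraction(cds[pos:usable:3])
    return result


def mean_gc_of_family(sequences):
    """Average each GC measure over all sequences of one gene family."""
    per_seq = [gc_by_codon_position(seq) for seq in sequences]
    return {key: sum(d[key] for d in per_seq) / len(per_seq) for key in per_seq[0]}
```

For example, `gc_by_codon_position("ATGGCCAAG")["GC3"]` looks only at the bases G, C and G that end the three codons.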
In Eukaryotes, a general relationship between various estimates of recombination rate and
the GC% of genes has been documented and provides indirect evidence for gBGC. Our first
goal was to test this prediction in Bacteria and Archaea. To exclude a potential effect of the
number of genes in the alignment on our estimates of recombination (because alignments with
more sequences are expected to give more power to detect recombination), we focused on sin-
gle-copy genes of the core genome (i.e. genes that are present in only one copy and found in
each genome of a group). In 7 of the 21 groups, the proportion of single-copy genes of the core
genome with evidence for recombination was very low (<2% of all gene alignments tested),
suggesting that these species are clonal or nearly so (Table 1; shaded datasets in Fig. 1). Indeed,
the Burkholderia pseudomallei group, Chlamydia trachomatis, Francisella tularensis, Mycobacterium
tuberculosis and Yersinia pestis are species known to be pathogenic clonal complexes
with low polymorphism and probably very low recombination [29–33], while Brucella spp. and
Sulfolobus spp. are likely composed of ecologically isolated clades, because of their respective
lifestyles as obligate intracellular pathogens or ecotypes endemic to hot springs [34–36]. In 11 of
the 14 remaining groups, we found a significant positive difference in average GC-content at
all and/or at the third position of codons (GC3) between recombinant and non-recombinant
genes (Fig. 1). In these 11 species, the difference in GC3 is always larger than that at all positions,
suggesting that the effect of recombination on gene composition is stronger at synonymous
positions (probably because of purifying selection on protein sequences). Two notable
exceptions to this pattern are i) the bacterial species Helicobacter pylori, where GC-content
seems to be lower in recombining genes and ii) the Bacillus anthracis/cereus group, where GC
at all positions and GC3 display opposite patterns, with GC3 being higher in recombining
genes. Consistent results are obtained using alternative recombination detection methods
(S1 Fig.).
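As an illustration of how the GC3 of recombinant and non-recombinant genes could be compared within one dataset, the sketch below uses a two-sided permutation test on the difference in means. This is a substitute chosen for simplicity: the test actually used for Fig. 1 is not stated in this section, and the function name, number of permutations and seed are assumptions.

```python
import random
from statistics import mean


def permutation_test_mean_diff(group_a, group_b, n_perm=2000, seed=0):
    """Return (observed mean difference a - b, two-sided permutation p-value).

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute difference is at least the observed one,
    with the usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)
```

Here `group_a` would hold the GC3 values of genes classified as recombinant and `group_b` those of non-recombinant genes; a positive difference with a small p-value corresponds to the pattern reported for most datasets.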
| Dataset | Taxon name | Nb. of genomes | Total nb. of core genes | Nb. of core genes of size 900bp | Nb. of recombinant core genes | Nb. of non-recombinant core genes | Nb. of unclassified core genes* | Mean GC of core genes (%) | Mean GC3 of core genes (%) | Mean GC of core genes of size 900bp (%) | Mean GC3 of core genes of size 900bp (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Brsp | Brucella spp. | 9 | 1675 | 776 | 0 | 422 | 354 | 58.1 | 67.3 | 58.8 | 68.6 |
| Ftul | Francisella tularensis | 8 | 1015 | 469 | 0 | 359 | 110 | 33.3 | 21.8 | 33.8 | 21.9 |
| Mtub | Mycobacterium tuberculosis complex | 7 | 2222 | 1078 | 0 | 91 | 987 | 65.6 | 79.3 | 66.1 | 80.3 |
| Bmal | Burkholderia pseudomallei group | 9 | 1482 | 781 | 4 | 628 | 149 | 67.9 | 88.3 | 68.7 | 89.7 |
| Ypes | Yersinia pestis | 11 | 2017 | 1073 | 7 | 679 | 387 | 48.7 | 48.9 | 49.3 | 49.9 |
| Ctra | Chlamydia trachomatis | 13 | 772 | 391 | 6 | 376 | 9 | 41.6 | 34.6 | 41.8 | 34.6 |
| Susp | Sulfolobus spp. | 8 | 1386 | 547 | 8 | 465 | 74 | 34.7 | 29.1 | 35.4 | 29.2 |
| Bcen | Burkholderia cenocepacia complex (BCC) | 8 | 1939 | 936 | 106 | 829 | 1 | 67.4 | 87.4 | 68.2 | 89.1 |
| Saur | Staphylococcus aureus | 15 | 1464 | 668 | 92 | 571 | 5 | 33.5 | 22.4 | 34.2 | 22.2 |
| Blon | Bifidobacterium longum | 6 | 1006 | 600 | 73 | 358 | 169 | 61.3 | 77.2 | 61.9 | 78.2 |
| Abau | Acinetobacter spp. | 6 | 1429 | 638 | 115 | 516 | 7 | 40.3 | 29.9 | 40.8 | 30.2 |
| Cjej | Campylobacter jejuni | 6 | 1048 | 501 | 93 | 403 | 5 | 31.0 | 19.4 | 31.6 | 19.6 |
| Cbot | Clostridium botulinum | 8 | 1715 | 730 | 168 | 562 | 0 | 28.6 | 16.8 | 29.1 | 16.4 |
| Spyo | Streptococcus pyogenes | 12 | 1051 | 496 | 119 | 370 | 7 | 38.9 | 31.6 | 39.6 | 32.2 |
| Spne | Streptococcus pneumoniae | 13 | 1090 | 507 | 119 | 368 | 20 | 41.3 | 36.9 | 42.0 | 37.4 |
| Sent | Salmonella enterica | 14 | 2121 | 975 | 323 | 652 | 0 | 53.4 | 59.4 | 54.6 | 61.6 |
| Lisp | Listeria spp. | 8 | 1631 | 678 | 328 | 350 | 0 | 38.1 | 29.6 | 38.8 | 29.5 |
| Bant | Bacillus anthracis/cereus group | 17 | 1730 | 727 | 422 | 305 | 0 | 36.3 | 26.3 | 37.0 | 26.5 |
| Nmen | Neisseria meningitidis | 8 | 1156 | 552 | 339 | 213 | 0 | 54.4 | 63.9 | 55.3 | 65.7 |
| Ecol | Escherichia coli | 35 | 1357 | 619 | 442 | 177 | 0 | 52.3 | 56.3 | 53.3 | 58.0 |
| Hpyl | Helicobacter pylori | 14 | 995 | 467 | 334 | 133 | 0 | 39.9 | 42.8 | 40.4 | 43.3 |
Total number of genes in the core-genome, as well as the number of core genes classified as recombinant and non-recombinant based on PHI analysis
and unclassified ones (genes with insufficient signal to test for recombination, excluded from comparison tests) are indicated. The mean proportion of GC
(all positions of codons) and GC3 (third position of codons) of core genes are shown for each dataset.
doi:10.1371/journal.pgen.1004941.t001
possibly AU-ending optimal codons if gBGC is strong enough to override selection on codon
usage) should display the opposite pattern. We therefore looked specifically at the frequency of
the different types of codons, i.e. optimal and non-optimal, in recombining and non-
recombining genes.
There is a debate over the best way to define optimal codons, based on their over-representation
in either ribosomal protein genes (RP), or genes with the highest codon bias (HCB) [38–40].
We therefore analyzed the frequency of GC-ending and AU-ending optimal codons (FopGC
and FopAU) and non-optimal codons (FnopGC and FnopAU) according to both RP and HCB
definitions (Fig. 2, S3 Fig., and S4 Fig., respectively). The higher GC3 of recombining genes
means that GC-ending codons are over-represented in recombining genes, but this is true for
optimal GC-ending codons in only 2 (RP optimal codons) or 4 (HCB optimal codons) species
out of 11. This effect is hence essentially due to non-optimal codons (FnopGC is significantly
higher in recombining genes than non-recombining genes in respectively 9 and 8 species for
RP and HCB definitions). Moreover, optimal AU-ending codons are significantly depleted in
recombining genes for 8 (resp. 5) species for RP (resp. HCB) codons. In fact, only two species,
S. pyogenes and Neisseria meningitidis (using the HCB method; only S. pyogenes using the RP
method) exhibit a pattern partially compatible with the selection hypothesis presented above.
All species display either an increase of FnopGC and/or a decrease of FopAU in recombining
genes, a fact that cannot be explained by a higher efficiency of selection. This pattern excludes
the possibility of pervasive selection for codon usage promoting a better adaptation to the pool
of tRNA for genes in regions of high recombination, but is compatible with the predictions of
gBGC.
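The four codon categories compared above (FopGC, FopAU, FnopGC, FnopAU) can be tallied for a single gene as in the following sketch. It makes several simplifying assumptions that are not taken from the paper: frequencies are expressed relative to all counted codons, stop codons and single-codon amino acids are not excluded, and the set of optimal codons is supplied by the caller (the RP and HCB definitions themselves are not implemented here).

```python
from collections import Counter


def codon_class_frequencies(cds, optimal_codons):
    """Frequencies of optimal/non-optimal codons split by GC- or AU-ending.

    cds: in-frame coding sequence (DNA or RNA alphabet).
    optimal_codons: iterable of codons treated as optimal (hypothetical input).
    Returns a dict with keys FopGC, FopAU, FnopGC and FnopAU.
    """
    cds = cds.upper().replace("T", "U")
    optimal = {codon.upper().replace("T", "U") for codon in optimal_codons}
    usable = len(cds) - len(cds) % 3
    counts = Counter()
    total = 0
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if any(base not in "ACGU" for base in codon):
            continue  # skip codons containing ambiguous bases
        status = "Fop" if codon in optimal else "Fnop"
        ending = "GC" if codon[2] in "GC" else "AU"
        counts[status + ending] += 1
        total += 1
    if total == 0:
        raise ValueError("no countable codons")
    return {key: counts[key] / total
            for key in ("FopGC", "FopAU", "FnopGC", "FnopAU")}
```

Averaging each of these values over recombining and over non-recombining genes, and taking the difference, gives quantities of the kind plotted in Fig. 2.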
Figure 2. Effect of recombination on codon usage of core genes. Difference in frequency of optimal (fop)
or non-optimal (fnop) codons (as determined by RP method) in recombining and non-recombining genes in
each dataset for AU-ending (reddish colors) and GC-ending (bluish colors) codons. The recombination status
of genes was determined as in Fig. 1; only datasets with more than 10% recombining genes are shown. A
positive difference indicates that recombining genes are enriched in a category of codons, while a negative
difference indicates depletion. Stars indicate significance of a Student’s t-test between recombining and
non-recombining genes. Colored boxes on the right of dataset names indicate the numbers of AU-ending and
GC-ending optimal or non-optimal codons used by the taxon (detailed in S2 Table). Symbols and dataset
abbreviations as in Fig. 1; shading is only used to distinguish between datasets. Note that
variations in fopGC and fnopAU (resp. fopAU and fnopGC) are not totally independent (typically, for all
amino acids encoded by two synonymous codons, if the optimal codon is GC-ending, the non-optimal one is
AU-ending).
doi:10.1371/journal.pgen.1004941.g002
Discussion
The link between GC-content and recombination is independent from
constraints on protein coding genes
Our results suggest that recombination affects the GC-content of genes in most bacterial phyla.
We analyzed genes of the core genome to compare the base composition of genes with or with-
out evidence of recombination. In 11 of the 14 species in which a significant amount of recom-
bination could be detected, we observed that the GC-content (measured either at the third
codon position or along the entire coding region) is higher among recombining genes com-
pared to other (hereafter labeled as “non-recombining”) core genes.
Figure 3. Correlations between GC3 and estimates of recombination rate. For each dataset, core genes are sorted by increasing GC3 and pooled into
20 classes of equal size. Correlations between the mean GC3 and mean recombination rate of each class are reported. (A) Correlation between GC3 and
coalescent-based estimates of recombination rate for Homo sapiens (Hsap) and Streptococcus pyogenes (Spyo). For Hsap, recombination rate is expressed
as cM·Mb⁻¹; a subset of 600 genes out of the 16,346 human genes is shown as a representative of 1,000 random samples (mean R² is 55%, see Main Text).
For Spyo, recombination rate is expressed as the value of the rho parameter in ClonalOrigin [41] inferences, which is scaled by arbitrary coalescent time units; a
subset of 437 genes out of 478 core genes was used, after removal of the 41 genes showing no convergence of the rho estimate (correlation on the full 478
core genes yields an R² of 31%, see S1 Text). (B) Correlation between GC3 and PREC, the proportion of genes detected as recombinant by the PHI test [28] in the
class, for all 14 bacterial datasets showing sufficient evidence of recombination (Table 1).
doi:10.1371/journal.pgen.1004941.g003
Several hypotheses have been proposed to explain the variations in GC-content among bacterial
genomes [25]. Several studies have revealed that the genomic GC-content of bacterial genomes
is always higher than what would be predicted from mutational bias [22,23,43]. Hence,
it seems inescapable that some other evolutionary force is driving the genomic GC-content towards
higher values in virtually all bacterial species, except maybe for the most AT-rich genomes
[22,43]. Recombination is known to enhance the efficiency of selection by breaking
linkage among sites. It is therefore conceivable that our results merely reveal a universal selective
pressure favoring GC-rich alleles. But the mechanism underlying such selection would
have to be acting more efficiently on synonymous sites than non-synonymous sites because the
difference of GC% between recombining and non-recombining genes is higher at the third position
of codons. This excludes potential selection on amino-acid content. One selectable trait
that may influence synonymous positions is codon usage. If optimal codons tended to be GC-rich,
recombination could drive GC% higher by favoring the adaptation of genes to better
Similar to selection, the impact of gBGC on genome evolution depends on its intensity
relative to genetic drift, and becomes negligible when B ≪ 1. There is evidence that besides Ne
and r, both L and b0 can vary strongly across species. For example, in budding yeast, when a
GC/AT heterozygote site is involved in a gene conversion event, the GC-allele is transmitted
with a probability pGC = 0.507 (which is significantly higher than the expected Mendelian
transmission ratio; [15]), whereas in humans, a recent analysis of gene conversion tracts
associated with non-crossover recombination showed that GC-alleles are transmitted with a
probability pGC = 0.70 [16]. Thus, the parameter b0 (b0 = 2·pGC − 1) is about 30 times higher in
humans than in yeast. Conversely, gene conversion tracts are on average about 4 times longer
in yeast than in mammals [15,44]. Thus, for the same population-scaled recombination rate
(Ne r), the intensity of gBGC would be about 7 times stronger in humans than in yeast. This
example illustrates that because of variations in L and b0, the gBGC model does not necessarily
predict a good correlation between population-scaled recombination rate and GC-content
across species. In fact, to test the predictions of the gBGC model, it is more appropriate to in-
vestigate correlations between base composition and recombination rate within genomes, so
that the other parameters (Ne, L and b0) can be controlled for.
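The yeast versus human comparison above can be checked with a few lines of arithmetic. The sketch assumes, as the reasoning in the paragraph implies but this section does not state as a formula, that gBGC intensity is proportional to the product Ne · r · L · b0, so that at equal Ne r the ratio of intensities is the ratio of L · b0. Function and variable names are hypothetical.

```python
def conversion_bias(p_gc):
    """b0 = 2 * pGC - 1, where pGC is the GC-allele transmission probability."""
    return 2.0 * p_gc - 1.0


def relative_intensity(p_gc_a, p_gc_b, tract_ratio_b_over_a):
    """Intensity of gBGC in species a relative to species b at equal Ne * r.

    Assumption: intensity is proportional to L * b0, with L the mean
    conversion tract length; tract_ratio_b_over_a is L_b / L_a.
    """
    bias_ratio = conversion_bias(p_gc_a) / conversion_bias(p_gc_b)
    return bias_ratio / tract_ratio_b_over_a


# Values quoted in the text: pGC = 0.70 in humans, 0.507 in yeast,
# and tracts about 4 times longer in yeast than in mammals.
b0_human = conversion_bias(0.70)    # about 0.40
b0_yeast = conversion_bias(0.507)   # about 0.014
bias_ratio = b0_human / b0_yeast    # roughly 30-fold
intensity_ratio = relative_intensity(0.70, 0.507, 4.0)  # roughly 7-fold
```

Under this proportionality assumption, the roughly 30-fold difference in b0 divided by the 4-fold difference in tract length gives the roughly 7-fold difference in intensity quoted in the text.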
across bacterial species? The gBGC model predicts that, all else being equal, the present-day
GC-content of a genome should directly reflect its average recombination rate over long
evolutionary time. To test this prediction, it is important to take into account two difficulties.
First, recombination rates measured in extant populations reflect recent events (more recent
than the coalescent time, i.e. of the order of Ne generations), and hence may not correspond to
the average recombination rate over times necessary for genomic GC-content to evolve signifi-
cantly (i.e. inter-species divergence times). Second, the precision in the estimate depends on
the physical scale at which recombination is measured. To illustrate these points let us consider
the human genome, where the impact of gBGC is well documented [6]. At the gene scale, the
correlation between present-day recombination rate (measured in a 10-kb window, centered
on the middle of the gene, using HapMap genetic map [42]), and the gene GC-content (at
third codon position) is significant but quite weak (R² = 0.035, p < 10⁻¹⁰). However, at the 1 Mb
scale the correlation is much stronger (R² = 0.15; [9]). Furthermore, when GC-content variations
and recombination rates are measured over the same evolutionary time period, the
correlation becomes very strong (1 Mb scale: R² = 0.64; [45]).
To test whether the impact of gBGC in bacteria was comparable to what is observed in
mammals, we first focused on Streptococcus pyogenes, one of the species for which the signature
of gBGC is strong (Fig. 1, see also Sup. Mat.). We computed the population-scaled recombina-
tion rate (rho) for each gene of the core genome, using ClonalOrigin [41]. The correlation be-
tween rho and the GC-content at third codon position of each gene (GC3) is higher than what
is observed in humans (R² = 0.087; p < 10⁻⁹). This result is remarkable, given that recombination
rates are measured here at the gene scale (typically about 1 kb).
To go further, we binned the data set into 20 groups of genes according to their GC3, and
we computed the correlation between the average GC3 and the average rho of each bin. Our
reasoning is that by computing average values, we should get estimates of rho that are more
robust to measurement noise and to possible temporal variations in recombination rates.
Using this approach, we observed a strong correlation between the GC3 and recombination
(R² = 0.60; Fig. 3A). To investigate the amplitude of this relation in the other bacterial species
studied here, we used PREC (an index based on the proportion of genes in a bin that are detected
as recombinant by PHI), which provides a good estimate of the average recombination rate in
a bin, and is much easier to compute than ClonalOrigin’s rho. We observed a significant correlation
for 11 of the 14 species, and these significant correlations were positive in all cases, with
R² values ranging from 0.24 to 0.68 (on average R² = 0.43; Fig. 3B). Thus, in many bacteria,
the average recombination rate in a bin is a good predictor of its average GC-content. We
performed an analogous analysis in human genes using jackknife sampling of the dataset to
scale it to the size of bacterial datasets. The average correlation observed in humans is R² = 0.55
(Fig. 3A). Hence, in many bacteria, the intensity of the relationship between GC-content and
recombination is comparable to that observed in humans, where the impact of gBGC on base
composition is known to be strong [45]. This is consistent with the hypothesis that in the long
term, the gBGC process can have a major influence on the evolution of base composition
in bacteria.
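The binning procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: genes are sorted by GC3, split into equal-size classes, and the squared Pearson correlation is computed between class means. When the per-gene rate is a 0/1 flag for "detected as recombinant by PHI", the class mean is the PREC index; when it is a per-gene rho estimate, the class mean is the average rho. Function names are hypothetical.

```python
from statistics import mean


def bin_means(gc3, rate, n_bins=20):
    """Sort genes by GC3, split into n_bins classes, return (mean GC3, mean rate) per class."""
    pairs = sorted(zip(gc3, rate), key=lambda pair: pair[0])
    n = len(pairs)
    if n < n_bins:
        raise ValueError("fewer genes than bins")
    bins = []
    for k in range(n_bins):
        # Integer boundaries keep class sizes equal to within one gene.
        chunk = pairs[k * n // n_bins:(k + 1) * n // n_bins]
        bins.append((mean(g for g, _ in chunk), mean(r for _, r in chunk)))
    return bins


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("zero variance")
    return sxy * sxy / (sxx * syy)


def binned_r2(gc3, rate, n_bins=20):
    """R² between class-mean GC3 and class-mean rate (rho, or PREC for 0/1 flags)."""
    bins = bin_means(gc3, rate, n_bins)
    return r_squared([g for g, _ in bins], [r for _, r in bins])
```

Averaging within classes reduces per-gene measurement noise, which is the stated motivation for comparing class means rather than individual genes.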
exceeds the number of A/T→G/C changes. Given that genomic base composition strongly
fluctuates over long evolutionary times (as demonstrated by the wide distribution of GC-content
across bacterial species), it is not surprising that many genomes are not at equilibrium.
However, what is unexpected is that this non-stationarity predominantly leads to losing
GC-content: a priori, at the scale of the entire bacterial biodiversity, one would expect to observe as
many GC-increasing genomes as GC-decreasing genomes. One possible explanation is that the
observed excess of G/C→A/T changes among closely related genomes corresponds to polymorphic
mutations, which eventually do not reach fixation because either selection or gBGC favors
GC-alleles over AT-alleles [23]. Hildebrand and colleagues observed an excess of G/C→A/T
changes even in bacterial genomes that show no evidence of recent recombination population-wise
(i.e. Ne r = 0) [23]. They therefore rejected the hypothesis that the fixation bias could be
due to gBGC. However, this conclusion relies on one important assumption: that the Ne r
parameter measured in extant populations reflects the long-term average recombination rate.
In fact it is expected that Ne (and hence Ne r) should fluctuate over time, as populations go
through periods of bottlenecks and expansion. Immediately after a bottleneck, Ne r would be
close to 0, and hence genomes should accumulate G/C→A/T substitutions. However, in the
long term, this can be compensated by an increase in GC-content when the effective population
size becomes larger (and hence B > 1). Thus, the base composition of genomes may remain
above the mutational equilibrium in the long term, even if many lineages go through periods
during which Ne r is null (and hence B = 0). Interestingly, the rare species for which the long-term
recombination rate is effectively null (typically endosymbiotic bacteria) generally have
very AT-rich genomes [46], as predicted by the gBGC hypothesis.
[50,51]. In C. jejuni and B. longum, however, we observe patterns similar to the other bacterial
datasets that are in support of the existence of gBGC, indicating that it does not depend on the
presence of a typical MutSL complex. The existence of gBGC in Bacteria and Eukaryotes
however suggests that it may have been present in the last universal common ancestor of all
cellular life forms (LUCA). Unfortunately, the only archaeal dataset matching our criteria was
a group of Sulfolobus sp. genomes for which our analysis showed little evidence of recombination
(Table 1), in agreement with the previously described isolation of endemic clades in this
group [36].
We do not claim that gBGC is the sole determinant of base composition in bacterial
genomes: in fact there is evidence that mutation patterns vary significantly among species [22],
and these variations are expected to contribute to differences in genome base composition.
However, the model we propose provides a simple explanation for several important results of
comparative bacterial genomics. First, gBGC can explain why bacterial genomes can maintain
a high GC-content, even though mutation is universally AT-biased [22,23]. Second, gBGC can
explain some of the intragenomic heterogeneity in GC-content observed in bacterial genomes.
Indeed, we observe that genes with evidence of recombination display on average substantially
higher GC-content than other genes. This observation also suggests that the probability of re-
combination is variable among genes in the genome, as proposed under some speciation mod-
els [53]. Furthermore, given that recently acquired genes tend to be AT-rich, gBGC would
contribute to their progressive enrichment in GC-content [54,55].
of homologous proteins as defined in HOGENOM, and averaged these scores to obtain a global
similarity score s. We then took 1-s as a distance and selected groups of genomes with at least
6 members and a distance lower than 0.15. This criterion left 21 groups of species representing
a variety of bacterial and archaeal species (Table 1). For each gene family, CDSs were extracted
using ACNUC Python API [56] and re-aligned with MUSCLE [57] using default parameters.
GC% and codon frequencies were computed using custom Python scripts.
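The group selection step (distance 1 - s, at least 6 members, distance lower than 0.15) could be sketched as below. The way genomes are actually assembled into groups is not specified in this section, so the single-linkage grouping used here (genomes joined whenever their pairwise distance is under the threshold) is an assumption, as are the function name and the input format.

```python
def select_groups(genomes, similarity, max_distance=0.15, min_size=6):
    """Group genomes linked by pairwise distance 1 - s below max_distance.

    genomes: list of genome identifiers.
    similarity: dict mapping (genome_a, genome_b) tuples to the global
        similarity score s in [0, 1] (hypothetical input format).
    Returns the sorted member lists of groups with at least min_size genomes.
    """
    neighbours = {g: set() for g in genomes}
    for (a, b), s in similarity.items():
        if 1.0 - s < max_distance:
            neighbours[a].add(b)
            neighbours[b].add(a)
    groups, seen = [], set()
    for start in genomes:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            g = stack.pop()
            if g in component:
                continue
            component.add(g)
            stack.extend(neighbours[g] - component)
        seen |= component
        if len(component) >= min_size:
            groups.append(sorted(component))
    return groups
```

A stricter alternative would require every pair within a group to fall under the threshold; the source text does not say which criterion was applied.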
Supporting Information
S1 Text. The supplementary text contains a discussion of the various approaches available
for detecting recombination in bacteria and their pertinence for the analysis of our dataset.
(DOC)
S1 Fig. Comparison of several recombination detection programs on the inferred effect of
recombination on core genes GC-content. Legend as in Fig. 1.
(EPS)
S2 Fig. Difference of GC% between intergenes located next to recombining vs.
non-recombining genes. Difference in average GC-content of intergenic regions around each
single-copy core gene which had a conclusive result of the PHI test. Individual intergenic GC%
values were computed as the average of both flanking intergenes when they were 50 bp or longer,
measured over at most 400 bp from the reference gene. Intergenes were classified as
recombinant or non-recombinant according to the PHI test result of the neighbouring gene. A
positive difference indicates that intergenes next to recombinant families are enriched in GC.
Symbols and dataset abbreviations as in Fig. 1.
(EPS)
Acknowledgments
We thank Gergely Szöllősi, Paul Sharp, Eduardo PC Rocha, Nicolas Lartillot and Adam Eyre-Walker
for their thoughtful comments and advice about controls to include in the present
work. We thank Simon Penel and Vincent Miele for providing scripts and data used to process
the genomic data from HOGENOM database.
Author Contributions
Conceived and designed the experiments: FL LD VD. Performed the experiments: FL SP VD.
Analyzed the data: FL SP LD TB VD. Wrote the paper: FL TB XN LD VD.
References
1. Doolittle WF (2013) Is junk DNA bunk? A critique of ENCODE. Proc Natl Acad Sci 110: 5294–5300.
doi: 10.1073/pnas.1221376110 PMID: 23479647
2. Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, et al. (1985) The mosaic genome of warm-blood-
ed vertebrates. Science 228: 953–958. PMID: 4001930
3. Sueoka N (1962) On the genetic basis of variation and heterogeneity of DNA base
composition. Proc Natl Acad Sci 48: 582–592. PMID: 13918161
4. McCutcheon JP, Moran NA (2010) Functional convergence in reduced genomes of bacterial symbionts
spanning 200 My of evolution. Genome Biol Evol 2: 708–718. doi: 10.1093/gbe/evq055 PMID:
20829280
5. Pagani I, Liolios K, Jansson J, Chen I-MA, Smirnova T, et al. (2012) The Genomes OnLine Database
(GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic
Acids Res 40: D571–D579. doi: 10.1093/nar/gkr1100 PMID: 22135293
6. Duret L, Galtier N (2009) Biased gene conversion and the evolution of mammalian genomic land-
scapes. Annu Rev Genomics Hum Genet 10: 285–311. doi: 10.1146/annurev-genom-082908-150001
PMID: 19630562
7. Ratnakumar A, Mousset S, Glémin S, Berglund J, Galtier N, et al. (2010) Detecting positive selection
within genomes: the problem of biased gene conversion. Philos Trans R Soc Lond B Biol Sci 365:
2571–2580. doi: 10.1098/rstb.2010.0007 PMID: 20643747
8. Spencer CC, Deloukas P, Hunt S, Mullikin J, Myers S, et al. (2006) The influence of recombination on
human genetic diversity. PLoS Genet 2: e148. doi: 10.1371/journal.pgen.0020148 PMID: 17044736
9. Duret L, Arndt PF (2008) The Impact of Recombination on Nucleotide Substitutions in the Human Ge-
nome. PLoS Genet 4: e1000071. doi: 10.1371/journal.pgen.1000071 PMID: 18464896
10. Webster MT, Hurst LD (2012) Direct and indirect consequences of meiotic recombination: implications
for genome evolution. Trends Genet 28: 101–109. doi: 10.1016/j.tig.2011.11.002 PMID: 22154475
11. Nagylaki T (1983) Evolution of a finite population under gene conversion. Proc Natl Acad Sci U S A 80:
6278–6281. PMID: 6578508
12. Galtier N, Duret L (2007) Adaptation or biased gene conversion? Extending the null hypothesis of
molecular evolution. Trends Genet 23: 273–277. doi: 10.1016/j.tig.2007.03.011 PMID: 17418442
13. Galtier N, Duret L, Glémin S, Ranwez V (2009) GC-biased gene conversion promotes the fixation of
deleterious amino acid changes in primates. Trends Genet 25: 1–5. doi: 10.1016/j.tig.2008.10.011
PMID: 19027980
14. Necşulea A, Popa A, Cooper DN, Stenson PD, Mouchiroud D, et al. (2011) Meiotic recombination
favors the spreading of deleterious mutations in human populations. Hum Mutat 32: 198–206.
doi: 10.1002/humu.21407 PMID: 21120948
15. Mancera E, Bourgon R, Brozzi A, Huber W, Steinmetz LM (2008) High-resolution mapping of meiotic
crossovers and non-crossovers in yeast. Nature 454: 479–485. doi: 10.1038/nature07135 PMID:
18615017
16. Williams A, Genovese G, Dyer T, Truax K, Jun G, et al. (2014) Non-crossover gene conversions show
strong GC bias and unexpected clustering in humans. bioRxiv: 009175. doi: 10.1101/009175
17. Capra JA, Pollard KS (2011) Substitution patterns are GC-biased in divergent sequences across the
metazoans. Genome Biol Evol 3: 516–527. doi: 10.1093/gbe/evr051 PMID: 21670083
18. Escobar JS, Glémin S, Galtier N (2011) GC-biased gene conversion impacts ribosomal DNA evolution
in vertebrates, angiosperms, and other eukaryotes. Mol Biol Evol 28: 2561–2575.
doi: 10.1093/molbev/msr079 PMID: 21444650
19. Pessia E, Popa A, Mousset S, Rezvoy C, Duret L, et al. (2012) Evidence for widespread GC-biased
gene conversion in eukaryotes. Genome Biol Evol 4: 675–682. doi: 10.1093/gbe/evs052 PMID:
22628461
20. Foerstner KU, von Mering C, Hooper SD, Bork P (2005) Environments shape the nucleotide composi-
tion of genomes. EMBO Rep 6: 1208–1213. doi: 10.1038/sj.embor.7400538 PMID: 16200051
21. Sueoka N (1988) Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci
U S A 85: 2653–2657. PMID: 3357886
22. Hershberg R, Petrov DA (2010) Evidence That Mutation Is Universally Biased towards AT in Bacteria.
PLoS Genet 6: e1001115. doi: 10.1371/journal.pgen.1001115 PMID: 20838599
23. Hildebrand F, Meyer A, Eyre-Walker A (2010) Evidence of Selection upon Genomic GC-Content in
Bacteria. PLoS Genet 6: e1001107. doi: 10.1371/journal.pgen.1001107 PMID: 20838593
24. Touchon M, Hoede C, Tenaillon O, Barbe V, Baeriswyl S, et al. (2009) Organised genome dynamics in
the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 5: e1000344.
doi: 10.1371/journal.pgen.1000344 PMID: 19165319
25. Rocha EPC, Feil EJ (2010) Mutational Patterns Cannot Explain Genome Composition: Are There Any
Neutral Sites in the Genomes of Bacteria? PLoS Genet 6: e1001104.
doi: 10.1371/journal.pgen.1001104 PMID: 20838590
26. Raghavan R, Kelkar YD, Ochman H (2012) A selective force favoring increased G+C content in
bacterial genes. Proc Natl Acad Sci U S A 109: 14504–14507. doi: 10.1073/pnas.1205683109 PMID: 22908296
27. Penel S, Arigon A-M, Dufayard J-F, Sertier A-S, Daubin V, et al. (2009) Databases of homologous gene
families for comparative genomics. BMC Bioinformatics 10: S3. doi: 10.1186/1471-2105-10-S6-S3
PMID: 19534752
28. Bruen TC, Philippe H, Bryant D (2006) A Simple and Robust Statistical Test for Detecting the Presence
of Recombination. Genetics 172: 2665–2681. doi: 10.1534/genetics.105.048975 PMID: 16489234
29. Ussery DW, Kiil K, Lagesen K, Sicheritz-Pontén T, Bohlin J, et al. (2009) The genus Burkholderia:
analysis of 56 genomic sequences. Genome Dyn 6: 140–157. doi: 10.1159/000235768 PMID: 19696499
30. Joseph SJ, Didelot X, Gandhi K, Dean D, Read TD (2011) Interplay of recombination and selection in
the genomes of Chlamydia trachomatis. Biol Direct 6: 28. doi: 10.1186/1745-6150-6-28 PMID:
21615910
31. Achtman M, Zurth K, Morelli G, Torrea G, Guiyoule A, et al. (1999) Yersinia pestis, the cause of plague,
is a recently emerged clone of Yersinia pseudotuberculosis. Proc Natl Acad Sci U S A 96: 14043–14048.
doi: 10.1073/pnas.96.24.14043 PMID: 10570195
32. Keim P, Johansson A, Wagner DM (2007) Molecular Epidemiology, Evolution, and Ecology of
Francisella. Ann N Y Acad Sci 1105: 30–66. doi: 10.1196/annals.1409.011 PMID: 17435120
33. Supply P, Marceau M, Mangenot S, Roche D, Rouanet C, et al. (2013) Genomic analysis of smooth tu-
bercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis. Nat
Genet 45: 172–179. doi: 10.1038/ng.2517 PMID: 23291586
34. Wattam AR, Williams KP, Snyder EE, Almeida NF Jr, Shukla M, et al. (2009) Analysis of ten Brucella
genomes reveals evidence for horizontal gene transfer despite a preferred intracellular lifestyle. J Bac-
teriol 191: 3569–3579. doi: 10.1128/JB.01767-08 PMID: 19346311
35. Whitaker RJ, Grogan DW, Taylor JW (2003) Geographic Barriers Isolate Endemic Populations of Hy-
perthermophilic Archaea. Science 301: 976–978. doi: 10.1126/science.1086909 PMID: 12881573
36. Reno ML, Held NL, Fields CJ, Burke PV, Whitaker RJ (2009) Biogeography of the Sulfolobus islandicus
pan-genome. Proc Natl Acad Sci U S A 106: 8605–8610. doi: 10.1073/pnas.0808945106 PMID: 19435847
37. McVean GA, Charlesworth B (2000) The effects of Hill-Robertson interference between weakly select-
ed mutations on patterns of molecular evolution and variation. Genetics 155: 929–944. PMID:
10835411
38. Hershberg R, Petrov DA (2009) General Rules for Optimal Codon Choice. PLoS Genet 5: e1000556.
doi: 10.1371/journal.pgen.1000556 PMID: 19593368
39. Wang B, Shao Z-Q, Xu Y, Liu J, Liu Y, et al. (2011) Optimal Codon Identities in Bacteria: Implications
from the Conflicting Results of Two Different Methods. PLoS ONE 6: e22714.
doi: 10.1371/journal.pone.0022714 PMID: 21829489
40. Hershberg R, Petrov DA (2012) On the Limitations of Using Ribosomal Genes as References for the
Study of Codon Usage: A Rebuttal. PLoS ONE 7: e49060. doi: 10.1371/journal.pone.0049060 PMID:
23284622
41. Didelot X, Lawson D, Darling A, Falush D (2010) Inference of Homologous Recombination in Bacteria
Using Whole Genome Sequences. Genetics 186: 1435–1449. doi: 10.1534/genetics.110.120121
PMID: 20923983
42. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplo-
type map of over 3.1 million SNPs. Nature 449: 851–861. doi: 10.1038/nature06258 PMID: 17943122
43. Balbi KJ, Rocha EPC, Feil EJ (2009) The Temporal Dynamics of Slightly Deleterious Mutations in
Escherichia coli and Shigella spp. Mol Biol Evol 26: 345–355. doi: 10.1093/molbev/msn252 PMID:
18984902
44. Cole F, Baudat F, Grey C, Keeney S, de Massy B, et al. (2014) Mouse tetrad analysis provides insights
into recombination mechanisms and hotspot evolutionary dynamics. Nat Genet 46: 1072–1080. doi:
10.1038/ng.3068 PMID: 25151354
45. Munch K, Mailund T, Dutheil JY, Schierup MH (2014) A fine-scale recombination map of the human-
chimpanzee ancestor reveals faster change in humans than in chimpanzees and a strong impact of
GC-biased gene conversion. Genome Res 24: 467–474. doi: 10.1101/gr.158469.113 PMID:
24190946
46. Moran NA (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108: 583–586.
PMID: 11893328
47. Lynch M (2010) Rate, molecular spectrum, and consequences of human mutation. Proc Natl Acad Sci
U S A 107: 961–968. doi: 10.1073/pnas.0912629107 PMID: 20080596
48. Lesecque Y, Mouchiroud D, Duret L (2013) GC-Biased Gene Conversion in Yeast Is Specifically Asso-
ciated with Crossovers: Molecular Mechanisms and Evolutionary Significance. Mol Biol Evol 30: 1409–
1419. doi: 10.1093/molbev/mst056 PMID: 23505044
49. Glémin S (2010) Surprising fitness consequences of GC-biased gene conversion: I. Mutation load and
inbreeding depression. Genetics 185: 939–959. doi: 10.1534/genetics.110.116368 PMID: 20421602
50. Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, et al. (1998) Free recombination within
Helicobacter pylori. Proc Natl Acad Sci U S A 95: 12619–12624. doi: 10.1073/pnas.95.21.12619 PMID: 9770535
51. Falush D, Kraft C, Taylor NS, Correa P, Fox JG, et al. (2001) Recombination and mutation during
long-term gastric colonization by Helicobacter pylori: Estimates of clock rates, recombination size, and
minimal age. Proc Natl Acad Sci U S A 98: 15056–15061. doi: 10.1073/pnas.251396098 PMID: 11742075
52. Lin Z, Nei M, Ma H (2007) The origins and early evolution of DNA mismatch repair genes—multiple hor-
izontal gene transfers and co-evolution. Nucleic Acids Res 35: 7591–7603. doi: 10.1093/nar/gkm921
PMID: 17965091
53. Retchless AC, Lawrence JG (2007) Temporal fragmentation of speciation in bacteria. Science 317:
1093–1096. doi: 10.1126/science.1144876 PMID: 17717188
54. Daubin V, Lerat E, Perrière G (2003) The source of laterally transferred genes in bacterial genomes.
Genome Biol 4: R57. doi: 10.1186/gb-2003-4-9-r57 PMID: 12952536
55. Daubin V, Ochman H (2004) Bacterial Genomes as New Gene Homes: The Genealogy of ORFans in
E. coli. Genome Res 14: 1036–1042. doi: 10.1101/gr.2231904 PMID: 15173110
56. Gouy M, Gautier C, Attimonelli M, Lanave C, di Paola G (1985) ACNUC—a portable retrieval system
for nucleic acid sequence databases: logical and physical designs and usage. Comput Appl Biosci
1: 167–172. PMID: 3880341
57. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space com-
plexity. BMC Bioinformatics 5: 113. doi: 10.1186/1471-2105-5-113 PMID: 15318951
58. Jakobsen IB, Easteal S (1996) A program for calculating and displaying compatibility matrices as an
aid in determining reticulate evolution in molecular sequences. Comput Appl Biosci 12: 291–295.
PMID: 8902355
59. Smith JM (1992) Analyzing the mosaic structure of genes. J Mol Evol 34: 126–129. PMID: 1556748
60. Sawyer S (1989) Statistical tests for detecting gene conversion. Mol Biol Evol 6: 526–538. PMID:
2677599