Genome-wide nucleotide patterns and potential mechanisms of genome divergence following domestication in maize and soybean

Michael Scanlon

Wang et al. Genome Biology (2019) 20:74
https://doi.org/10.1186/s13059-019-1683-6

RESEARCH Open Access

Jinyu Wang1, Xianran Li1*, Kyung Do Kim2, Michael J. Scanlon3, Scott A. Jackson2, Nathan M. Springer4 and Jianming Yu1*

Abstract

Background: Plant domestication provides a unique model to study genome evolution. Many studies have been conducted to examine genes, genetic diversity, genome structure, and epigenome changes associated with domestication. Interestingly, domesticated accessions have significantly higher [A] and [T] values across genome-wide polymorphic sites than accessions sampled from the corresponding progenitor species. However, the relative contributions of different genomic regions to this genome divergence pattern and the underlying mechanisms have not been well characterized.

Results: Here, we investigate the genome-wide base-composition patterns by analyzing millions of SNPs segregating among 100 accessions from a teosinte-maize comparison set and among 302 accessions from a wild-domesticated soybean comparison set. We show that non-genic SNPs make a greater contribution than genic SNPs to the [AT]-increase observed between wild and domesticated accessions in maize and soybean. The separation between wild and domesticated accessions in [AT] values is significantly enlarged in non-genic and pericentromeric regions. Motif frequency and sequence context analyses show that the motifs (PyCG) related to the solar-UV signature are enriched in these regions, particularly when they are methylated. Additional analysis using population-private SNPs also implicates these motifs in relatively recent mutations. With base composition across polymorphic sites as a genome phenotype, genome scans identify a set of putative candidate genes involved in UV damage repair pathways.
Conclusions: The [AT]-increase is more pronounced in genomic regions that are non-genic, pericentromeric, within transposable elements, methylated, and low in recombination. Our findings establish important links among UV radiation, mutation, DNA repair, methylation, and genome evolution.

Keywords: Evolution, Domestication, Base composition, Genome divergence, Solar UV, Mutation, Methylation, UV damage repair

* Correspondence: lixr@iastate.edu; jmyu@iastate.edu
1 Department of Agronomy, Iowa State University, Ames, IA 50011, USA
Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Domestication is a special mode of evolution. Extensive studies have been carried out to understand the domestication process and the genes associated with morphological changes [1–4]. Meanwhile, genomes also went through profound changes during domestication. Recent studies documented differences in base composition and mutation rate between populations separated by either domestication or a demographic bottleneck event, which provide novel insights into genome evolution [5–7]. Further investigation of DNA base composition, mutation spectrum, and the potential relationship between them is necessary to advance our understanding of genome changes.

DNA base composition is an essential genomic feature.
Remarkable research progress has been made in several areas, including codon usage bias [8], isochore structure [9, 10], and GC-biased gene conversion [11]. Recently, a conserved base-composition pattern, in which modern accessions have significantly higher [A] and [T] values across genome-wide polymorphic sites than accessions sampled from their wild relatives, was discovered in natural populations across multiple species [5]. Different genomic regions exhibit different patterns for a number of genomic features such as DNA methylation, GC content, and recombination rate [12–15]. It would therefore be informative to study the regional variation of the genome change pattern, as captured by base composition summarized from polymorphic sites.

Mutation is a fundamental factor that generates the genetic variation upon which selection, drift, and recombination act. Point mutations are the most common type of mutation, with a universal bias toward high AT, primarily due to the high rate of transition mutations [16]. Recent studies indicated that mutation rates can differ across populations [6, 7]. Divergence in mutation rates or types between populations is one of several factors that affect genetic variation patterns [17]. Analysis of data from multiple mutation accumulation experiments, accumulating either spontaneous or induced mutations, demonstrated higher [AT] values across mutation sites in derived lines at the end of the experiments than in ancestral lines, which suggested that base-composition differences can emerge from mutation sites [5]. Characterization of the mutation spectrum in natural populations may help unravel the mechanism of genome change [18].

Organisms have evolved a complex system to monitor and repair DNA damage caused by various exogenous mutagens, such as solar-ultraviolet (UV) radiation, reactive oxygen species, excess boron or aluminum, and pathogenic microorganisms [19].
For plants, solar-UV radiation is a major exogenous mutagen as they use sunlight for photosynthesis. The primary solar UV-induced DNA lesion, cyclobutane pyrimidine dimers (CPDs), induces C→T base transitions [20]. CPDs distort the DNA double-helix structure, which influences DNA unwinding and DNA replication and ultimately affects the cell cycle [21]. Using sets of SNPs private to different human populations, a recent study suggested that UV might have been involved in the mutation spectrum change [6].

DNA methylation is a major form of epigenetic modification in many eukaryotic genomes. It not only regulates gene expression and silences transposons and repeat sequences, but also affects mutation rates [22–25]. DNA methylation occurs in CG, CHG (where H = A, C, or T), and CHH sequence contexts in plants [26, 27]. The relative frequency of DNA methylation varies substantially along the chromosome. DNA methylation is primarily distributed in the heterochromatin regions that are mostly composed of tandem repeats and transposons [12, 13, 28]. It has been shown that methylation of the cytosine residue at CpG sites can enhance solar UV-promoted CPD formation [25]. We can therefore ask whether the rate of solar UV-induced mutations varies along the chromosome and whether base composition can summarize such variation.

In this study, we report findings from the analysis of millions of SNPs segregating among 100 accessions from a teosinte-maize comparison set and among 302 accessions from a wild-domesticated soybean comparison set. First, we show that higher [AT] values in domesticated accessions relative to wild accessions, or [AT]-increase, are consistently observed for SNPs found in either genic or non-genic portions of the genome, with non-genic SNPs having a greater contribution to the [AT]-increase. Interestingly, we also find that the divergence in [AT] is much higher in pericentromeric regions than in other regions.
All 4 sequence motifs related to solar-UV signature consistently have higher frequencies in methylated regions than in unmethylated regions. With a different set of population-private SNPs, we also discover the enrichment of mutations related to the solar-UV signature in domesticated accessions. Using base composition across polymorphic sites as the phenotype, genome-wide scans identify a set of putative candidate genes involved in UV damage repair pathways. Together, these findings suggest that solar-UV radiation and differential mutation repair are critical components in the genome divergence process that resulted in domesticated accessions' greater numbers of nucleotides A and T.

Results

Genome-wide [AT]-increase

We obtained a set of 8,852,678 SNPs in 100 teosinte-maize accessions and a set of 4,870,265 SNPs in 302 wild-domesticated soybean accessions from the original studies [29, 30] (Additional file 1: Figure S1). These SNPs are designated as common SNP sets and are used to compute the genome-wide base composition across polymorphic sites without concern for sampling issues due to low minor allele frequency (MAF) or high missing rate [5]. For each accession, we obtained an [AT] value calculated as the fraction of SNP alleles that are either base A or T. The choice of [AT] was based on the finding that single-strand parity rule 2 (PR2) applies to base composition across SNPs [5], i.e., [A] ≈ [T] and [G] ≈ [C]. In both maize and soybean sets, wild and domesticated (including landraces and improved cultivars) accessions are clearly separated by [AT] (P value is 1.49e−14 for maize and 1.02e−44 for soybean). Domesticated accessions have more nucleotides A and T at the polymorphic sites (Fig. 1), a pattern termed [AT]-increase (domesticated > wild accessions).
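The per-accession [AT] value defined above (the fraction of SNP alleles that are A or T) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `at_fraction`, the toy genotypes, and the simplification that each accession carries one called allele per SNP (with missing calls skipped) are assumptions made for illustration only.

```python
from typing import List, Optional

BASES = {"A", "C", "G", "T"}
AT_BASES = {"A", "T"}


def at_fraction(alleles: List[Optional[str]]) -> float:
    """Return [AT] for one accession: the share of called alleles that are A or T.

    Assumption: one allele per SNP site per accession; None (or any
    non-ACGT symbol) marks a missing call and is ignored.
    """
    called = [a for a in alleles if a in BASES]
    if not called:
        raise ValueError("accession has no called alleles")
    return sum(a in AT_BASES for a in called) / len(called)


# Toy genotypes (hypothetical, not data from the study).
genotypes = {
    "wild_1": ["G", "C", "A", "G", None, "C"],
    "landrace_1": ["A", "T", "A", "G", "T", "C"],
}

for name, alleles in genotypes.items():
    print(name, round(at_fraction(alleles), 3))
```

Under PR2 ([A] ≈ [T] and [G] ≈ [C]), this single number summarizes the base composition of an accession across polymorphic sites, which is why one value per accession is sufficient for the wild versus domesticated comparison.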
In maize, the average value of [AT] in wild accessions is 0.380 (SD = 0.006), while the average values of [AT] in landraces and improved cultivars are 0.414 (SD = 0.003) and 0.417 (SD = 0.003), respectively. In soybean, the average value of [AT] in wild accessions is 0.449 (SD = 0.010), while the average values of [AT] in landraces and improved cultivars are 0.492 (SD = 0.006) and 0.494 (SD = 0.003), respectively.

Base composition among DNA substitution types

Bi-allelic SNPs can be grouped into six substitution types without defining the ancestral allele. To further understand the consistent [AT]-increase pattern, we examined the contribution to [AT]-increase from each substitution type (Additional file 1: Figure S2). Two transition types, A/G and C/T, are the major types detected in maize and soybean genomes, with each having a frequency of ~ 34%, much higher than the frequency expected by chance (i.e., ~ 17% or 1/6). The average frequency for each of the four transversion types (A/C, A/T, C/G, and G/T) is less than 10%, with the C/G type being the least frequent.

We then calculated the base-composition value across polymorphic sites conditional on each substitution type. The contribution to the overall [AT]-increase varied among substitution types (Fig. 2). Two transition types (A/G and C/T) are the major contributors because of their high frequencies and because the majority of wild accessions possess the G or C allele for these types, while the domesticated accessions typically have A or T. For A/C and G/T types, significant base-composition differences between wild and domesticated groups are also evident, and the proportional increase in A or T is similar to that of A/G and C/T types. However, because of their relatively low frequencies (≤9%), these two types contribute less to the overall [AT]-increase. Neither A/T nor C/G type contributes to the overall [AT]-increase.
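As a small illustration of the grouping described above, the following Python sketch assigns each bi-allelic SNP to one of the six unordered substitution types and tabulates their frequencies. The function names and the toy SNP list are hypothetical; the sketch only assumes that each SNP is given as a pair of alleles with no ancestral state.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

SUBSTITUTION_TYPES = ("A/G", "C/T", "A/C", "A/T", "C/G", "G/T")


def substitution_type(allele1: str, allele2: str) -> str:
    """Map an unordered allele pair to one of the six substitution types."""
    pair = "/".join(sorted((allele1.upper(), allele2.upper())))
    if pair not in SUBSTITUTION_TYPES:
        raise ValueError(f"not a bi-allelic SNP: {allele1}/{allele2}")
    return pair


def type_frequencies(snps: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Frequency of each substitution type among the given SNPs."""
    counts = Counter(substitution_type(a, b) for a, b in snps)
    total = sum(counts.values())
    return {t: counts[t] / total for t in SUBSTITUTION_TYPES}


# Toy SNP list (hypothetical): two A/G, one C/T, one G/T.
snps = [("A", "G"), ("G", "A"), ("C", "T"), ("T", "G")]
print(type_frequencies(snps))
```

Because A/T and C/G sites carry an A or T (respectively a C or G) whichever allele is present, they cannot shift an accession's [AT] value, which is consistent with the observation that neither type contributes to the overall [AT]-increase.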
Base-composition pattern at different genomic regions

It is known that different genomic regions exhibit different patterns for a number of genomic features including DNA methylation, GC content, and recombination rate [12–15], which naturally led us to investigate the base-composition distribution in different parts of the genome. To facilitate this, we first classified the genome-wide SNPs into 7 genomic annotation sets: intergenic, gene-proximal, UTRs, synonymous, missense, intronic, and other genic [31, 32] (Fig. 3). Intergenic SNPs are the most common group (65.1% in maize and 57.4% in soybean), followed by gene-proximal (15.3% in maize and 26.6% in soybean) and intronic (10.9% in maize and 8.98% in soybean). Because the numbers of SNPs were relatively small in several genomic annotation sets, we combined the intergenic and gene-proximal sets to form the non-genic SNP set and combined the remaining five genomic annotation sets to form the genic SNP set. The non-genic set contains 7,120,981 SNPs in maize and 4,088,443 SNPs in soybean, and the genic set contains 1,731,687 SNPs in maize and 781,822 SNPs in soybean.

Fig. 1 Genome-wide base-composition pattern in maize and soybean. a The distribution of [AT] among 8.9 million SNPs in 100 maize accessions. b The distribution of [AT] across 4.9 million SNPs in 302 soybean accessions

Fig. 2 Base-composition distribution at each of the six substitution types in maize (a) and soybean (b). The genome-wide SNPs were classified into six substitution types. Base composition was calculated for each accession conditional on each substitution type. The red arrows show the [A] and [T] increase at A/G and C/T substitution types

Fig. 3 The distribution of base composition calculated with genic and non-genic SNPs in maize (a) and soybean (b). The upper panel shows the distribution of SNPs across different genomic annotation sets. The middle panel shows the base-composition distribution with genic and non-genic SNPs. The lower panel illustrates the base-composition distribution across 5 Mb segments with genic and non-genic SNPs. To simplify the plot in the lower panel, landraces and improved cultivars are combined to be the domesticated group to compare with the wild group. For each accession, base composition was calculated using a moving average approach with a 5-Mb window size and a 4-Mb step size. Each point in the plot represents the mean [AT] of the specified group across a 5-Mb window. The gray bar in the bottom indicates the position of the pericentromeric region, and the red bar within the gray bar shows the position of the centromeric region

We calculated the [AT] value for each accession from genic and non-genic SNP sets separately. [AT] of domesticated accessions is consistently higher than that of wild accessions in both genic and non-genic SNPs (Fig. 3). However, non-genic SNPs have greater contributions to the [AT]-increase, and the [AT]-difference between wild and domesticated accessions for non-genic SNPs is about twice that for genic SNPs. Since the total number of non-genic SNPs is 4 to 5.5 times larger than that of genic SNPs, we randomly sampled an equal number of SNPs from the genic and non-genic SNP sets to obtain the [AT] value for comparison. We obtained a consistent trend from 100 subsets, demonstrating that the greater contribution to the overall [AT]-increase from non-genic SNPs is not only because of their larger number but also due to their higher proportional increase in [AT] relative to genic SNPs (Additional file 1: Figure S3).

As expected, further comparisons of the [AT] distribution between missense, synonymous, and intergenic SNP sets (Additional file 1: Figure S4A-B) show that while the [AT]-differences between wild and domesticated accessions from the missense and synonymous SNP sets are similar to each other, both of them are smaller than that from the intergenic SNP set.
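The equal-size subsampling check described above can be sketched as follows. This is an illustrative Python sketch rather than the authors' procedure: it assumes each SNP is summarized by a hypothetical pair `(wild_at, domesticated_at)`, the fraction of accessions in each group carrying A or T at that site, and the subsample size `n` and seed are arbitrary choices. Only the number of subsets (100) comes from the text.

```python
import random
from statistics import mean
from typing import List, Tuple

Snp = Tuple[float, float]  # (wild_at, domesticated_at) at one site (assumed format)


def at_difference(snps: List[Snp]) -> float:
    """[AT]-difference (domesticated minus wild) averaged over a SNP set."""
    return mean(d for _, d in snps) - mean(w for w, _ in snps)


def subsampled_differences(
    genic: List[Snp],
    non_genic: List[Snp],
    n: int,
    n_subsets: int = 100,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Draw n SNPs from each set, n_subsets times, and return paired differences."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_subsets):
        genic_sample = rng.sample(genic, n)          # sampling without replacement
        non_genic_sample = rng.sample(non_genic, n)
        results.append((at_difference(genic_sample), at_difference(non_genic_sample)))
    return results


# Toy data (hypothetical): non-genic SNPs show twice the genic difference.
genic = [(0.40, 0.42)] * 50
non_genic = [(0.40, 0.44)] * 200
pairs = subsampled_differences(genic, non_genic, n=30)
print(sum(ng > g for g, ng in pairs), "of", len(pairs), "subsets: non-genic > genic")
```

Matching the number of SNPs in each draw removes set size as an explanation, so any remaining gap between the two differences reflects a higher proportional [AT]-increase in one set.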
We also evaluated the impact of allele frequency on the different contributions from genic and non-genic SNPs. Compared with the non-genic SNP set, the genic SNP set generally has more SNPs with high MAF and fewer SNPs with low MAF (Additional file 1: Figure S5), which may suggest that genic regions are more conserved than non-genic regions.

Both species are known to have low gene density in pericentromeric regions [29, 33, 34], so we examined the [AT] distribution with genic and non-genic SNPs along chromosomes (Fig. 3, Additional file 1: Figures S6-S8). Along each chromosome, (a) higher [AT] in the domesticated group than in the wild group is consistently observed for both genic and non-genic SNPs; (b) the [AT]-difference between the domesticated and wild groups for non-genic SNPs is generally larger than that for genic SNPs; and (c) [AT] for each accession is higher for genic SNPs than for non-genic SNPs. More interestingly, the divergence in [AT] is significantly enlarged in the pericentromeric regions, especially for non-genic SNPs.

Because of the dramatic difference in [AT] distributions between pericentromeric regions and chromosome arms, we further compared the [AT] distribution between genic and non-genic regions conditional on the pericentromeric regions and chromosome arms separately (Additional file 1: Figure S4C-F). The [AT]-difference between wild and domesticated accessions in non-genic regions is consistently about twice that in genic regions for both pericentromeric regions and chromosome arms. Moreover, the [AT]-difference between wild and domesticated accessions in the pericentromeric regions is much larger than that in chromosome arms, which is true for both non-genic and genic SNPs.
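The along-chromosome profiles discussed above rely on a moving-window summary; the Fig. 3 legend gives a 5-Mb window and a 4-Mb step. The Python sketch below shows one way such a windowed [AT] could be computed for a single accession. The function name, the input format (parallel lists of positions and alleles), and the choice to skip empty windows are assumptions, not details taken from the study.

```python
from typing import List, Optional, Tuple

BASES = {"A", "C", "G", "T"}
AT_BASES = {"A", "T"}


def windowed_at(
    positions: List[int],
    alleles: List[Optional[str]],
    chrom_length: int,
    window: int = 5_000_000,
    step: int = 4_000_000,
) -> List[Tuple[int, float]]:
    """Return (window_start, [AT]) for each window containing called alleles.

    A window covers [start, start + window); consecutive windows overlap
    by window - step base pairs. Missing calls (None) are ignored.
    """
    results = []
    start = 0
    while start < chrom_length:
        end = start + window
        in_window = [
            a for p, a in zip(positions, alleles)
            if start <= p < end and a in BASES
        ]
        if in_window:  # assumption: windows without called SNPs are skipped
            at = sum(a in AT_BASES for a in in_window) / len(in_window)
            results.append((start, at))
        start += step
    return results


# Toy accession (hypothetical): three SNPs on a 10-Mb chromosome.
print(windowed_at([1_000_000, 4_500_000, 6_000_000], ["A", "A", "G"], 10_000_000))
```

Averaging these per-accession window values within the wild and domesticated groups would give one point per window and group, from which the [AT]-difference along the chromosome follows directly.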
We speculate that the enlarged [AT]-difference in the pericentromeric regions is associated with the fact that these regions mainly consist of repetitive sequences and transposable elements [33–36] that are mostly arranged in heterochromatin [37] and generally have low recombination rates [30, 33, 34, 38]. To test this speculation, we first examined the distribution of base composition in transposable element (TE) and non-transposable element (non-TE) regions. The [AT]-differences in TE regions are much larger than those in non-TE regions (Additional file 1: Figure S9). We then plotted the [AT]-difference and crossover rate for maize and recombination rate for soybean along each chromosome (Additional file 1: Figures S10-S12). Negative correlations between [AT]-difference and crossover/recombination rate are significant for all 10 maize chromosomes and 18 of the 20 soybean chromosomes. We observed relatively low and fluctuating MAF within the pericentromeric regions (Additional file 1: Figures S13-S15), which may be related to the low efficiency of purging deleterious alleles [39].

As the phenotypic differences between the wild and domesticated accessions were mainly shaped by artificial selection, we then compared the base-composition distribution in domestication selective sweep and non-selective sweep regions to test whether the domestication process was partially responsible for the detected base-composition difference. The [AT]-difference between wild and domesticated accessions in selective sweep regions is much larger than that in non-selective sweep regions (Additional file 1: Figure S16). This suggests that the domestication process indeed has an effect on the detected base-composition difference at the polymorphic sites.
Enrichment of motifs related to solar-UV signature surrounding SNP sites

To test whether SNPs occurred more frequently in certain sequence contexts, we first classified SNPs into 96 tri-nucleotide motifs by considering 1 base directly adjacent upstream and downstream of the SNP site. Then, we examined the frequency and the enrichment of tri-nucleotide motifs. If SNPs occurred randomly in the genome, then with 96 possible motifs the expected frequency of each motif is 0.010 (≈ 1/96) and the expected ratio between the frequency of a motif at SNP sites and that at random sites is 1.000. We detected 14 common motifs between maize and soybean with both frequencies and ratios greater than expected, and 11 out of 14 were from A/G and C/T transition types (Fig. 4). In both species, 5′-CNG-3′ (N is the polymorphic site) around C/T type has the highest ratio, with 2.007 in maize and 2.228 in soybean. In addition, 5′-TNG-3′ is enriched around C/T type, with a ratio of 1.477 in maize and 1.311 in soybean. Because most wild accessions have the C allele at C/T type (Fig. 2), these SNPs were more likely changed from 5′-PyCG-3′ to 5′-PyTG-3′, where Py is either pyrimidine C or T. Correspondingly, the reverse and complementary motifs 5′-CNG-3′ and 5′-CNA-3′ around A/G type are also overrepresented, which suggests a high chance of 5′-CGPu-3′ to 5′-CAPu-3′ mutations, where Pu is purine G or A.

Solar UV induces CPDs preferentially at 5-methylcytosine-containing dipyrimidine sites (5′-Py-mCG-3′), termed the solar-UV signature [20, 40]. Thus, the overrepresented motif 5′-PyCG-3′ around C/T (the reverse and complementary motif 5′-CGPu-3′ around A/G) is the same as the solar-UV signature if C is methylated. Hereafter, we refer to the four aforementioned sequence motifs as motifs related to solar-UV signature.

Fig. 4 Motif enrichment analysis in maize and soybean. The upper panel illustrates the composition of tri-nucleotide motifs and the induction of motifs related to solar-UV signature on double strand DNA. Each tri-nucleotide motif is formed by incorporating reference base pairs immediately upstream and downstream to the middle SNP site. Ninety-six motifs are divided into 6 classes based on the substitution types of the SNP. The lightning sign shows the mutation site, and the purple rectangle highlights the motifs related to the solar-UV signature. The middle and lower panels show the frequency of motif in maize and soybean, respectively. For each motif, the left bar is the overall frequency around SNP sites, while the right bar is the overall frequency of the same motif around random sites (an empirical 95th percentile drawn from 100 random sample scenarios). The colored bar indicates the common motif between maize and soybean with a frequency greater than 1/96, and the frequency of motif at SNP sites is higher than that at random sites. The bar with a star on top highlights the motif related to solar-UV signature

In both species, mCG level is negatively correlated with gene density and enriched in the pericentromeric regions [12, 13, 28, 41], which suggests that motifs related to solar-UV signature might occur more frequently outside of the genic regions and be overrepresented in the pericentromeric regions. To test this, we performed two sets of comparisons: frequencies of motifs related to solar-UV signature between genic and non-genic SNPs, and between SNPs from pericentromeric and non-pericentromeric regions. As expected, all four motifs related to solar-UV signature have higher frequencies within non-genic SNP sets than genic SNP sets, and they have higher frequencies among SNPs from pericentromeric regions than among SNPs from non-pericentromeric regions (Additional file 1: Figures S17-S18).
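The tri-nucleotide motif classification and the SNP-to-random frequency ratio described above can be sketched in Python as follows. The function names and toy numbers are hypothetical, and the frequencies at random sites are treated as a given input because the exact sampling of random sites is not restated here. The four solar-UV-related motifs are the ones named in the text.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Motif = Tuple[str, str]  # (substitution type, context), e.g. ("C/T", "CNG")

# The four motifs related to the solar-UV signature, as listed in the text.
UV_MOTIFS = {("C/T", "CNG"), ("C/T", "TNG"), ("A/G", "CNG"), ("A/G", "CNA")}


def snp_motif(upstream: str, alleles: Tuple[str, str], downstream: str) -> Motif:
    """Label a SNP by its substitution type and its flanking reference bases."""
    sub_type = "/".join(sorted(a.upper() for a in alleles))
    return sub_type, f"{upstream.upper()}N{downstream.upper()}"


def motif_frequencies(motifs: Iterable[Motif]) -> Dict[Motif, float]:
    """Relative frequency of each motif among the given sites."""
    counts = Counter(motifs)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def enrichment_ratio(
    snp_freq: Dict[Motif, float], random_freq: Dict[Motif, float]
) -> Dict[Motif, float]:
    """Ratio of motif frequency at SNP sites to that at random sites."""
    return {
        m: f / random_freq[m]
        for m, f in snp_freq.items()
        if random_freq.get(m, 0) > 0
    }


motif = snp_motif("C", ("T", "C"), "G")
print(motif, motif in UV_MOTIFS)
# Toy frequencies (hypothetical): a motif twice as common at SNP sites.
print(enrichment_ratio({motif: 0.02}, {motif: 0.01}))
```

Six substitution types times 16 flanking-base combinations give the 96 motifs, so a motif is flagged as enriched in this scheme when its frequency exceeds 1/96 and its ratio exceeds 1.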
We then examined the role of DNA methylation by calculating the frequencies of motifs related to solar-UV signature conditional on methylated and unmethylated regions [42–44]. We found that all four motifs related to solar-UV signature consistently have higher frequencies in methylated regions than in unmethylated regions with genic SNPs, non-genic SNPs, SNPs from pericentromeric regions, and SNPs from non-pericentromeric regions (Additional file 1: Figures S17-S18). This suggests a higher probability of C→T and G→A transitions, potentially stimulated by DNA methylation, in non-genic regions and pericentromeric regions, which agrees with our findings of non-genic SNPs' larger contributions to [AT]-difference and the enlarged [AT]-difference in pericentromeric regions.

Mutation spectra of population-private variation

The finding that sequence motifs related to solar-UV signature are enriched in common SNP sets encouraged us to verify the pattern with rare segregating SNPs that occurred as relatively recent mutations [45, 46]. Therefore, following the procedures laid out in a previous study [6], we compiled private SNP sets that contain 2,651,790 population-private SNPs in maize and 681,791 population-private SNPs in soybean from the original studies [29, 30] (Additional file 1: Figure S1). These private SNP sets are different from the earlier common SNP sets, with a small overlap. A SNP is considered population private if it is segregating in 1 lineage but fixed for the ancestral allele in the other lineages. For each crop, we obtained 4 population-private SNP sets: private wild SNPs (PW), private domesticated SNPs (PD), private landrace SNPs (PL), and private improved cultivar SNPs (PI).
PW designates SNPs that are segregating in the wild group but are fixed for the ancestral allele in the landrace and the improved cultivar groups; PL designates SNPs that are segregating in the landrace group but are fixed for the ancestral allele in the wild and the improved cultivar groups, and similarly for the other private SNP sets. Analyzing such SNPs enables us to assess the mutation rate difference among different lineages after they diverged from the most recent common ancestor.

Next, we tested the differences in the spectrum of mutagenesis between populations with population-private variants as described in the previous study [6]. With ancestral allele information, population-private SNPs can be partitioned into 96 mutation types by considering the base immediately upstream and downstream of the variable site [47]. In both species, most C→T transitions have higher frequencies in PL and PI than in PW, which agrees with the previous finding in a human study [6] (Fig. 5). This observation suggests that although the genomes of domesticated and wild accessions continued to evolve after divergence, domesticated accessions might have a higher C→T mutation rate. We observed a higher rate of mutations related to solar-UV signature, 5′-TCG-3′→5′-TTG-3′ and 5′-CCG-3′→5′-CTG-3′ (hereafter abbreviated as TCG→T and CCG→T), in domesticated accessions than in wild accessions (Fig. 5, Additional file 1: Figure S19). For instance, in maize, TCG→T has a frequency of 3.45% in PL and 3.55% in PI compared with 2.99% in PW. The higher frequencies of TCG→T and CCG→T in domesticated than in wild accessions are consistent for all chromosomes (Additional file 1: Figure S20). We further split each population-private SNP set into genic-private SNPs and non-genic-private SNPs, and into pericentromeric-private SNPs and non-pericentromeric-private SNPs.
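The definition of a population-private SNP and the frequency comparison between private sets can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the procedure of the cited study: the input format (derived-allele counts per group), the function names, and the use of a 2×2 Pearson χ2 without continuity correction are illustrative choices. The fold difference follows the expression given in the Fig. 5 legend, and the 3.45% versus 2.99% values are the maize TCG→T frequencies quoted above.

```python
from typing import Dict


def is_private(derived: Dict[str, int], sizes: Dict[str, int], group: str) -> bool:
    """True if the SNP segregates in `group` and every other group is
    fixed for the ancestral allele (derived-allele count of zero)."""
    segregating = 0 < derived[group] < sizes[group]
    others_ancestral = all(derived[g] == 0 for g in derived if g != group)
    return segregating and others_ancestral


def fold_difference(f_test: float, f_wild: float) -> float:
    """(f_P(m) - f_PW(m)) / f_PW(m) for one mutation type m."""
    return (f_test - f_wild) / f_wild


def pearson_chi2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are two private SNP sets, columns are counts of
    mutation type m and of all other types. No continuity correction.
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


sizes = {"wild": 10, "landrace": 10, "improved": 10}  # toy sample sizes
print(is_private({"wild": 3, "landrace": 0, "improved": 0}, sizes, "wild"))
print(round(fold_difference(0.0345, 0.0299), 3))  # maize TCG->T, PL vs PW
print(round(pearson_chi2(10, 90, 20, 80), 3))     # toy counts
```

With the quoted maize frequencies, the fold difference for TCG→T in PL relative to PW is about 0.15, that is, roughly a 15% higher frequency in landrace-private SNPs.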
As shown in Additional file 1: Figure S21, in both species, the TCG→T and CCG→T mutations generally have higher frequencies among non-genic-private SNPs and pericentromeric-private SNPs. This overrepresentation of mutations related to solar-UV signature found in the private SNP sets, together with the enrichment of motifs related to solar-UV signature found in the common SNP sets, suggests that solar UV is potentially one of the major forces driving the [AT]-increase pattern during domestication.

Overrepresentation of genes repairing UV-damaged DNA near loci associated with genome divergence

Using genome-wide association studies (GWAS), a previous study in humans found an enrichment of DNA repair genes surrounding loci associated with genome divergence captured by base composition across polymorphic sites [5]. The enrichment of solar-UV signature mutations in domesticated accessions suggests that solar-UV radiation plays an important role in driving the [AT]-increase pattern. Plant genomes encode a complex system to monitor and repair DNA damage. We assessed whether genes involved in UV damage repair pathways are enriched near loci associated with genome divergence for [AT]. Using the [AT] values obtained from the common SNP sets as a genome phenotype, GWAS identified a series of loci significantly associated with base composition across polymorphic sites (Additional file 1: Figure S22). Based on sequence similarity to either rice or Arabidopsis genes [48], 334 maize and 107 soybean genes were compiled as related to UV-damaged DNA repair (UV-related genes hereafter). Proportion tests indicate that the UV-related genes were more likely to reside near GWAS signals than expected by chance (Additional file 1: Tables S1-S4). In maize, the 500-kb segments around significantly associated SNPs contain 4.2% of UV-related genes, but these regions encode only 1.8% of all annotated genes.
In soybean, the 500-kb segments around significantly associated SNPs contain 20.6% of UV-related genes, while only 13.8% of annotated genes are encoded in these regions. The tagged genes involved in all the steps of the global genome nucleotide excision repair (NER) pathway to repair UV damage are shown in Additional file 1: Figure S23.

We performed a detailed analysis of several UV-related genes located near significant GWAS SNPs (Fig. 6). A SNP located within maize ATR (Zm00001d014813) is significantly associated with base composition across polymorphic sites. ATR encodes a putative ATR protein which functions in a wide range of responses to DNA damage, including sensing and activating a cell cycle arrest in response to UV-B-caused DNA damage [19]. We found eight nonsynonymous variants located in ATR in this maize population. In soybean, a SNP located 11 kb downstream of Ligase1 (Glyma.11g193100, Lig1) on chromosome 11 is strongly associated with [AT] variation. Lig1 in soybean encodes a putative DNA ligase 1 protein which functions in sealing the nick of DNA at the last step of the repair process. Besides one nonsense and two nonsynonymous SNPs, we also detected a 1.8-kb deletion at the fifth intron in wild soybean accessions (Additional file 1: Figure S24). The soybean genome encodes two copies of Lig1, and we did not detect signals for Lig1 on chromosome 12. Both ATR and Lig1 are located within selective sweep regions identified in previous studies [30, 49], which suggests the possibility that polymorphisms within ATR and Lig1 went through the domestication bottleneck.

Fig. 5 Enrichment test of mutations related to solar-UV signature with population-private SNPs. a, b Compare the mutation frequency between landraces and wild accessions in maize and soybean, respectively, and the x coordinate of each point indicates the fold frequency difference (f_PL(m) − f_PW(m))/f_PW(m). c, d Compare the mutation frequency between improved cultivars and wild accessions in maize and soybean, respectively, and the x coordinate of each point indicates the fold frequency difference (f_PI(m) − f_PW(m))/f_PW(m). The y coordinate indicates Pearson's χ2 value that measures the significance of the difference between the two frequencies being compared, f_P1(m) and f_P2(m). Outlier points are labeled with the ancestral state of the mutant nucleotide flanked by two neighboring bases, and the color of the points indicates the ancestral and derived alleles of the mutant site. The purple rectangle highlights the mutations related to solar-UV signature. Here, TCG on the plot represents mutation 5′-TCG-3′→5′-TTG-3′ and its reverse complement 5′-CGA-3′→5′-CAA-3′, CCG represents mutation 5′-CCG-3′→5′-CTG-3′ and its reverse complement 5′-CGG-3′→5′-CAG-3′, and similarly for all the other dots on the plot

We then conducted haplotype network analysis of these two genes. There are two distinct clusters of haplotypes in both ATR and Lig1 (Fig. 6), one composed mostly of domesticated accession haplotypes and the other composed mostly of wild accession haplotypes. We refer to these clusters as the domesticated cluster haplotype (DCH) and the wild cluster haplotype. In ATR, DCH is present in > 98% of maize but < 18% of teosinte, while in Lig1, DCH is present in > 97% of domesticated soybean but < 5% of wild soybean. Intriguingly, the major haplotype (haplotype2) in both genes is shared by most of the domesticated accessions and a small number of wild accessions. Haplotype2 in ATR is shared among 86.7% of maize and 17.6% of teosinte, and haplotype2 in Lig1 is shared by 86.7% of domesticated soybean and 2% of wild soybean. Considering that domestication largely involved selection of favorable alleles from standing allelic variation in wild ancestors [1], it is likely that the major haplotypes for both ATR and Lig1 were present in the ancestral populations at low frequency, and that their frequencies increased rapidly during domestication.
Discussion
Our understanding of how plant genomes have changed following domestication bottlenecks remains limited. In this study, we aim to address the question from a novel angle by surveying the genome-wide base-composition pattern and its potential associated mechanisms. Focusing on a genome phenotype summarized from millions of polymorphic sites along the chromosome, we provide novel insights on genome evolution at different parts of the genome: genic versus non-genic, pericentromeric versus non-pericentromeric, and methylated versus unmethylated. This study also presents a first case where a few critical components in genome evolution are brought together: “base composition”, “mutation”, “UV radiation”, “DNA repair”, and “methylation”.

Fig. 6 UV-related DNA repair genes implicated by trait-associated SNPs (TASs) and haplotype demographic distributions. a ATR in maize is tagged by a TAS (PZE0561610418) on chromosome 5. b DNA ligase1 (Lig1) in soybean is tagged by a TAS (rs1126618459) on chromosome 11. The upper panel shows the box plot of base composition between accessions carrying different alleles at the TASs. The middle panel shows the regional Manhattan plot around the ATR and Lig1 loci (ATR and Lig1 are shown in red, others in blue). Dot size is proportional to the magnitude of significance for the SNP’s association with [AT] variation. Dot color indicates its LD with the TAS. The lower panel shows the haplotype networks inferred from 8 SNPs within the ATR gene and 16 SNPs within the Lig1 gene, respectively. Each circle represents one haplotype. Size of the circle is proportional to the number of accessions possessing the haplotype. Size of each colored slice within a circle is proportional to the number of accessions from the corresponding group possessing the haplotype
The [AT]-increase in domesticated over wild accessions is consistently observed with the overall genome-wide SNPs, SNPs within major genomic annotation sets, and SNPs from different genomic regions. These findings indicate the presence of common underlying mechanisms that drive the domesticated accessions to build their genomes with more A and T nucleotides. Demographic analyses have shown that plant and animal species experienced population size changes associated with domestication and range expansion [50–54]. The effective population size of maize has decreased strikingly from the onset of domestication (≈ 10,000 years ago) to the recent past (≈ 1100–2400 years ago) and increased during post-domestication expansion [50]. In contrast to maize, the wild parviglumis experienced an increase in effective population size which also lasts until the recent past (≈ 1100–1800 years ago) [50]. In plants, the increased mutational load has been observed in populations that undergo declines in effective population size [50, 55, 56]. Thus, one interpretation for our findings is that domesticated populations have historically lower effective population size, which results in stronger genetic drift and consequently leads to higher mutation numbers compared with their wild relatives. Meanwhile, our discovery of the overrepresentation of mutations related to a solar-UV signature in domesticated accessions indicated a varied mutation rate across populations. Therefore, an alternative interpretation is that alleles of UV damage repair genes have different repair efficiency (lower in domesticated accessions) and affect the number of de novo mutations in different lineages. Regarding the increased [AT] in domesticated accessions, one natural question to ask is: What is the consequence of building genomes with more A and T nucleotides? One possibility is more efficient energy usage.
Energy usage efficiency is a trait under universal selection that has shaped various genomic aspects. For example, highly expressed proteins use cheaper amino acids [57–60] and are generally shorter than lowly expressed ones [61, 62]. Synthesizing a G+C basepair requires a larger amount of energy and nitrogen than producing an A+T basepair [63]. Base stacking for G and C is more energetically expensive compared with that for A and T, as G binds to C with three hydrogen bonds while A binds to T with two hydrogen bonds [64]. Therefore, it may be interesting to ask whether domesticated accessions build their genomes with more A and T so that more energy is saved for other biological processes toward better yield potential. Recent studies have shown the high heterogeneity of mutation rate across genomic regions [65–67]. Our survey discovered the enrichment of motifs related to solar-UV signature surrounding SNPs, especially for SNPs located in non-genic and pericentromeric regions, which suggests solar-UV radiation is likely one of the major contributors for plant genome divergence. In general, DNA methylation level of non-genic regions is higher than that of genic regions, and pericentromeric regions higher than non-pericentromeric regions [13, 28]. Higher methylation levels in non-genic and pericentromeric regions potentially provide a greater amount of base materials for solar UV-induced C→T transition at the 5′-Py-mCG-3′ context, which is also supported by our findings of higher frequencies of motifs related to the solar-UV signature from methylated regions than unmethylated regions. DNA methylation is highly enriched within transposable elements and repetitive sequences [12, 13, 28]. 
Thus, this interesting connection between DNA methylation and solar UV-induced mutation propels us to ask a critical question: Is the frequent transition of methylated C to T actually a cost that genomes have to pay for having transposons and repetitive sequences methylated? Compared with chromosome arms, pericentromeric regions are highly enriched with repetitive sequences and transposable elements and generally have higher methylation levels, lower gene density, and lower recombination rates [13, 28, 30, 33–36]. In this study, we observed associations between [AT]-difference and methylation level, transposable elements, and recombination rate. A previous study illustrated that DNA transposon activity is associated with an increased number of mutations in the sequences close to the transposon [68]. This suggests that enriched transposable elements at pericentromeric regions may contribute to the increased accumulation of mutations within these regions. In sexual organisms, non-recombining regions of a genome were found to be subjected to Muller’s ratchet [69–72], and regions with active recombination are more efficient in purging deleterious mutations [39]. This may also partially explain the findings of enriched mutations related to solar-UV signature and enlarged [AT]-difference in the pericentromeric regions. Solar UV primarily induces C→T base transition at the 5′-PymCG-3′ sequence context [20, 40, 73], and CG methylation can enhance solar-UV-induced mutation at 5′-PymCG-3′ sites [25]. However, a few questions still need to be addressed to understand the increased rate of mutations related to a solar-UV signature in domesticated accessions. The first question is how DNA methylation varies across populations, as variation in DNA methylation level may lead to the observed difference in the rate of mutations related to solar-UV signature between domesticated and wild groups.
A recent study on 51 diverse maize inbred lines identified 172 maize-teosinte differentially methylated regions (DMRs), which are biased toward more examples of higher methylation levels in teosinte than maize [74]. Because those DMRs only represent a very small portion of the genome and the majority of the methylated regions are conserved within maize, the identified DMRs should not be a major contributor to the observed difference in the rate of mutations related to solar-UV signature between the two groups. The other question is how UV could induce germline mutations, as germline cells are generally shielded from direct solar radiation. The damaging effects of solar UV are often limited to the epidermis cells due to low UV-B penetration into plant tissues through the flavonoid layer [75, 76]. However, some evidence suggests that UV-B may penetrate into meristematic tissues, as increased genome instability in the plant germline has been observed even with low UV-B radiation [77]. In addition, plant germline cells divide several times during the vegetative growth stage and separate into sex-specific lineages only during late flower development [78]. Thus, we suspect that mutations induced by solar UV during vegetative growth in cells of the apical meristem may be inherited by the progeny. Using a phenotype summarized from millions of SNPs, we identified a set of UV-related genes near signals associated with genome divergence. We speculate that at some point before domestication, during gametogenesis, spontaneous mutations randomly took place within a UV-related gene. The gene with altered sequence may have a mild difference in terms of locating or repairing DNA errors [79]. Therefore, the lineages in which mutations in UV-related genes were segregating began to accumulate systematic differences in DNA repair, which contributed to the genome divergence patterns captured by base composition.
In the mutation accumulation experiments, once an Escherichia coli lineage acquired 1 bp insertion in mutT gene at the 26,500th generation, the later generations from this lineage began to show greatly elevated mutation rates and bias toward substitution type from A to C than the progenies from other lineages [73]. The recent study that compared the accumulated mutations after 20 generations between wild-type and DNA repair-deficient mice suggested different patterns in rate and direction between 2 lineages [80]. A similar phenomenon has been observed for somatic mutations in cancer cell. The substitution type and rate vary for patients with different variations in DNA repair genes [81]. The varied mutation rate has been reported in natural populations at the genome level [82], the family level [83], and the subpopulation level [6]. These findings suggested the hypothesis that polymorphisms within UV-related genes played a role in different DNA repair efficiency, which in turn affected the mutation rate differently in different lineages. Initiation of domestication typically involved a set of key genes controlling for domestication syndrome, a set of traits differentiating wild and domesticated accessions. The causal polymorphisms underlying the domestication syndrome are sought to be the direct targets of artificial selection [1, 3, 4]. Although the UV-related genes were detected through a genome phenotype clearly separated between domesticated and wild accessions, we speculate that these genes were probably not the direct targets because these polymorphisms were less likely to lead to visible agronomic traits that human ancestors desired. The observation that wild and domesticated accessions share the same haplotype for ATR and Lig1 suggested that the polymorphisms in these two genes more likely emerge earlier than the onset of domestication. 
The consequence of changing these UV-related genes probably promoted the occurrence of desired traits, which was subject to the direct selection. The identified UV-related genes indicate almost every step in the NER pathway contributes to the overall [AT]-increase (Additional file 1: Figure S23), suggesting the complexity of molecular mechanisms. Molecular experiments need to be carried out to provide evidence supporting the function of these UV-related genes and their connection to the base-composition pattern. Although it is beyond the scope of this study to address the functional difference between wild and domesticated alleles and the molecular mechanisms affecting the repair efficiency, this study pointed to a new direction for addressing some fundamental questions about the genome itself. We think that mutation repair genes, like ATR and Lig1, harboring significant changes such as altered gene structure, should be the next priority for study to provide molecular evidence. Induced mutation accumulation experiments with UV as the mutagen and near-isogenic lines (NILs) segregating only at the regions surrounding mutation repair genes as starting materials will be preferable to demonstrate the connection between UV-induced mutation and base composition change. Sequencing lines derived from starting materials carrying mutations at UV-damaged DNA repair gene regions may also provide additional support.

Conclusions
Base-composition difference between domesticated accessions and wild accessions at the dynamic part of the genome suggests the important role of AT-biased mutation in shaping the overall pattern of base-composition variation. Regional variations of base-composition pattern indicate that non-genic SNPs and pericentromeric regions have greater contributions to the observed pattern.
This finding together with the discovery of solar UV’s potential role in driving the genome divergence establishes the connection between DNA methylation and base-composition variation. By focusing on the evolutionary outcome, our genome scans in maize and soybean identified a set of UV damage repair genes. Rapidly improved genomics and epigenomics capacity would further facilitate our efforts to probe potential connections among base composition, mutation, methylation, DNA repair, and genome evolution.

Methods
Sequence information and SNP extraction
In maize, the original SNP set with the B73 genome (AGPv2) as reference was obtained from 103 maize genomes of Maize HapMap2 (19 wild accessions, 23 landraces, and 61 improved cultivars) [29]. Three lines, 2 wild accessions and 1 improved cultivar, were removed due to low sequence coverage and a small number of SNPs. In soybean, the original SNP set with the Williams 82 genome (version 1.1) as reference was obtained from 302 soybean genomes (62 wild accessions, 130 landraces, and 110 improved cultivars) [30]. Information for maize and soybean accessions is provided in Additional file 1: Table S5 and Table S6, respectively. With CrossMap v0.2.5 [84], genome coordinates of the original SNP sets in B73 AGPv2 and Williams 82 version 1.1 were converted to those in B73 AGPv4 and Williams 82 version 2.0, respectively. In maize, the assembly chain file for CrossMap is available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/assembly_chain/zea_mays/AGPv2_to_AGPv4.chain.gz. In soybean, the assembly chain file is available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/assembly_chain/glycine_max/V1.0_to_Glycine_max_v2.0.chain.gz. Then, for each species, we obtained 2 sets of SNPs (common SNP set and population-private SNP set) from the original SNP sets by applying different filtering criteria (Additional file 1: Figure S1).
The common SNP sets containing 8,852,678 SNPs in maize and 4,870,265 in soybean are obtained by filtering with a MAF threshold of 5% and a missing rate threshold of 20%. These common SNP sets are used for all analyses except population-private SNP analysis. For population-private SNP sets, we followed the procedure laid out in a previous study [6] to obtain 2,651,790 population-private SNPs in maize and 681,791 population-private SNPs in soybean. The private SNP sets are different from the common SNP sets with a small overlap. Ancestral state of the maize allele was inferred based on the allele of Tripsacum [49]. To infer the ancestral state of the soybean allele, BLASTN [85] (version 2.2.28+) was used to identify the orthologous regions between soybean and Medicago truncatula. Each SNP and its 58 bases flanking sequences were extracted from the soybean genome then blasted to the Medicago truncatula genome sequence [86] with an e value <1e−1 and only the best hit was considered. A SNP is considered as population private if it is segregating in 1 group but fixed ancestral allele in other groups. Based on this definition, we obtained 1,137,732 private wild SNPs (PW) that are segregating in the wild group but fixed ancestral allele in the landrace and improved cultivar groups; 1,514,058 private domesticated SNPs (PD) that are segregating in either the landrace or improved group but fixed ancestral allele in the wild group; 270,390 private landrace SNPs (PL) that are segregating in the landrace group but fixed ancestral allele in the wild and improved cultivar groups; and 537,259 private improved cultivar SNPs (PI) that are segregating in the improved cultivar group but fixed ancestral allele in the wild and landrace groups. In soybean, we obtained 571,756 PW, 110,035 PD, 20,543 PL, and 1798 PI. 
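The population-private definition above (segregating in one group, fixed for the ancestral allele in the others) can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used in-house Perl and R scripts; the function names, the per-group allele lists, and the use of "N" for a missing call are assumptions made for illustration.

```python
from typing import Dict, List, Optional

MISSING = "N"  # assumed code for a missing genotype call


def is_segregating(alleles: List[str]) -> bool:
    """True if at least two distinct called alleles are present in the group."""
    return len({a for a in alleles if a != MISSING}) > 1


def is_fixed_ancestral(alleles: List[str], ancestral: str) -> bool:
    """True if every called allele in the group equals the ancestral allele."""
    called = [a for a in alleles if a != MISSING]
    return bool(called) and all(a == ancestral for a in called)


def private_group(genotypes: Dict[str, List[str]], ancestral: str) -> Optional[str]:
    """Return the single group in which the SNP segregates while all other
    groups are fixed for the ancestral allele; return None otherwise."""
    segregating = [g for g, alleles in genotypes.items() if is_segregating(alleles)]
    if len(segregating) != 1:
        return None
    others = (g for g in genotypes if g != segregating[0])
    if all(is_fixed_ancestral(genotypes[g], ancestral) for g in others):
        return segregating[0]
    return None


# Hypothetical site: private to the wild group (a PW SNP).
site = {"wild": ["C", "T", "C"], "landrace": ["C", "C"], "improved": ["C", "C", "N"]}
print(private_group(site, ancestral="C"))
```

For the private domesticated set (PD), landrace and improved calls would be pooled into one "domesticated" group before calling `private_group`, since PD SNPs may segregate in either of them. The reported totals are consistent with PW + PD: 1,137,732 + 1,514,058 = 2,651,790 in maize and 571,756 + 110,035 = 681,791 in soybean.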
The total numbers of SNPs (2,651,790 in maize and 681,791 in soybean) in private SNP sets are obtained by summing up PW and PD because there are no overlapping SNPs between the two population-private SNP sets by definition. For maize, all analyses were done using maize B73 genome (version AGPv4) as references. For soybean, all analyses were done using soybean Williams 82 genome (version 2.0) as references. Medicago truncatula genome sequence (version Mt4.0) was downloaded from Phytozome. Short reads from representative soybean accessions were downloaded from GenBank. Bioinformatics DNA reads were mapped to the soybean reference genome by BWA with the BWA-MEM algorithm [87]. R packages Rsamtools [88] and GenomeGraphs [89] were used to analyze and display the sequence coverage in candidate genes. The missing genotypes in candidate genes were imputed by fastPhase under the context including up- and downstream 20 kb regions [90]. R package pegas was used to reconstruct the haplotype networks with SNPs detected in the genes [91]. All the other analyses are done with in-house scripts written in Perl or R. Base-composition across genome-wide SNP sites was calculated as described in a previous study [5]. Because of PR2, i.e., nucleotide A content ([A]) from SNP sites is roughly equals to [T] ([A] ≈ [T]) and [C] ≈ [G] [5], the value of [AT] was used in this study. Base-composition distribution among substitution types Bi-allelic SNPs can be grouped into 6 substitution types (A/C, A/G, A/T, C/G, C/T, and G/T) without a defined ancestral allele. For example, if C and T alleles are detected in 1 SNP site, which might arise either from C to T change or from T to C change, it is a C/T substitution type. For each substitution type, the total number of each nucleotide type possessed by each accession was counted and divided by the total number of polymorphic sites (8.9 million in maize and 4.9 million in soybean for the accession without missing calls). 
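As a rough illustration of the calculation described above, the following Python sketch computes [AT] for one accession and splits nucleotide counts by substitution type. It assumes one called allele per SNP site for an inbred accession and "N" for missing calls; the function names and data layout are hypothetical and do not reproduce the authors' scripts.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

BASES = ("A", "C", "G", "T")


def base_composition(calls: Sequence[str]) -> Dict[str, float]:
    """Fraction of A, C, G, and T among the called polymorphic sites."""
    counts = Counter(b for b in calls if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called sites")
    return {b: counts[b] / total for b in BASES}


def at_value(calls: Sequence[str]) -> float:
    """[AT] = [A] + [T] across called polymorphic sites."""
    comp = base_composition(calls)
    return comp["A"] + comp["T"]


def composition_by_substitution_type(
    calls: Sequence[str], site_alleles: Sequence[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """For each substitution type (e.g. 'C/T'), count each nucleotide the
    accession carries and divide by the total number of called sites."""
    total = sum(1 for b in calls if b in BASES)
    per_type: Dict[str, Counter] = {}
    for call, alleles in zip(calls, site_alleles):
        if call not in BASES:
            continue  # skip missing calls
        key = "/".join(sorted(alleles))  # unordered: T/C and C/T are the same type
        per_type.setdefault(key, Counter())[call] += 1
    return {k: {b: c[b] / total for b in BASES} for k, c in per_type.items()}


# Hypothetical accession with five SNP sites, one of them missing.
calls = ["A", "T", "C", "N", "G"]
site_alleles = [("A", "G"), ("T", "C"), ("C", "T"), ("A", "C"), ("G", "A")]
print(at_value(calls))
print(composition_by_substitution_type(calls, site_alleles))
```

Because every value is divided by the same total, the per-type fractions sum to 1 across all types and bases, so each substitution type's contribution to [AT] can be read directly.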
Base-composition distribution at different genomic regions
SNP effects were predicted with SnpEff v4.3 [92]. In maize, we built the database with reference genome sequences available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/zea_mays/dna/Zea_mays.AGPv4.dna.toplevel.fa.gz and gene annotation available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/gff3/zea_mays/Zea_mays.AGPv4.39.chr.gff3.gz. In soybean, we built the database with reference genome sequences available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/fasta/glycine_max/dna/Glycine_max.Glycine_max_v2.0.dna.toplevel.fa.gz and gene annotation available at ftp://ftp.ensemblgenomes.org/pub/plants/release-39/gff3/glycine_max/Glycine_max.Glycine_max_v2.0.39.chr.gff3.gz. Seven genomic annotation sets (intergenic, gene-proximal, UTRs, synonymous, missense, intronic, and other genic) were obtained by classifying SNPs based on the predicted SNP effect. SNPs were classified as gene-proximal if they fell within 5 kb upstream of the transcription start site. Then, the intergenic set together with the gene-proximal set is considered as the non-genic SNP set, and the remaining five SNP sets are considered to be the genic SNP set. After that, base-composition across polymorphic sites was calculated for the genic SNP set and the non-genic SNP set separately. The physical positions of the maize centromeres in the genome (version AGPv4) were obtained from a previous study [93]. Then, a 40-Mb segment directly adjacent upstream and downstream of the centromeric region was considered as the pericentromeric region based on a previous study [33]. The physical coordinates for soybean centromeric and pericentromeric regions were obtained from [34] and the Soybean Genome Browser at SoyBase https://soybase.org/gb2/gbrowse/gmax2.0/.
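The two partitions described above (genic versus non-genic, and pericentromeric versus non-pericentromeric in maize) amount to simple lookups. The Python sketch below is illustrative only: the annotation labels are taken from the text, but the function names and the example centromere coordinates are hypothetical, and treating the centromere itself as a separate class is an assumption rather than something the text specifies.

```python
NON_GENIC = {"intergenic", "gene-proximal"}
GENIC = {"UTRs", "synonymous", "missense", "intronic", "other genic"}

PERICENTROMERIC_FLANK = 40_000_000  # 40 Mb on each side of the maize centromere


def genic_class(annotation_set: str) -> str:
    """Map one of the seven annotation sets to 'genic' or 'non-genic'."""
    if annotation_set in NON_GENIC:
        return "non-genic"
    if annotation_set in GENIC:
        return "genic"
    raise ValueError(f"unknown annotation set: {annotation_set}")


def region_class(pos: int, cen_start: int, cen_end: int,
                 flank: int = PERICENTROMERIC_FLANK) -> str:
    """Classify a position relative to a centromere interval [cen_start, cen_end]."""
    if cen_start <= pos <= cen_end:
        return "centromeric"  # assumption: reported separately from the flanks
    if cen_start - flank <= pos < cen_start or cen_end < pos <= cen_end + flank:
        return "pericentromeric"
    return "non-pericentromeric"


# Hypothetical centromere at 100-102 Mb on some chromosome.
print(genic_class("gene-proximal"))
print(region_class(75_000_000, 100_000_000, 102_000_000))
print(region_class(20_000_000, 100_000_000, 102_000_000))
```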
To analyze the base-composition distribution along chromosomes, we calculated the [AT] for each accession with a moving average approach of a 5-Mb window size and a 4-Mb step size on each of the maize and soybean chromosomes with both genic and non-genic SNPs. Indeed, we examined the [AT] distribution with a series of window size including 1 Mb, 2 Mb, 5 Mb, and 10 MB. The patterns for all of those window sizes are similar. We decided to go with the 5 Mb for the analyses because it contains a good amount of SNPs in each window and the line of [AT] distribution is smoother than the smaller window size. The position of crossovers (COs) in maize was referred from [39]. Then, [AT]-difference and crossover (CO) rate were calculated using a 5-Mb sliding window. Recombination rate data in soybean was referred from [30]. [AT]-difference and recombination rate were calculated using a 1-Mb window. The correlation was calculated between [AT]-difference and CO rate or recombination rate for each chromosome. Transposable element (TE) regions in maize and soybean are referred from [93, 94]. Then, base-composition across polymorphic sites was calculated for SNPs within TE regions and non-TE regions separately. Selective sweep regions in maize and soybean are referred from [29, 30]. Then, base-composition across polymorphic sites was calculated for SNPs within selective sweep and non-selective sweep regions separately. The maize methylation data was generated from whole-genome bisulfite sequencing (WGBS) of the leaf tissue of maize B73 seedling [42]. Genome coordinates of B73 methylation data in AGPv2 were converted to that in AGPv4 with the CrossMap v0.2.5 [84]. Then, the maize genome was separated into methylated and unmethylated regions based on whether the percentage of CG methylation within each 100 bp non-overlapping window is greater than 40% or not. The soybean methylation data was generated from WGBS of the leaf of soybean Williams 82 [43] and GsojaD [44]. 
MethylC-seq reads of GsojaD were first mapped to its own genome assembly to obtain methylation calls. Then, the genome coordinates of GsojaD methylation were converted to the coordinates in Williams 82 genome version 2.0. Genome coordinates of Williams 82 methylation data in Williams 82 version 1.1 were converted to those in version 2.0 with CrossMap v0.2.5 [84]. Then, the soybean genome was separated into the methylated and unmethylated regions based on CG methylation sites that are common to both Williams 82 and GsojaD.

Motif enrichment analysis
For each SNP site, the directly adjacent upstream and downstream bases were extracted from reference genomes; meanwhile, the adjacent sequences of 1 randomly selected site from the 1 kb flanking region were also extracted. For each of the 96 possible tri-nucleotide motifs (5′-NXN-3′, where X is the polymorphic site or randomly selected site), an empirical threshold at the 95th percentile was drawn from 100 random sample scenarios. A motif is considered as enriched if the ratio of its frequency at SNP sites over the 95th percentile at random sites is greater than 1.

Population-private SNP analysis
We used the procedure laid out in a previous study [6] to test the mutation spectrum differences between populations with population-private SNPs. SNPs within each private SNP set were partitioned into 96 mutation types by considering the bases immediately upstream and downstream of the variable site [47]. For each private SNP set P, the count CP(m) of type m mutations was obtained for each mutation type $m = B_{5'} B_{A} B_{3'} \rightarrow B_{5'} B_{D} B_{3'}$, in which the ancestral base $B_{A}$, flanked by its 5′ and 3′ neighbors, mutates to the derived base $B_{D}$. Then, with a χ2 test, the mutation frequencies fPI(m) and fPL(m) were compared with fPW(m). For the χ2 test, we used the χ2 value instead of the P value to indicate the significance of the difference because a P value cannot be obtained for very large χ2 values in our data.
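The population-private comparison above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes mutations are folded onto the strand where the ancestral base is A or C (which gives 96 types and matches the pairing of 5′-TCG-3′→5′-TTG-3′ with 5′-CGA-3′→5′-CAA-3′ in the Fig. 5 legend), and it assumes Pearson's χ2 is computed on a 2×2 table of type m versus all other types in the two sets. All function names are hypothetical.

```python
from typing import Dict, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def mutation_type(five: str, anc: str, der: str, three: str) -> str:
    """Label such as 'TCG>T', folded so the ancestral base is A or C."""
    context = five + anc + three
    if anc in ("G", "T"):
        context = revcomp(context)
        der = COMPLEMENT[der]
    return f"{context}>{der}"


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def compare_spectra(counts1: Dict[str, int],
                    counts2: Dict[str, int]) -> Dict[str, Tuple[float, float]]:
    """For each type m, return (fold difference (f1 - f2) / f2, chi-square value)."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    result = {}
    for m in set(counts1) | set(counts2):
        c1, c2 = counts1.get(m, 0), counts2.get(m, 0)
        f1, f2 = c1 / n1, c2 / n2
        fold = (f1 - f2) / f2 if f2 > 0 else float("inf")
        result[m] = (fold, chi2_2x2(c1, n1 - c1, c2, n2 - c2))
    return result


# Both strands of the same event map to one label.
print(mutation_type("T", "C", "T", "G"), mutation_type("C", "G", "A", "A"))
# Hypothetical counts for a domesticated set versus the wild set.
print(compare_spectra({"TCG>T": 30, "ACA>G": 70}, {"TCG>T": 20, "ACA>G": 80}))
```

Reporting the χ2 statistic itself, as the text does, avoids the numerical underflow that occurs when a P value is computed from a very large χ2 value.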
To assess the variance of f(TCG → T) and f(CCG → T), private SNP sets PL, PI, and PW in maize and PD and PW in soybean were partitioned into non-overlapping bins of 1000 consecutive SNPs. Then, f(TCG → T) and f(CCG → T) for each bin were calculated.

GWAS for base composition in maize and soybean
Following our earlier study in human [5], [AT] values across 8,852,678 maize SNPs and 4,870,265 soybean SNPs were used as the genome phenotype for GWAS. In the genome scan for both maize and soybean, a mixed linear model (MLM) with both fixed covariates and a random kinship matrix was used to detect SNPs associated with the base-composition variation [95, 96] in GAPIT version 3.35 [97]. Parameters in the MLM were determined by a model selection process [95, 96]. Five principal components (PC2-PC6) were selected in maize, and 0 PCs were selected in soybean. PC1 was not included in the model selection process because of its near-perfect correlation with [AT] [5]. The significance threshold P value was determined by Bonferroni correction. The 334 maize genes and 107 soybean genes associated with repairing UV-damaged DNA were compiled based on sequence similarity to either rice genes or Arabidopsis genes [48]. We conducted an enrichment test of UV-related genes with a series of window sizes centered on significantly associated SNPs as described in a previous study [31]. The proportion of UV-related genes within each window was compared with its genome-wide proportion. A gene was counted when it was tagged by at least 2 significantly associated SNPs. Then, we tested whether the proportion of UV-related genes within the window is significantly higher than that across the whole genome using a proportion test. Window sizes smaller than 500 kb in maize and 200 kb in soybean were not tested because their numbers of tagged UV-related genes were less than 10, which violated the condition of the proportion test.
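The enrichment test above can be approximated by a one-sided, one-sample proportion test. The Python sketch below uses a normal approximation and is an assumption about the form of the test, since the text says only "a proportion test"; it is not GAPIT or R code. The count of 22 tagged soybean genes is back-calculated here from the reported 20.6% of 107 UV-related genes and is not stated in the text, and the 0.05 family-wise level in the Bonferroni helper is likewise assumed.

```python
import math
from typing import Tuple


def one_sided_proportion_test(hits: int, n: int, p0: float) -> Tuple[float, float]:
    """z statistic and upper-tail P value for H1: true proportion > p0
    (normal approximation, no continuity correction)."""
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("need n > 0 and 0 < p0 < 1")
    p_hat = hits / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z) for standard normal Z
    return z, p_value


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Soybean, 500-kb windows: assumed 22 of 107 UV-related genes tagged (about 20.6%),
# compared with 13.8% of all annotated genes falling in the same windows.
z, p = one_sided_proportion_test(hits=22, n=107, p0=0.138)
print(z, p)

# Assumed family-wise alpha of 0.05 over the maize common SNP set.
print(bonferroni_threshold(0.05, 8_852_678))
```

The normal approximation is unreliable when few genes are tagged, which is in line with the text's decision not to test window sizes that tag fewer than 10 UV-related genes.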
Abbreviations
CO: Crossover; CPDs: Cyclobutane pyrimidine dimers; DCH: Domesticated cluster haplotype; DMRs: Differentially methylated regions; GWAS: Genome-wide association studies; MAF: Minor allele frequency; NER: Nucleotide excision repair; PD: Private domesticated SNPs; PI: Private improved cultivar SNPs; PL: Private landrace SNPs; PW: Private wild SNPs; TE: Transposable element; UV: Ultraviolet; WGBS: Whole-genome bisulfite sequencing

Acknowledgements
We thank our colleagues Dr. Tingting Guo, Dr. Adam Vanous, Matthew Dzievit, James McNellie, Qi Mu, Jialu Wei, and Laura Tibbs for their useful suggestions about the research. We are also grateful for the critical comments from anonymous reviewers.

Funding
This work was supported by the National Science Foundation grant IOS1238142, by the Iowa State University Raymond F. Baker Center for Plant Breeding, and by the Iowa State University Plant Science Institute.

Availability of data and materials
The original datasets analyzed in the current study were reported previously [29, 30], and can be downloaded from http://www.panzea.org/genotypes and http://figshare.com/articles/Soybean_resequencing_project/1176133. The lists of accessions used in the current study were provided in Additional file 1: Table S5 and Table S6. The pipeline and custom scripts utilized in this paper are documented in the following Zenodo repository (DOI: https://doi.org/10.5281/zenodo.2566552 [98]).

Authors’ contributions
JY, XL, and JW designed the study. JW, XL, KKD, MJS, SAJ, and NMS conducted the analyses. JW, XL, and JY wrote the manuscript with inputs from all authors. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable

Additional file
Additional file 1: Figure S1. Diagram of SNP filtering process. Figure S2. Frequency of SNP substitution types. Figure S3. Base-composition distribution for randomly sampled genic and non-genic SNPs. Figure S4. Comparison of base-composition distribution between different regions of the genome. Figure S5. Distribution of MAF for genome-wide genic and non-genic SNPs. Figure S6-S8. Base-composition distribution for genic and non-genic SNPs across chromosomes. Figure S9. Base-composition distribution at TE and non-TE regions. Figure S10. Base-composition distribution between domesticated and wild accessions and crossover rate for maize chromosomes. Figure S11-S12. Base-composition distribution between domesticated and wild accessions and recombination rate for soybean chromosomes. Figure S13-S15. Distribution of MAF calculated with genic and non-genic SNPs across chromosomes. Figure S16. Base-composition distribution at selective sweep and non-selective-sweep regions. Figure S17. Frequencies of motifs related to solar-UV signature among genic and non-genic SNPs conditional on methylated and unmethylated regions. Figure S18. Frequencies of motifs related to solar-UV signature among SNPs from pericentromeric and non-pericentromeric regions under methylated and unmethylated conditions. Figure S19. Enrichment test of mutations related to solar-UV signature with population-private SNPs. Figure S20. Distribution of f(TCG) and f(CCG) across population-private SNPs. Figure S21. Distribution of f(TCG) and f(CCG) at different genomic regions. Figure S22. GWAS-identified genomic regions underlying base-composition variation. Figure S23. GWAS tagged genes in NER pathway. Figure S24. Polymorphisms in soybean DNA ligase1. Table S1. UV-related genes are enriched near the associated loci in maize. Table S2. UV-related genes tagged by the associated SNPs in maize. Table S3. UV-related genes are enriched near the associated loci in soybean. Table S4. UV-related genes tagged by the associated SNPs in soybean. Table S5. Summary of 100 maize accessions. Table S6. Summary of 302 soybean accessions.
(PDF 3092 kb)

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Department of Agronomy, Iowa State University, Ames, IA 50011, USA. 2 Center for Applied Genetic Technologies, University of Georgia, Athens, GA 30602, USA. 3 Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA. 4 Department of Plant and Microbial Biology, University of Minnesota, St. Paul, MN 55108, USA.

Received: 2 October 2018 Accepted: 28 March 2019
