
Clock-like mutational processes in human somatic cells

Nature Genetics 47 (12), 1402–1407, 2015
During the course of a lifetime, somatic cells acquire mutations. Different mutational processes may contribute to the mutations accumulated in a cell, with each imprinting a mutational signature on the cell's genome. Some processes generate mutations throughout life at a constant rate in all individuals, and the number of mutations in a cell attributable to these processes will be proportional to the chronological age of the person. Using mutations from 10,250 cancer genomes across 36 cancer types, we investigated clock-like mutational processes that have been operating in normal human cells. Two mutational signatures show clock-like properties. Both exhibit different mutation rates in different tissues. However, their mutation rates are not correlated, indicating that the underlying processes are subject to different biological influences. For one signature, the rate of cell division may influence its mutation rate. This study provides the first survey of clock-like mutational processes operating in human somatic cells.

The mutational processes that generate somatic mutations in normal cells are not well understood, and quantification of their in vivo mutation rates is lacking for almost all human cell types. These metrics are likely to be fundamental to an understanding of cancer development and aging. Comprehensive investigation of in vivo somatic mutation rates will ultimately depend on accurate, single-cell whole-genome sequencing of normal somatic cells. However, all cancers are clonal cell populations expanded from single normal cells. To a first approximation, the catalog of somatic mutations shared by most members of a cancer cell population is the set that was present in the progenitor cell of the final dominant clonal expansion of the cancer.
This catalog provides information on the mutational processes to which the lineage of cells from the fertilized egg to that progenitor cell has been exposed1. Under a simple model, this lineage has three phases: embryonic and fetal development; postnatal life in normally functioning differentiated cells; and post-neoplastic transformation in cancer cells (Fig. 1).

In the time taken to establish this lineage, some mutational processes may have acted in an episodic manner, generating mutations in bursts over short time periods. Others may have operated continuously, in a clock-like manner, generating mutations at a steady rate. For such clock-like mutational processes, the number of mutations acquired during embryonic and fetal development will be similar in cancers of the same type from different individuals, as this phase is of a fixed duration. Conversely, the same process operating in normally functioning cells during postnatal life will result in a mutation load that is proportional to the age of the person at the time the cancer is sampled, with more mutations present in older individuals (Fig. 1). The number of mutations acquired after initiation of neoplastic change will be unrelated to age of diagnosis but will depend upon the duration of the period between the first cancer driver mutation and initiation of the final dominant clonal expansion and, potentially, also upon changes to the mutation rate contingent upon acquiring the neoplastic phenotype. The latter features may be highly variable within and between cancer types.

Under this simple model, mutations with clock-like features in cancer genomes predominantly derive from the normal postnatal part of the lineage. However, mutations from the developmental and/or neoplastic phases could obscure the clock-like features of these mutations and affect estimation of the mutation rate during the normal postnatal phase.
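
The three-phase model above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the authors' simulation: the function names, the age range, the fixed embryonic load and the uniformly distributed neoplastic load are all hypothetical choices made only to show how a clock-like rate can be recovered as the slope of mutations against age.

```python
import random


def simulate_cohort(n, rate_per_year, embryonic_load, max_neoplastic, seed=0):
    """Return (ages, mutation_counts) for n simulated cancers (toy model)."""
    rng = random.Random(seed)
    ages, counts = [], []
    for _ in range(n):
        age = rng.uniform(20, 90)                    # assumed age at diagnosis, years
        postnatal = rate_per_year * age              # clock-like: proportional to age
        neoplastic = rng.uniform(0, max_neoplastic)  # assumed unrelated to age
        ages.append(age)
        counts.append(embryonic_load + postnatal + neoplastic)
    return ages, counts


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Modest developmental and neoplastic loads: the slope tracks the true rate.
ages, counts = simulate_cohort(n=500, rate_per_year=20.0,
                               embryonic_load=100.0, max_neoplastic=400.0)
estimated_rate = ols_slope(ages, counts)
print(f"true rate 20.0, estimated {estimated_rate:.2f} mutations/year")

# Neoplastic load dominating the total: the same estimator becomes unreliable.
noisy_ages, noisy_counts = simulate_cohort(n=500, rate_per_year=20.0,
                                           embryonic_load=100.0,
                                           max_neoplastic=40000.0, seed=1)
print(f"dominant neoplastic load: estimated {ols_slope(noisy_ages, noisy_counts):.2f}")
```

In this sketch the fixed embryonic load only shifts the intercept, and an age-independent neoplastic load adds scatter around the line, so the slope stays close to the postnatal rate until that scatter dominates the total.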
To evaluate this possibility, we performed simulations that showed that clock-like mutational processes can be detected and that the mutation rates estimated are relatively unaffected by mutations from other phases, unless the mutations generated during the developmental and/or neoplastic phases constitute the large majority of the total number of mutations in the cancer. Therefore, analysis of the several thousand cancer genomes thus far sequenced can provide a first survey of the clock-like mutational processes operating in a wide range of normal human cell types.

Different mutational processes generate distinct combinations of mutation types in cancer genomes2,3. These characteristic imprints of mutational processes have been termed ‘mutational signatures’. We previously reported a mathematical approach and computational framework to extract mutational signatures from catalogs of somatic mutations in human cancers4–6. Using a 96-category classification of base substitutions based on the type of substitution and the bases immediately 5′ and 3′ to the mutated base, we identified 21 mutational signatures operating over 30 cancer types4. Among these signatures, the numbers of mutations associated with signature 1 correlated with age of cancer diagnosis for some cancer types4.

Ludmil B Alexandrov1–3, Philip H Jones1,4, David C Wedge1, Julian E Sale5, Peter J Campbell1,6, Serena Nik-Zainal1,7 & Michael R Stratton1

1 Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, UK.
2 Theoretical Biology and Biophysics (T-6), Los Alamos National Laboratory, Los Alamos, New Mexico, USA.
3 Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, New Mexico, USA.
4 Medical Research Council (MRC) Cancer Unit, Hutchison/MRC Research Centre, University of Cambridge, Cambridge, UK.
5 MRC Laboratory of Molecular Biology, Cambridge, UK.
6 Department of Haematology, University of Cambridge, Cambridge, UK.
7 Department of Medical Genetics, Addenbrooke’s Hospital National Health Service (NHS) Trust, Cambridge, UK.
Correspondence should be addressed to L.B.A. (lba@lanl.gov) or M.R.S. (mrs@sanger.ac.uk).
Received 19 April; accepted 14 October; published online 9 November 2015; doi:10.1038/ng.3441
© 2015 Nature America, Inc. All rights reserved.
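
The 96-category classification mentioned in the introduction follows from six substitution classes combined with four possible 5′ bases and four possible 3′ bases (6 × 4 × 4 = 96). A short sketch of that enumeration is given here; the `A[C>T]G` string notation and the function name are illustrative assumptions, not something defined in the text.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine of the mutated base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def mutation_types():
    """List the 96 types as strings such as 'A[C>T]G' (5' base, change, 3' base)."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]


types_96 = mutation_types()

# C>T substitutions followed by a 3' G, that is, C>T changes in a CpG context.
cpg_c_to_t = [t for t in types_96 if t[2:5] == "C>T" and t.endswith("G")]

print(len(types_96))
print(cpg_c_to_t)
```

A mutational signature in this classification is then a vector of 96 probabilities, one per mutation type, that sums to 1.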

Our previous analysis extracted mutational signatures separately from each cancer type and then quantified the mutations contributed by these signatures to each case of that cancer type. Many mutational signatures are found in multiple different cancer types, and a central scientific question to address is how the contributions of such signatures compare across cancer types. However, a particular mutational signature found in multiple cancer types will be contaminated to differing extents by other signatures and by noise in each of the different cancer types. Hence, our previous approach did not allow accurate quantification of mutation rates for direct comparisons between cancer types. We have, therefore, reformulated the approach to derive a single consensus version of each signature, and we used these consensus signatures to estimate the number of mutations contributed to each cancer sample across all cancer types (Online Methods). Our refined approach was applied to a larger data set of 7,329,860 somatic mutations from 10,250 cancer genomes (Supplementary Data Sets 1 and 2) derived from diverse epithelial, mesenchymal, glial, hematopoietic and lymphoid cells that collectively constitute an extensive, albeit incomplete sampling of normal cell types in the human body. This analysis has then allowed us to estimate the contributions of mutations to individual cancer cases across cancer types and hence enabled comparison of the clock-like mutation rates that reflect mutations in normal tissues.

RESULTS

Applying our refined approach to 10,250 cancer samples resulted in delineation of the patterns of 33 distinct mutational signatures. We were able to perform validation for 29 of these 33 mutational signatures using our established methodology for validating mutational signatures4.
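
The attribution step in the refined approach, estimating how many mutations each consensus signature contributes to a sample, can be illustrated with a generic non-negative least-squares fit. The procedure actually used is described in the Online Methods, which are not reproduced here, so the sketch below should be read as one common way to pose the problem rather than as the authors' method. The multiplicative update rule, the function name and the three-category toy signatures are all assumptions made for illustration.

```python
def fit_exposures(signatures, catalog, iterations=500):
    """Estimate non-negative exposures h with sum_k h[k] * signatures[k] ~ catalog.

    signatures: list of K signatures, each a list of per-category probabilities.
    catalog: observed mutation counts per category for one sample.
    Multiplicative updates keep every exposure non-negative throughout.
    """
    k = len(signatures)
    n = len(catalog)
    h = [1.0] * k
    # Precompute W^T v and W^T W for the least-squares objective.
    wtv = [sum(signatures[a][i] * catalog[i] for i in range(n)) for a in range(k)]
    wtw = [[sum(signatures[a][i] * signatures[b][i] for i in range(n))
            for b in range(k)] for a in range(k)]
    for _ in range(iterations):
        wtwh = [sum(wtw[a][b] * h[b] for b in range(k)) for a in range(k)]
        h = [h[a] * wtv[a] / wtwh[a] if wtwh[a] > 0 else 0.0 for a in range(k)]
    return h


# Toy example with three mutation categories instead of 96 (hypothetical values).
sig_a = [0.7, 0.2, 0.1]
sig_b = [0.1, 0.3, 0.6]
catalog = [100 * a + 50 * b for a, b in zip(sig_a, sig_b)]

exposures = fit_exposures([sig_a, sig_b], catalog)
print([round(x, 2) for x in exposures])
```

Holding the signatures fixed and fitting only the exposures is what makes the per-sample mutation counts comparable across cancer types, since every sample is measured against the same consensus patterns.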
This new analysis confirmed the patterns of the 21 previously identified mutational signatures4, demonstrating the robustness of the computational approach. Additionally, examining this substantially larger data set allowed us to disentangle the patterns of another eight distinct mutational signatures. A curated list of the validated mutational signatures and the cancer types in which they are present can be found at our Catalogue of Somatic Mutations in Cancer (COSMIC) signatures website (see URLs). Note that signatures 25, 29 and 30 are not part of the analysis presented here because the relevant samples were either cancer cell lines or lacked information about age of diagnosis. Further, the list of mutational signatures on our website does not include signatures corresponding to sequencing artifacts and signatures for which validation has not been performed. We have, however, included these mutational signatures in the current analysis, and their patterns are shown in Supplementary Figure 1.

To identify mutational signatures showing clock-like behavior, we first combined the mutations and samples from all cancer types. Of the 33 signatures examined, signatures 1 and 5 showed a correlation between numbers of mutations and age of diagnosis, and, for both signatures, the numbers of mutations increased with age (signature 1, Spearman rank correlation = 0.34, false discovery rate (FDR) corrected for all 33 signatures (q value) = 4.7 × 10^−162; signature 5, Spearman rank correlation = 0.13, q value = 2.1 × 10^−46; combining the numbers of mutations attributed to signatures 1 and 5 resulted in Spearman rank correlation = 0.37 and P value = 8.2 × 10^−254).
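
The two statistics used in this screen can be sketched in a few lines. The Spearman rank correlation is the Pearson correlation of the ranks; for the FDR correction the text does not name a procedure, so the Benjamini-Hochberg method shown here is an assumption, as are the function names and the small hypothetical data set.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    q = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this P value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical ages at diagnosis and signature-attributed mutation counts.
ages = [35, 42, 50, 58, 63, 71]
mutations = [410, 380, 520, 610, 590, 800]
rho = spearman(ages, mutations)

# Hypothetical P values for four signatures tested together.
q_values = benjamini_hochberg([0.001, 0.02, 0.03, 0.5])
print(round(rho, 3), [round(q, 3) for q in q_values])
```

A rank-based correlation is a natural choice here because mutation counts vary over orders of magnitude between samples, and ranks are insensitive to that spread.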

Figure 1 A model for the accumulation of somatic mutations in cancers. (a) Cell lineages are shown leading from fertilized egg to cancer cell, in five different individuals with cancer; A, B, C, D and E. Orange, embryonic and fetal cell divisions; blue, postnatal divisions of normal cells; brown, cell divisions after neoplastic change. (b) Accumulation of somatic mutations due to clock-like and non-clock-like mutational signatures in the same five patients. The correlation between age and the number of somatic mutations due to a clock-like mutational process operating in normal postnatal cells is detectable using the mutations found in cancers, with the rate relatively unaffected if the number of mutations acquired during the embryonic and fetal phase and the neoplastic phase is limited. Note that this figure is provided as a simple illustration of the activity of clock-like mutational processes, and it is not intended to be a realistic representation of actual cancer samples. In reality, the numbers of cellular divisions will be dependent on the tissue type and the numbers of neoplastic mutations may be many folds of magnitude higher.
[Panel a: lineages of individual cancers A to E along a time axis. Panel b: observed somatic mutations (y axis) against biological age (x axis). Mutation classes shown: all signatures (all somatic mutations); clock-like signature, total; clock-like signature, postnatal normal; non-clock-like signatures; clock-like signature, neoplastic; clock-like signature, embryonic and fetal.]

Figure 2 Patterns of mutational signatures 1 and 5. The signatures are displayed according to the 96-substitution classification defined by substitution class and sequence context immediately 5′ and 3′ to the mutated base. The probability bars for the six substitution classes are displayed in different colors. Mutation types are shown on the x axes, and the y axes show the percentage of mutations in the signature attributed to each mutation type. Signatures are displayed on the basis of the trinucleotide frequencies of the whole human genome.
[Panels: signature 1 and signature 5. x axes: mutation types grouped as C>A, C>G, C>T, T>A, T>C and T>G. y axes: mutation type probability (%).]

No other mutational signature exhibited a statistically significant correlation
(q value < 0.05) between the number of mutations and age of cancer diagnosis. The total number of somatic mutations in each sample (Fig. 1) also exhibited a correlation with age of diagnosis across all samples (Spearman rank correlation = 0.37 and P value = 3.1 × 10^−215). However, after subtracting the numbers of mutations in signatures 1 and 5, which in aggregate only accounted for 23% of the total number of mutations, no correlation was found (P value = 0.21), indicating that the correlation for all mutations is predominantly explained by mutations belonging to signatures 1 and 5. C>T mutations at NpCpG trinucleotides (often termed CpG dinucleotides) constitute the major component of signature 1, and their numbers also showed correlation with age (P value = 1.0 × 10^−189) (Fig. 2). Subtracting the numbers of C>T mutations at NpCpG sites from the numbers attributed to signature 1 left a residual correlation with age of cancer diagnosis (P value = 1.4 × 10^−19), indicating that, in addition to C>T mutations at NpCpG sites, other components of this signature also behave in a clock-like manner.

Twenty-six of 36 cancer types individually showed correlations with age (P value < 0.05) for signature 1 and/or signature 5 mutations (Fig. 2, Table 1 and Supplementary Fig. 2). Mutations associated with signature 1 were correlated with age of diagnosis in 17 of the cancer types, and mutations associated with signature 5 were correlated with age of diagnosis in 12 of the cancer types. In three cancer types (breast cancer, low-grade glioma and glioblastoma), the mutational burdens of both signatures correlated with age of cancer diagnosis. Although some cancer types exhibited negative correlations, in all such cases the correlations were statistically not significant (Table 1). As in the analysis of all samples, no other mutational signature showed a correlation with age of diagnosis in any individual cancer type, although there was some correlation with the total number of mutations and the number of C>T mutations at NpCpG sites (Supplementary Data Sets 3–5).

Table 1 Rates of somatic substitution accumulation for clock-like mutational signatures

| Cancer type | Number of samples | Signature 1 slope | Signature 1 P value | Signature 5 slope | Signature 5 P value |
| --- | --- | --- | --- | --- | --- |
| Acute lymphoid leukemia (ALL) | 141 | 6.45 | 0.24 | 8.55 | 5.80 × 10^−4 |
| Acute myeloid leukemia (AML) | 202 | 0.80 | 0.77 | 2.89 | 0.02 |
| Adrenocortical carcinoma | 92 | 2.56 | 0.89 | 3.94 | 0.78 |
| Bladder cancer | 238 | 8.07 | 2.54 × 10^−3 | 11.87 | 0.82 |
| Brain, adult lower-grade glioma | 465 | 10.02 | 1.00 × 10^−5 | 12.70 | 7.00 × 10^−5 |
| Breast cancer | 1,170 | 3.71 | 1.00 × 10^−5 | 5.31 | 1.00 × 10^−5 |
| Cervical cancer | 198 | 14.14 | 3.70 × 10^−3 | 6.57 | 0.73 |
| Chronic lymphocytic leukemia (CLL) | 131 | −1.45 | 0.50 | 5.52 | 0.07 |
| Colorectal cancer | 559 | 23.43 | 1.00 × 10^−5 | −3.97 | 0.80 |
| Esophageal cancer | 329 | 19.66 | 9.00 × 10^−4 | −0.42 | 0.94 |
| Glioblastoma multiforme | 332 | 19.85 | 3.40 × 10^−4 | 3.44 | 0.70 |
| Head and neck cancer | 591 | 10.20 | 5.60 × 10^−4 | 2.20 | 0.97 |
| Kidney, chromophobe | 65 | 3.18 | 0.27 | 5.16 | 0.03 |
| Kidney, renal clear cell carcinoma | 468 | 0.26 | 0.93 | 22.75 | 1.00 × 10^−5 |
| Kidney, renal papillary cell carcinoma | 169 | −0.29 | 0.88 | 31.86 | 8.00 × 10^−5 |
| Liver cancer | 290 | −1.93 | 0.44 | 7.81 | 0.02 |
| Lung, adenocarcinoma | 795 | 6.30 | 0.03 | 0.00 | 1.00 |
| Lung, small cell carcinoma | 69 | 0.6 | 0.99 | 5.58 | 0.81 |
| Lung, squamous cell carcinoma | 176 | 6.00 | 0.03 | 8.22 | 0.91 |
| Lymphoma, B cell | 24 | 0.90 | 0.34 | 5.46 | 0.05 |
| Medulloblastoma | 100 | 16.16 | 1.00 × 10^−5 | 3.06 | 2.60 × 10^−4 |
| Melanoma | 514 | 3.25 | 2.13 × 10^−3 | 0.00 | 1.00 |
| Multiple myeloma | 69 | 3.11 | 1.94 × 10^−3 | 0.17 | 0.93 |
| Nasopharyngeal carcinoma | 55 | 2.62 | 0.87 | −4.44 | 0.71 |
| Neuroblastoma | 231 | −0.23 | 0.89 | 25.80 | 1.00 × 10^−5 |
| Ovarian cancer | 466 | 4.01 | 2.50 × 10^−4 | 2.42 | 0.84 |
| Pancreatic cancer | 231 | 14.73 | 0.04 | 7.47 | 0.69 |
| Paraganglioma | 179 | 1.85 | 0.08 | 2.49 | 0.06 |
| Pilocytic astrocytoma | 101 | 0.65 | 0.01 | 1.05 | 0.10 |
| Prostate cancer | 520 | 5.62 | 0.41 | 8.31 | 0.02 |
| Stomach cancer | 472 | 23.73 | 2.30 × 10^−4 | 6.04 | 0.56 |
| Thyroid cancer | 404 | 0.66 | 0.33 | 6.39 | 1.00 × 10^−5 |
| Urothelial carcinoma | 26 | −4.85 | 0.94 | −15.75 | 0.75 |
| Uterine carcinoma | 241 | 4.28 | 0.76 | 9.68 | 0.06 |
| Uterine carcinosarcoma | 57 | 4.51 | 0.82 | 5.53 | 0.84 |
| Uveal melanoma | 80 | 1.97 | 0.55 | 2.26 | 0.77 |

Somatic substitutions per gigabase per year for signatures 1 and 5 for all examined cancer types, including P values and the number of samples examined in each cancer type. Rates of mutation accumulation and P values for all mutational signatures in all cancer types are provided in Supplementary Data Set 3.

We then compared the signature 1 and signature 5 mutation rates between different tissue types. Signature 1 mutation rates showed substantial variation, being high in stomach cancer (23.7 mutations/Gb/year), colorectal cancer (23.4), glioblastoma multiforme (19.8), esophagus cancer (19.6), medulloblastoma (16.1) and pancreas cancer (14.7) in comparison to ovary cancer (4.0 mutations/Gb/year), breast cancer (3.7), melanoma (3.2), myeloma (3.1) and pilocytic astrocytoma (0.65) (Fig. 3 and Supplementary Fig. 3). In breast, the rates were similar for estrogen receptor–positive (3.9 mutations/Gb/year) and estrogen receptor–negative (3.1) cancers (Supplementary Fig. 4).

On the basis of similarities of mutational signature, the mutational process underlying signature 1 is likely to be deamination of 5-methylcytosine at CpG dinucleotides leading to T•G mismatches, which are not repaired before DNA replication7. It seems unlikely that the observed variation in signature 1 mutation rate between cell types is simply due to differences in the extent of CpG methylation, as methylation levels at these dinucleotides are similar in most cell types3,8, although it could be due to differences in rates of cytosine deamination and/or thymine excision at T•G mismatches by thymine DNA glycosylase or mismatch repair.

It is notable, however, that many cancer types with high signature 1 mutation rates are derived from normal epithelia with high turnover, for example, stomach and colorectum (P value = 0.0033; Supplementary Fig. 5, Supplementary Table 1 and Supplementary Data Set 6). Because DNA replication without previous repair will convert T•G mismatches arising from deamination of 5-methylcytosine into C>T mutations, it is plausible that cell types with high mitotic rates exhibit higher mutation rates as a result of this mutational process. If correct, this interpretation indicates that the signature 1 mutation rate can serve as a clock registering the number of mitoses a cell has experienced during the lineage of cell divisions from the fertilized egg.

The signature 5 mutation rate also showed substantial variation between cancer types.
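
The per-cancer-type counts quoted above (17 types for signature 1, 12 for signature 5, 26 of 36 for either) can be re-derived from the P values in Table 1. The sketch below is a simple consistency check, assuming the P values as transcribed from the table and the stated threshold of P value < 0.05; the variable names are illustrative.

```python
# (cancer type, signature 1 P value, signature 5 P value), transcribed from Table 1.
TABLE_1 = [
    ("Acute lymphoid leukemia (ALL)", 0.24, 5.80e-4),
    ("Acute myeloid leukemia (AML)", 0.77, 0.02),
    ("Adrenocortical carcinoma", 0.89, 0.78),
    ("Bladder cancer", 2.54e-3, 0.82),
    ("Brain, adult lower-grade glioma", 1.00e-5, 7.00e-5),
    ("Breast cancer", 1.00e-5, 1.00e-5),
    ("Cervical cancer", 3.70e-3, 0.73),
    ("Chronic lymphocytic leukemia (CLL)", 0.50, 0.07),
    ("Colorectal cancer", 1.00e-5, 0.80),
    ("Esophageal cancer", 9.00e-4, 0.94),
    ("Glioblastoma multiforme", 3.40e-4, 0.70),
    ("Head and neck cancer", 5.60e-4, 0.97),
    ("Kidney, chromophobe", 0.27, 0.03),
    ("Kidney, renal clear cell carcinoma", 0.93, 1.00e-5),
    ("Kidney, renal papillary cell carcinoma", 0.88, 8.00e-5),
    ("Liver cancer", 0.44, 0.02),
    ("Lung, adenocarcinoma", 0.03, 1.00),
    ("Lung, small cell carcinoma", 0.99, 0.81),
    ("Lung, squamous cell carcinoma", 0.03, 0.91),
    ("Lymphoma, B cell", 0.34, 0.05),
    ("Medulloblastoma", 1.00e-5, 2.60e-4),
    ("Melanoma", 2.13e-3, 1.00),
    ("Multiple myeloma", 1.94e-3, 0.93),
    ("Nasopharyngeal carcinoma", 0.87, 0.71),
    ("Neuroblastoma", 0.89, 1.00e-5),
    ("Ovarian cancer", 2.50e-4, 0.84),
    ("Pancreatic cancer", 0.04, 0.69),
    ("Paraganglioma", 0.08, 0.06),
    ("Pilocytic astrocytoma", 0.01, 0.10),
    ("Prostate cancer", 0.41, 0.02),
    ("Stomach cancer", 2.30e-4, 0.56),
    ("Thyroid cancer", 0.33, 1.00e-5),
    ("Urothelial carcinoma", 0.94, 0.75),
    ("Uterine carcinoma", 0.76, 0.06),
    ("Uterine carcinosarcoma", 0.82, 0.84),
    ("Uveal melanoma", 0.55, 0.77),
]

ALPHA = 0.05  # significance threshold stated in the text

sig1 = {name for name, p1, _ in TABLE_1 if p1 < ALPHA}
sig5 = {name for name, _, p5 in TABLE_1 if p5 < ALPHA}

n_types = len(TABLE_1)
n_sig1 = len(sig1)
n_sig5 = len(sig5)
n_both = len(sig1 & sig5)
n_either = len(sig1 | sig5)
print(n_types, n_sig1, n_sig5, n_both, n_either)
```

With the strict threshold, a P value reported as exactly 0.05 (B cell lymphoma, signature 5) is not counted, which is what makes the totals agree with the 17, 12 and 26 given in the text.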
[Figure 3 panels, one per cancer type: Colorectal cancer; Stomach cancer; Esophagus cancer; Medulloblastoma; Pancreas cancer; Cervix cancer; Head and neck cancer; Glioma, low grade; Glioblastoma multiforme; Bladder cancer; ALL; Prostate cancer; Uterine cancer; Ovarian carcinoma; Breast cancer; Melanoma; Lung squamous cell carcinoma; Lung adenocarcinoma; Kidney, chromophobe; Myeloma; AML; Thyroid cancer; Pilocytic astrocytoma; Kidney, clear cell; Neuroblastoma; Kidney, papillary; Liver cancer. y axes: mutations/Gb (0 to 4,000) for signature 1 and signature 5; x axes: age of cancer diagnosis (years, 0 to 100). Each panel reports one P value per signature.]

Figure 3 Correlations between ages of cancer diagnosis and mutations attributed to signatures 1 and 5. The y axes show the numbers of somatic substitutions per gigabase attributed to either signature 1 or signature 5, and the x axes show ages of cancer diagnosis. Each panel corresponds to a cancer type, and panels are sorted in decreasing order by the estimated slope for signature 1. Each dot represents the median number of somatic mutations for all cancers from individuals of a given age. Red and green lines show best estimates for the slopes (that is, somatic mutations accumulated with time) of signatures 1 and 5, respectively; 95% confidence intervals for the slopes are shown in lighter green and lighter red shading. Note that, for several cancer types, slopes extend far beyond the available data points; this representation is not intended to be a prediction, but rather it is shown for consistent presentation across all panels in the figure. Slopes and P values are also provided in Table 1. Panels showing mutational burdens in individual samples in each of the cancer types are provided in Supplementary Figure 3. Furthermore, the slopes for each cancer type are depicted in Supplementary Figures 8–43. ALL, acute lymphoid leukemia; AML, acute myeloid leukemia.

It was high in papillary cell kidney cancer
(31.8 mutations/Gb/year), neuroblastoma (25.8) and clear cell kidney cancer (22.7) in comparison to breast cancer (5.3 mutations/Gb/year), chromophobe kidney cancer (5.1), medulloblastoma (3.0) and acute myeloid leukemia (2.8). The mutational process underlying signature 5 is not well understood. This signature primarily features C>T and T>C transitions. Such mutations can be explained by replication of deaminated cytosine (uracil, which is read as thymine) and adenine (hypoxanthine, which is read as guanine, resulting in A>G•T>C transition). However, in addition, the T>C mutations identified exhibit transcriptional strand bias, potentially indicating that some of these mutations arise from adducts subject to transcription-coupled repair9. The signature 5 mutation rate is high in clear cell and papillary kidney cancers, which are thought to originate from kidney proximal tubular epithelium, which absorbs metabolites, but low in chromophobe kidney tumors, which may arise from cells of the cortical collecting duct10. This observation raises the possibility that continuous exposure to a ubiquitous metabolic mutagen, which is actively reabsorbed in the kidney proximal tubule resulting in elevated exposure in these cells, may underlie signature 5. In some tumor types, a correlation with age of diagnosis was not observed, despite substantial numbers of signature 5 mutations (for example, head and neck cancer, colorectal carcinoma and lung squamous cell carcinoma; Fig. 3), and the absence of correlation is thus unlikely to be due to limitations of statistical power. One possible explanation is that the mutational process underlying signature 5 is
substantially activated by other factors during life or as part of the neoplastic phenotype in some tumor classes, thus obscuring the correlation between signature 5 mutations generated by the clock-like process and age of diagnosis.
Across tumor types, signature 1 and 5 mutation rates do not closely correlate with each other (Spearman rank correlation = −0.08 and P value = 0.63). For example, in medulloblastoma, the signature 1 rate is 16.1 mutations/Gb/year and the signature 5 rate is 3.0, whereas, in papillary kidney cancer, the rates for signatures 1 and 5 are −0.3 and 31.9, respectively (Fig. 3 and Table 1). Thus, the biological determinants of the mutation rates of the two processes may be different, and cell proliferation rate may not be a major factor for signature 5 in contrast to its influence on signature 1.

DISCUSSION
Peering through the ‘cracked lens’ of cancer genomes may obscure or distort the estimates of clock-like mutation rates for the normal cells that are progenitors of the cancers. The data originate from dozens of laboratories, multiple sequencing platforms and many mutation-calling algorithms. They include subclonal mutations, which occur after the last dominant clonal expansion, to different extents and are from samples with varying amounts of normal tissue contamination. The numbers of signature 1 and 5 mutations have been estimated from mutational catalogs to which multiple other mutational processes have often contributed and may still be affected by the presence of these processes, despite extraction by our method. The simple, pragmatic classification of cancer types used is likely, in many instances, to hide greater complexity of biological subclass, and each subclass could derive from a distinct type of non-neoplastic cell characterized by different signature 1 and 5 mutation rates.
The mutation rate estimates are based on age of cancer diagnosis as a surrogate for the age at which the driver mutation initiated the last clonal expansion, and several years may intervene between these two points in time (and the length of this interval may differ between tumor types). As shown in the simulations, if substantial numbers of signature 1 and 5 mutations occur after neoplastic transformation, they could obscure clock-like processes and affect the estimated mutation rates. Finally, the profiles of signatures 1 and 5 may be further refined in the future, and this may also affect estimates of mutation rate.
Signatures 1 and 5 demonstrate variability in the numbers of mutations per megabase, even for samples of the same cancer type and age of diagnosis (Supplementary Data Set 2). Although some of this variability may be attributable to limitations of the data and analysis described above, it is also plausible that some reflects biological variation. For example, the rates of the clock-like mutational processes may vary between individuals depending on environment or lifestyle and inherited predisposition, and the ancestor cells of some tumors may acquire mutator phenotypes for signatures 1 or 5 decades before the last clonal expansion. Future studies will be needed to evaluate the effects of these factors on the rates of clock-like mutational processes.
Remarkably, despite these multiple muddying influences, the clock-like nature of signatures 1 and/or 5 is visible in most cancer types. The proposition that these clock-like mutations derive from normal cells is supported by the observation that the profile of signatures 1 and 5 combined is very similar to the somatic mutational patterns observed in the small set of non-neoplastic human somatic cells sequenced thus far11. Moreover, the combination of signatures 1 and 5 also recapitulates the pattern of de novo mutations found in the human germ line (data from refs.
12–14), and this de novo germline pattern cannot be parsimoniously generated by other combinations of known mutational signatures (Supplementary Fig. 6).
The results therefore provide the first survey of clock-like somatic mutational processes over a broad range of normal human cell types and quantification of the mutation rates exhibited by these mutational processes. They indicate that there are two clock-like mutational signatures, that both signature 1 and signature 5 mutation rates differ widely between cell types, that the biological factors that determine these rates are different for signatures 1 and 5, that cell proliferation rate may be one of the dominant factors influencing the mutation rate of signature 1 and that signature 5 may be activated by non-clock-like influences and/or as part of the neoplastic process. Despite the ubiquity of both signatures in normal somatic cells and their likely presence in the germ line, generating the sequence variation underlying human phenotypes in health and disease, we have hardly any understanding of the biological processes underlying at least one of them, signature 5. The signature 1 and 5 mutation rates will be refined over the next several years by the direct deployment of large-scale sequencing of single normal cells and will provide the basis for future exploration of the range of mutational processes and their rates in human cells affected by mutagenic exposures, in precancerous neoplastic cells and in cells involved in disease processes other than cancer in which mutation rates may be affected.

URLs. COSMIC Signatures of Mutational Processes in Human Cancer website, http://cancer.sanger.ac.uk/cosmic/signatures; MathWorks Mutational Signature Framework, http://www.mathworks.com/matlabcentral/fileexchange/38724.

METHODS
Methods and any associated references are available in the online version of the paper.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.

ACKNOWLEDGMENTS
We would like to thank M.E. Hurles and R. Durbin for early discussions about the analyses performed. We would like to thank The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and the authors of all previous studies cited in Supplementary Data Set 1 for providing free access to their somatic mutational data. This work was supported by the Wellcome Trust (grant 098051). S.N.-Z. is a Wellcome-Beit Prize Fellow and is supported through a Wellcome Trust Intermediate Fellowship (grant WT100183MA). P.J.C. is personally funded through a Wellcome Trust Senior Clinical Research Fellowship (grant WT088340MA). J.E.S. is supported by an MRC grant to the Laboratory of Molecular Biology (MC_U105178808). L.B.A. is supported through a J. Robert Oppenheimer Fellowship at Los Alamos National Laboratory. P.H.J. is supported by the Wellcome Trust, an MRC Grant-in-Aid and Cancer Research UK (programme grant C609/A17257). This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the US Department of Energy National Nuclear Security Administration under contract DE-AC52-06NA25396. Research performed at Los Alamos National Laboratory was carried out under the auspices of the National Nuclear Security Administration of the US Department of Energy.

AUTHOR CONTRIBUTIONS
L.B.A. and M.R.S. conceived the overall approach and wrote the manuscript. L.B.A., P.H.J., S.N.-Z. and M.R.S. carried out signature and/or statistical analyses with assistance from D.C.W., J.E.S. and P.J.C.

COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online version of the paper.

Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.

1. Stratton, M.R., Campbell, P.J. & Futreal, P.A.
The cancer genome. Nature 458, 719–724 (2009).
2. Alexandrov, L.B. & Stratton, M.R. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr. Opin. Genet. Dev. 24, 52–60 (2014).
3. Helleday, T., Eshtad, S. & Nik-Zainal, S. Mechanisms underlying mutational signatures in human cancers. Nat. Rev. Genet. 15, 585–598 (2014).
4. Alexandrov, L.B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
5. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast cancers. Cell 149, 979–993 (2012).
6. Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J. & Stratton, M.R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3, 246–259 (2013).
7. Bell, S.P. & Dutta, A. DNA replication in eukaryotic cells. Annu. Rev. Biochem. 71, 333–374 (2002).
8. Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 14, R115 (2013).
9. Fousteri, M. & Mullenders, L.H. Transcription-coupled nucleotide excision repair in mammalian cells: molecular mechanisms and biological effects. Cell Res. 18, 73–84 (2008).
10. Davis, C.F. et al. The somatic genomic landscape of chromophobe renal cell carcinoma. Cancer Cell 26, 319–330 (2014).
11. Welch, J.S. et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 150, 264–278 (2012).
12. Kong, A. et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488, 471–475 (2012).
13. Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
14. Conrad, D.F. et al. Variation in genome-wide mutation rates within and between human families. Nat. Genet. 43, 712–714 (2011).

ONLINE METHODS
Curation of freely available cancer samples. No data were generated for this study. Rather, data curation was performed to annotate freely available cancer genomes. Somatic mutations from 10,250 genome pairs (consisting of a cancer genome and the genome of a matched normal tissue) were curated. Of the 10,250 matched normal pairs, 607 had their whole genome sequenced, whereas 9,643 underwent whole-exome sequencing. Data were retrieved from three sources: (i) the data portal of The Cancer Genome Atlas (TCGA), (ii) the data portal of the International Cancer Genome Consortium (ICGC) and (iii) previously published data. Information for each sample is provided in Supplementary Data Set 1. The somatic mutations for all samples are freely available and can be retrieved on the basis of the information provided in Supplementary Data Set 1.
Filtering mutations, generating mutational catalogs and displaying signatures. This study relies on previously sequenced cancer genomes and on the subsequently used bioinformatics to identify cancer-specific somatic mutations. The data were filtered before analysis as previously described in ref. 4, and mutational catalogs were generated using Ensembl Core APIs for human genome build GRCh37. The prevalence of somatic mutations in each sample was estimated on the basis of a haploid human genome after filtering as previously described in ref. 4. Mutational signatures are displayed according to the observed trinucleotide frequency of the human genome.
Refined approach for deciphering mutational signatures. The mutational catalogs of all samples were examined following two steps. Initially, de novo extraction was performed to derive the set of novel consensus mutational signatures. Briefly, mutational signatures were deciphered independently for each of the 36 cancer types using our MATLAB framework4. The computational framework for deciphering mutational signatures is freely available for download from MathWorks.
The algorithm deciphers the minimal set of mutational signatures that optimally explains the proportion of each mutation type found in each catalog and then estimates the contribution of each signature to each mutational catalog. Mutational signatures were extracted separately for genomes and exomes. Mutational signatures extracted from exomes were normalized to the trinucleotide frequency of the human genome. All mutational signatures were clustered using unsupervised agglomerative hierarchical clustering, and a threshold was selected to identify the set of consensus mutational signatures. Misclustering of signatures was avoided as previously described in ref. 4. Overall, we identified 33 consensus mutational signatures. Signatures 1 through 28 (note that signature 25 is not found in this data set) were validated, and these processes thus most likely reflect biological processes. Signatures R1 through R3 were previously found in ref. 4 and attributed to sequencing artifacts. We were not able to perform validation for signatures U2 through U4 because we did not have access to the respective biological samples or BAM files. The de novo extraction was used to identify the set of consensus mutational signatures across the 10,250 samples examined. This first step of extracting mutational signatures and generating consensus patterns follows the previously introduced approach in ref. 4. However, our previous methodology did not use consensus mutational signatures to evaluate their contributions in each sample, thus not allowing accurate comparison of mutation rates between different cancer types. To address this limitation, we refined our approach by introducing another step of analysis, which is focused on accurately estimating the numbers of mutations associated with each consensus signature in each sample. We usually refer to this number of somatic mutations as either the ‘contribution’ of a mutational signature or the ‘exposure’ to a mutational signature. 
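As a rough illustration of what estimating the 'exposure' of one sample to a fixed set of consensus signatures involves, the sketch below fits non-negative exposures to a toy mutational catalog by least squares. It is not the MATLAB framework used in this study. The multiplicative-update solver, the function name `estimate_exposures`, the four-channel toy signatures (real signatures have 96 mutation types) and the final rescaling of the exposures to the catalog's mutation total are all assumptions made for illustration, and the sample-specific choice of candidate signatures is not modeled.

```python
def estimate_exposures(catalog, signatures, n_iter=500):
    """Least-squares fit of non-negative exposures E so that sum_i S_i * E_i ~ catalog.

    catalog: mutation counts per mutation type for one sample.
    signatures: list of signatures, each a list of probabilities per mutation type.
    Solver: multiplicative updates (an illustrative choice, not the published code).
    """
    q = len(signatures)
    total = float(sum(catalog))
    if total == 0:
        return [0.0] * q
    # Start from an even split of the sample's mutations across signatures.
    exposures = [total / q] * q
    # Pre-compute S^T M and S^T S.
    stm = [sum(s_k * m_k for s_k, m_k in zip(s, catalog)) for s in signatures]
    sts = [[sum(a * b for a, b in zip(si, sj)) for sj in signatures]
           for si in signatures]
    for _ in range(n_iter):
        updated = []
        for i in range(q):
            denom = sum(sts[i][j] * exposures[j] for j in range(q))
            updated.append(exposures[i] * stm[i] / denom if denom > 0 else 0.0)
        exposures = updated
    # Assumption: rescale so exposures add up to the sample's mutation count.
    fitted = sum(exposures)
    if fitted > 0:
        exposures = [e * total / fitted for e in exposures]
    return exposures


# Toy example with two hypothetical 4-channel signatures.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.1, 0.1, 0.7]
catalog = [75, 15, 15, 45]  # built as 100 mutations of sig_a plus 50 of sig_b
exposures = estimate_exposures(catalog, [sig_a, sig_b])
```

Because the toy catalog is an exact mixture of the two signatures, the fitted exposures should come out close to 100 and 50. Real catalogs are noisy and involve more signatures, so the fit is only approximate there.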
Calculating the contributions of all mutational signatures was performed by estimating the number of mutations associated with the consensus patterns of the signatures of all operative mutational processes in each cancer sample. This approach allows direct comparison between cancer types because identical signatures were used to estimate the contributions in each cancer type. More specifically, all consensus signatures were examined as a set $P$ containing 33 vectors

$$P = \left\{ \begin{bmatrix} p_1^{1} \\ \vdots \\ p_1^{96} \end{bmatrix}, \begin{bmatrix} p_2^{1} \\ \vdots \\ p_2^{96} \end{bmatrix}, \ldots, \begin{bmatrix} p_{32}^{1} \\ \vdots \\ p_{32}^{96} \end{bmatrix}, \begin{bmatrix} p_{33}^{1} \\ \vdots \\ p_{33}^{96} \end{bmatrix} \right\}$$

where each of the vectors is a discrete probability density function reflecting a consensus mutational signature. The 96 non-negative components of each vector correspond to mutation types (substitutions and their immediate sequence context) of the signatures. The contributions of the signatures were estimated independently for each of the 10,250 samples with a subset of consensus mutational signatures. For each sample, the estimation algorithm consists of finding the minimum of the Frobenius norm of a constrained linear function (see below for constraints) for a set of vectors $\vec{S}_{1 \ldots q}$, $q \le 33$, belonging to the subset $Q$, where $Q \subseteq P$:

$$\min \left\lVert \vec{M} - \sum_{i=1}^{q} \left( \vec{S}_i \times E_i \right) \right\rVert_F^{2} \qquad (1)$$

$Q$ is determined on the basis of the known operative mutational processes in the cancer type of the examined sample from the signature extraction process described above. For example, for any neuroblastoma sample, $Q$ will contain signatures 1, 5 and 18 because these are the only known signatures of mutational processes operative in neuroblastoma (Supplementary Data Set 2). In equation (1), $\vec{S}_i$ and $\vec{M}$ represent vectors with 96 non-negative components reflecting, respectively, a consensus mutational signature and the mutational catalog of a sample. Hence, $\vec{S}_i \in \mathbb{R}_{+}^{96}$ and $\vec{M} \in \mathbb{N}_{0}^{96}$. Further, both vectors have known numerical values either from the de novo extraction ($\vec{S}_i$) or from generating the original mutational catalog of the sample ($\vec{M}$). In contrast, $E_i$ corresponds to an unknown scalar reflecting the number of mutations contributed by signature $\vec{S}_i$ in the mutational catalog $\vec{M}$.
Minimization of equation (1) is performed under several biologically meaningful constraints. The set of vectors in the examined set $Q$ is constrained on the basis of previously identified biological features of the consensus mutational signatures. For example, consensus signature 6 causes high levels of indels at mono- and polynucleotide repeats4. Thus, this mutational signature will be excluded from the set $Q$ when the mutational catalog of an examined sample has only a few such indels. Similarly, there are signatures associated with other types of indels, transcriptional strand bias, dinucleotide mutations, hypermutator phenotypes, etc., and these signatures are included in the set $Q$ only when the sample in question exhibits one or more of these features. Lists of features associated with mutational signatures can be found in ref. 4. In addition to sample-specific constraints to the set $Q$, equation (1) was universally constrained with regard to the parameter $E_i$. These constraints can be mathematically expressed as $0 \le E_i \le \lVert \vec{S}_i \rVert_1$, $i = 1 \ldots q$ and $\sum_{i=1}^{q} E_i = \lVert \vec{S}_i \rVert_1$. All results for the contributions of all operative signatures in all samples from the hitherto described approach are provided in Supplementary Data Set 2, and the original somatic mutations can be found using Supplementary Data Set 1.
Factors influencing signature extraction. We have previously simulated data to describe a plethora of factors influencing the accuracy of mutational signatures extraction6.
Such factors include the number of available samples, the number of somatic mutations in a sample, the number of mutations contributed by different mutational signatures and the similarity between the patterns of the signatures of mutational processes operative in cancer samples, as well as the computational limitations of our framework. Nevertheless, in the past 3 years, our framework has proven robust and has described multiple similar and validated signatures across the spectrum of human cancer3–5,15–22.

doi:10.1038/ng.3441

Patterns of signatures 1 and 5. In a previous analysis4, we extracted 21 mutational signatures and noted that signatures 1A and 1B correlate with age of diagnosis for some cancer types. Further, we noted that, because signatures 1A and 1B “are almost mutually exclusive among tumour types they probably represent the same underlying process, with signature 1B representing less efficient separation from other signatures in some cancer types” (ref. 4). In our previous report, we referred to these two signatures as a single signature termed signature 1A/B. In the current analysis, encompassing ~50% more data and a refined algorithm, signature 1A is found in more cancer types, including some in which signature 1B was seen previously. A detailed examination of the pattern of signature 1B showed that this mutational signature is a linear combination of signatures 1A and 5. More specifically, a combination of signatures 1A and 5 can be used to account for 0.97 of the pattern of signature 1B (where 1.00 is perfect correlation), and no other combination of signatures can be used to explain signature 1B. Thus, in the current manuscript, signature 1B is no longer referred to, and, in cancer types from which it was extracted, signatures 1A and 5 have been reintroduced to assess mutation contributions.
Robustness and reproducibility of mutational signatures.
In this analysis, we use an elaborated version of our framework for extracting mutational signatures and apply it to a much larger data set. Comparison between the set of previously extracted mutational signatures4 and the set of mutational signatures found here shows both stability and reproducibility of mutational signatures. This reproducibility can be observed by comparing the mutational signatures on the COSMIC signatures website with the ones from ref. 4. Further, the similarity can also be quantified using a cosine similarity as previously done in ref. 6. The cosine similarity between any combination of signatures that were derived in ref. 4 and also found in this analysis (that is, signatures 1 through 21) is more than 0.97, where a similarity of 1.00 is an exact match.
Statistical analysis of relationships between age and mutations. Global analysis was performed for the 33 mutational signatures identified across all samples in all cancer types. Zero mutations were attributed to all signatures that were not found in a sample. The heteroscedasticity of the data and the presence of outliers mandate the use of an appropriate statistical approach23. We leveraged robust linear regression to evaluate linear dependencies between the numbers of mutations associated with each mutational signature across all samples examined and the ages of cancer diagnosis of these samples. The calculated P values from the applied robust regression were corrected for multiple-hypothesis testing using the Benjamini-Hochberg procedure. Only signatures 1 and 5 exhibited statistically significant correlation (q value < 0.05) with age of cancer diagnosis. Each cancer type was examined independently for a linear dependence between the ages of cancer diagnosis for the curated samples in that cancer type and the numbers of mutations attributed to each of the signatures of the operative mutational processes in that cancer type.
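The multiple-hypothesis correction mentioned above can be sketched in a few lines. The sketch implements the standard Benjamini-Hochberg step-up adjustment; the function name `benjamini_hochberg` and the example P values are illustrative assumptions and are not values from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values), in input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical P values for four signatures.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in q_values]
```

A signature would then be called significant when its q value falls below the chosen cutoff, for example 0.05 as used in the global analysis.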
Because most traditional or generalized linear regression approaches are very sensitive to outliers23 and because many of the cancer samples examined were hypermutators (outliers), we leveraged a robust regression model. The robust regression iteratively reweights least squares with a bi-square weighting function and overcomes some (if not most) of the limitations of traditional approaches24–26. Similarly, we report results using Spearman’s rank correlation coefficient because it is more robust to data outliers when compared to Pearson’s product-moment correlation coefficient27. It should be noted that, although samples with missing information about their age of cancer diagnosis were excluded from this analysis, these samples were used in the de novo extraction of mutational signatures and subsequent estimation of the signatures’ contributions. Each signature was examined separately in each cancer type in which that signature was identified. Examination was based on a robust linear regression model that estimates the slope of the line and whether this slope is significantly different from a horizontal line with a slope of zero (F test, P value <0.05) as well as by calculating the Spearman’s rank correlation coefficient. Although robust linear regression models provide confidence intervals and P values, we decided to take a more conservative approach and report the results after bootstrapping the data. Bootstrapping (random sampling with replacement) was used to derive measures of accuracies: the best fit for the slope and the slope’s 95% confidence interval. In total, we performed 100,000 bootstrapping iterations per signature per cancer type (total of ~2 × 10^7 iterations). Each of the iterations for which the robust regression returned a P value <0.05 was considered statistically significant, whereas iterations with P value ≥0.05 were considered not statistically significant.
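As an illustration of the robust regression described above (iteratively reweighted least squares with a bi-square weighting function), the following simplified sketch fits a straight line to data containing one gross outlier. It is not the code used in this study. The function names, the tuning constant 4.685 (a commonly used default for bisquare weights), the scale estimate based on the median absolute residual and the absence of a leverage adjustment are all assumptions made for illustration.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def robust_line_fit(x, y, tuning=4.685, max_iter=50, tol=1e-9):
    """Robust line fit by iteratively reweighted least squares (bisquare weights)."""
    weights = [1.0] * len(x)
    intercept, slope = weighted_line_fit(x, y, weights)  # ordinary least squares
    for _ in range(max_iter):
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate from the median absolute residual.
        scale = statistics.median(abs(r) for r in residuals) / 0.6745
        if scale < 1e-12:
            break  # at least half of the points are already fitted exactly
        weights = []
        for r in residuals:
            u = r / (tuning * scale)
            weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
        new_intercept, new_slope = weighted_line_fit(x, y, weights)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope


# Toy data: a perfect line y = 2x + 1 with one hypermutator-like outlier.
ages = list(range(10))
counts = [2 * a + 1 for a in ages]
counts[9] = 100
ols_intercept, ols_slope = weighted_line_fit(ages, counts, [1.0] * len(ages))
rob_intercept, rob_slope = robust_line_fit(ages, counts)
```

In this toy case the ordinary least-squares slope is pulled strongly upward by the single outlier, whereas the reweighted fit down-weights that point and returns a slope close to 2.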
The overall P value per signature per cancer type was calculated as follows:

$$P = \frac{\text{Number of non-significant iterations} + 1}{100{,}000 + 1}$$

It should be noted that the number of iterations limits the minimum possible P value, in this case, 9.99 × 10^−6, and P values reported to be equal to 9.99 × 10^−6 are most likely lower. The reported P values and confidence intervals are the ones after applying the bootstrapping procedure. MATLAB code for calculating the P values across individual cancer types is provided as Supplementary Software. The results of estimating the line’s slope by robust regression and the Spearman’s rank correlation coefficient for each of the signatures in each of the cancer types can be found in Supplementary Data Set 3. As before, we have used P value <0.05 to identify statistically significant dependencies. However, this cutoff is arbitrary, and summarized results using different cutoff values are shown in Supplementary Figure 2.
Examination of individual cancer types is based on the hypothesis that signatures 1 and 5 are the only signatures reflecting the activity of clock-like mutational processes. This hypothesis was constructed by examining the activity of all signatures across all cancer types (signature 1, FDR-corrected for all 33 signatures q value = 4.7 × 10^−162; signature 5, FDR-corrected for all 33 signatures q value = 2.1 × 10^−46). Thus, for our analysis of individual cancer types, the P values reported in the main manuscript have not been corrected for multiple-hypothesis testing. Nevertheless, for consistency, we have provided P values corrected for multiple-hypothesis testing using the Benjamini-Hochberg procedure for each cancer type in Supplementary Data Set 3. It should be noted that using FDR-corrected P values to evaluate the significance of the analysis does not affect the overall message of the manuscript.
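The bootstrapped P value defined above can be sketched as follows. This is not the Supplementary Software. The function names, the use of the median bootstrap slope as the best fit, the percentile-based confidence interval and the `fit_slope` callable (a hypothetical stand-in for the robust regression and its significance test, returning a slope and a P value) are assumptions made for illustration.

```python
import random


def bootstrap_slope(ages, mutations, fit_slope, n_iter=1000, alpha=0.05, seed=0):
    """Bootstrap a slope estimate and the overall P value.

    fit_slope(ages, mutations) must return (slope, p_value) for one data set.
    Returns (median slope, (lower, upper) 95% interval, overall P value).
    """
    rng = random.Random(seed)
    n = len(ages)
    slopes = []
    non_significant = 0
    for _ in range(n_iter):
        # Resample (age, mutation count) pairs with replacement.
        picks = [rng.randrange(n) for _ in range(n)]
        slope, p_value = fit_slope([ages[i] for i in picks],
                                   [mutations[i] for i in picks])
        slopes.append(slope)
        if p_value >= alpha:
            non_significant += 1
    slopes.sort()
    # Percentile-based 95% confidence interval (an assumed choice of interval).
    lower = slopes[int(0.025 * (n_iter - 1))]
    upper = slopes[int(0.975 * (n_iter - 1))]
    overall_p = (non_significant + 1) / (n_iter + 1)
    return slopes[n_iter // 2], (lower, upper), overall_p


def smallest_possible_p(n_iter):
    """Lower bound on the overall P value for a given number of iterations."""
    return 1 / (n_iter + 1)


floor_p = smallest_possible_p(100_000)  # 1 / 100001, about 9.9999e-06
```

The floor computed in the last line is why a reported P value equal to the minimum only means that no bootstrap iteration was non-significant.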
Lastly, we also evaluated (following the hitherto described approach) whether there is a linear dependency between the total numbers of somatic mutations and/or the C>T mutations at CpG dinucleotides and the ages of cancer diagnosis. Similarly, this examination was done separately for each of the cancer types, and the results from the analysis can be found in Supplementary Data Set 5.
Evaluating the robustness and limitations of the analysis performed with simulated data. A myriad of known and unknown processes may be affecting the analyses performed. Some of these include data generation by different institutes and laboratories, contamination of subclonal mutations, endogenous or exogenous factors affecting the rates of signatures 1 and 5, inaccuracies of the patterns of signatures 1 and 5, mutations generated during the developmental and/or neoplastic phases, limitations of the signature extraction algorithm, small numbers of samples and/or somatic mutations, misannotation of samples, etc. In principle, quantifying the overall error introduced by even a subset of these processes is impractical. To evaluate the robustness and limitations of our analysis, we simulated data with two types of mixture noise: (i) noise affecting the bona fide somatic mutations associated with a clock-like signature of a mutational process operative in a sample and (ii) noise affecting the age of cancer diagnosis of a sample. It was assumed that the mixture of all factors affecting the bona fide number of mutations associated with a clock-like signature in a sample reflects a mixture of random processes and thus can be approximated by white additive Gaussian noise. Further, folded normal Gaussian noise (half-normal distribution) with a mean value of 2 years and s.d. of 4 years was added to the age of cancer diagnosis of a sample.
This age noise reflects average cancer detection within 2 years of neoplastic initiation, with cancers detected in 84% and 98% of patients within 6 and 10 years, respectively. The distribution is half-bounded, as a cancer cannot be detected before it has occurred. Clock-like mutational signatures were simulated in 100 cancers. The ages of diagnosis of the cancers were sampled with replacement from the data in Supplementary Data Set 1, and the mutation rates per year per gigabase (the slopes) were drawn from a uniform distribution between the minimum and maximum statistically significant rates detected by the analysis performed (Supplementary Data Set 3). In total, 17 simulation scenarios were performed, each with a different percentage of additive white Gaussian noise (Supplementary Fig. 7). The noise added to the bona fide somatic mutations was varied between 0% and 200%, where 0% reflects no noise and 200% corresponds to twice as much noise as genuine somatic mutations. Note that most cancer genomics papers report sensitivity and specificity rates of more than 90%, and thus the false positive rates derived from our simulations are probably overly pessimistic. In all scenarios, the noise added to the ages of diagnosis followed the folded normal distribution described above. Each simulation scenario was repeated 1,000 times, and the simulated data were analyzed to identify clock-like mutational signatures in exactly the same way as the experimental data used in this study. Any iteration with a statistically significant P value for a slope (P < 0.05) in which the simulated slope was within ±10% of the derived slope or within the 95% confidence interval of the derived slope was considered a genuine detection and thus a true positive result. In contrast, any other iteration with a statistically significant P value for a slope (P < 0.05) that did not satisfy these conditions was considered a false positive result. The results from the 17 scenarios showed that, when noise levels are less than 35%, our analysis finds the genuine slopes in ~90% of the iterations while yielding no more than 0.55% false positives (Supplementary Fig. 7). Increasing the noise level does not increase the number of false positives but rather reduces the number of genuinely detected slopes (true positives). Our simulations indicate that the confidence intervals of the majority of detected slopes include the genuine slope of a clock-like mutational signature and that the approach used yields few false positives.

Displaying age of diagnosis and clock-like mutational signatures. Linear relationships between the ages of cancer diagnosis and the mutations attributed to mutational signatures are displayed only for signatures 1 and 5, as no other mutational signature displayed statistically significant correlations (Supplementary Figs. 8–43 and Supplementary Data Set 3). These linear relationships are displayed both for the average mutational burden attributed to a signature (Fig. 3) and for all individual mutational burdens attributed to a signature (Supplementary Fig. 3). In both cases, the displayed slopes and their confidence intervals are those derived by the analysis described above and do not depend on the choice of depiction. For brevity, in Figure 3, linear relationships are displayed for only 27 of the 36 cancer types analyzed. Nevertheless, the data provided in Supplementary Data Sets 2 and 3 can be used to display all linear relationships.

15. Behjati, S. et al. Genome sequencing of normal cells reveals developmental lineages and mutational processes. Nature 513, 422–425 (2014).
16. Bolli, N. et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nat. Commun. 5, 2997 (2014).
17. Ju, Y.S. et al. Origins and functional consequences of somatic mitochondrial DNA mutations in human cancer. eLife 3 (2014).
18. Murchison, E.P. et al. Transmissible dog cancer genome reveals the origin and history of an ancient cell lineage. Science 343, 437–440 (2014).
19. Nik-Zainal, S. et al. Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer. Nat. Genet. 46, 487–491 (2014).
20. Gerlinger, M. et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat. Genet. 46, 225–233 (2014).
21. Yates, L.R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).
22. Wagener, R. et al. Analysis of mutational signatures in exomes from B-cell lymphoma cell lines suggest APOBEC3 family members to be involved in the pathogenesis of primary effusion lymphoma. Leukemia 29, 1612–1615 (2015).
23. Barnett, V. & Lewis, T. Outliers in Statistical Data (Wiley, 1994).
24. Holland, P.W. & Welsch, R.E. Robust regression using iteratively reweighted least-squares. Comm. Stat. Theory Methods A6, 813–827 (1977).
25. Huber, P.J. & Ronchetti, E. Robust Statistics (Wiley, 2009).
26. Street, J., Carroll, R. & Ruppert, D. A note on computing robust regression estimates via iteratively reweighted least squares. Am. Stat. 42, 152–154 (1988).
27. Abdullah, M.B. On a robust correlation coefficient. Statistician 39, 455–460 (1990).

NATURE GENETICS doi:10.1038/ng.3441
© 2015 Nature America, Inc. All rights reserved.
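The classification of simulated iterations described in the robustness analysis above can be illustrated with a short Python sketch. It is not the authors' code: ordinary least squares with a normal approximation for the P value stands in for the robust regression used in the study, the age pool is a hypothetical substitute for Supplementary Data Set 1, and all function names, the noise scaling, the example slope, and the 1.96 multiplier for the 95% confidence interval are assumptions.

```python
import math
import random
from statistics import NormalDist


def fit_slope(ages, counts):
    """Least-squares slope, its standard error, and a two-sided P value.

    Assumption: OLS with a normal approximation to the t statistic,
    used here in place of the study's robust regression.
    """
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(ages, counts))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0.0:  # perfect fit: avoid dividing by zero
        p = 0.0 if slope != 0.0 else 1.0
    else:
        p = 2.0 * (1.0 - NormalDist().cdf(abs(slope / se)))
    return slope, se, p


def classify(true_slope, slope, se, p, alpha=0.05):
    """Label one iteration using the true/false positive criteria in the text."""
    if p >= alpha:
        return "not significant"
    within_10pct = abs(true_slope - slope) <= 0.10 * abs(slope)
    within_ci = abs(true_slope - slope) <= 1.96 * se  # approximate 95% CI
    return "true positive" if (within_10pct or within_ci) else "false positive"


def simulate_once(true_slope, noise_fraction, age_pool, rng, n_cancers=100):
    """Simulate one iteration of one scenario and classify the fitted slope."""
    base_ages = [rng.choice(age_pool) for _ in range(n_cancers)]  # with replacement
    # Assumption: Gaussian noise s.d. scales with the noise-free mutation count.
    counts = [true_slope * a + rng.gauss(0.0, noise_fraction * true_slope * a)
              for a in base_ages]
    # Assumption: folded normal delay |N(2, 4)| added to each age.
    observed_ages = [a + abs(rng.gauss(2.0, 4.0)) for a in base_ages]
    slope, se, p = fit_slope(observed_ages, counts)
    return classify(true_slope, slope, se, p)


rng = random.Random(7)
age_pool = [35, 42, 48, 51, 56, 60, 63, 67, 71, 78]  # hypothetical ages
outcomes = [simulate_once(25.0, 0.35, age_pool, rng) for _ in range(200)]
print({label: outcomes.count(label) for label in set(outcomes)})
```

Repeating `simulate_once` over a grid of `noise_fraction` values would mimic the structure of the 17 scenarios, although the counts it produces should not be expected to reproduce the figures reported in the study.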