ANALYSIS
Clock-like mutational processes in human somatic cells
npg
© 2015 Nature America, Inc. All rights reserved.
Ludmil B Alexandrov1–3, Philip H Jones1,4, David C Wedge1, Julian E Sale5, Peter J Campbell1,6,
Serena Nik-Zainal1,7 & Michael R Stratton1
During the course of a lifetime, somatic cells acquire
mutations. Different mutational processes may contribute to
the mutations accumulated in a cell, with each imprinting a
mutational signature on the cell’s genome. Some processes
generate mutations throughout life at a constant rate in all
individuals, and the number of mutations in a cell attributable
to these processes will be proportional to the chronological
age of the person. Using mutations from 10,250 cancer
genomes across 36 cancer types, we investigated clock-like
mutational processes that have been operating in normal
human cells. Two mutational signatures show clock-like
properties. Both exhibit different mutation rates in different
tissues. However, their mutation rates are not correlated,
indicating that the underlying processes are subject to
different biological influences. For one signature, the rate
of cell division may influence its mutation rate. This study
provides the first survey of clock-like mutational processes
operating in human somatic cells.
The mutational processes that generate somatic mutations in normal
cells are not well understood, and quantification of their in vivo mutation rates is lacking for almost all human cell types. These metrics
are likely to be fundamental to an understanding of cancer development and aging. Comprehensive investigation of in vivo somatic
mutation rates will ultimately depend on accurate, single-cell whole-genome sequencing of normal somatic cells. However, all cancers
are clonal cell populations expanded from single normal cells. To
a first approximation, the catalog of somatic mutations shared by
most members of a cancer cell population is the set that was present
in the progenitor cell of the final dominant clonal expansion of
the cancer. This catalog provides information on the mutational
processes to which the lineage of cells from the fertilized egg to that
progenitor cell has been exposed1. Under a simple model, this lineage has three phases: embryonic and fetal development; postnatal
life in normally functioning differentiated cells; and post-neoplastic
transformation in cancer cells (Fig. 1).

1Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, UK.
2Theoretical Biology and Biophysics (T-6), Los Alamos National Laboratory,
Los Alamos, New Mexico, USA. 3Center for Nonlinear Studies, Los Alamos
National Laboratory, Los Alamos, New Mexico, USA. 4Medical Research
Council (MRC) Cancer Unit, Hutchison/MRC Research Centre, University of
Cambridge, Cambridge, UK. 5MRC Laboratory of Molecular Biology, Cambridge,
UK. 6Department of Haematology, University of Cambridge, Cambridge, UK.
7Department of Medical Genetics, Addenbrooke’s Hospital National Health
Service (NHS) Trust, Cambridge, UK. Correspondence should be addressed to
L.B.A. (lba@lanl.gov) or M.R.S. (mrs@sanger.ac.uk).

Received 19 April; accepted 14 October; published online 9 November 2015;
doi:10.1038/ng.3441

In the time taken to establish this lineage, some mutational processes may have acted in an episodic manner, generating mutations in
bursts over short time periods. Others may have operated continuously, in a clock-like manner, generating mutations at a steady rate.
For such clock-like mutational processes, the number of mutations
acquired during embryonic and fetal development will be similar in
cancers of the same type from different individuals, as this phase is of
a fixed duration. Conversely, the same process operating in normally
functioning cells during postnatal life will result in a mutation load
that is proportional to the age of the person at the time the cancer is
sampled, with more mutations present in older individuals (Fig. 1).
The number of mutations acquired after initiation of neoplastic
change will be unrelated to age of diagnosis but will depend upon
the duration of the period between the first cancer driver mutation
and initiation of the final dominant clonal expansion and, potentially,
also upon changes to the mutation rate contingent upon acquiring
the neoplastic phenotype. The latter features may be highly variable
within and between cancer types.
Under this simple model, mutations with clock-like features in cancer genomes predominantly derive from the normal postnatal part
of the lineage. However, mutations from the developmental and/or
neoplastic phases could obscure the clock-like features of these mutations and affect estimation of the mutation rate during the normal
postnatal phase. To evaluate this possibility, we performed simulations
that showed that clock-like mutational processes can be detected and
that the mutation rates estimated are relatively unaffected by mutations from other phases, unless the mutations generated during the
developmental and/or neoplastic phases constitute the large majority
of the total number of mutations in the cancer. Therefore, analysis of
the several thousand cancer genomes thus far sequenced can provide a
first survey of the clock-like mutational processes operating in a wide
range of normal human cell types.
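
The reasoning behind these simulations can be illustrated with a minimal sketch. It is not the simulation code used in the study; the function names, the age range, the noise distributions and all parameter values below are illustrative assumptions. Each simulated individual receives a roughly constant developmental contribution, a postnatal contribution proportional to age, and a neoplastic contribution unrelated to age, and the clock-like rate is then recovered as the least-squares slope of mutation count against age.

```python
import random


def simulate_cohort(n, rate_per_year, dev_mean, neo_mean, seed=0):
    """Return (age, mutation_count) pairs under a three-phase model.

    All distributions here are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        age = rng.uniform(20, 90)
        # Developmental phase: fixed duration, so similar in every individual.
        developmental = rng.gauss(dev_mean, 0.1 * dev_mean)
        # Postnatal phase: clock-like, proportional to chronological age.
        postnatal = rate_per_year * age
        # Neoplastic phase: variable and unrelated to age at diagnosis.
        neoplastic = rng.expovariate(1.0 / neo_mean) if neo_mean > 0 else 0.0
        cohort.append((age, developmental + postnatal + neoplastic))
    return cohort


def least_squares_slope(pairs):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


if __name__ == "__main__":
    # Compare a cohort with few neoplastic mutations to one dominated by them.
    for neo_mean in (100.0, 20000.0):
        cohort = simulate_cohort(2000, rate_per_year=20.0, dev_mean=50.0,
                                 neo_mean=neo_mean, seed=1)
        print(neo_mean, round(least_squares_slope(cohort), 2))
```

In a sketch of this kind, the estimated slope stays close to the simulated rate while the age-independent contributions are small relative to the postnatal one, and it becomes increasingly unstable as they come to dominate the total.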
Different mutational processes generate distinct combinations of
mutation types in cancer genomes2,3. These characteristic imprints
of mutational processes have been termed ‘mutational signatures’.
We previously reported a mathematical approach and computational
framework to extract mutational signatures from catalogs of somatic
mutations in human cancers4–6. Using a 96-category classification
of base substitutions based on the type of substitution and the bases
immediately 5′ and 3′ to the mutated base, we identified 21 mutational
signatures operating over 30 cancer types4. Among these signatures,
the numbers of mutations associated with signature 1 correlated with
age of cancer diagnosis for some cancer types4.
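
The 96-category classification follows from simple counting: six substitution classes referred to the pyrimidine of the mutated base pair (C>A, C>G, C>T, T>A, T>C, T>G), each with four possible 5′ bases and four possible 3′ bases. The sketch below enumerates these categories and maps a substitution reported on either strand to its category. The label format `A[C>T]G` and the function names are assumptions made for illustration and are not taken from the framework described in refs. 4–6.

```python
from itertools import product

BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mutation_types():
    """Enumerate all 6 x 4 x 4 = 96 trinucleotide substitution types."""
    types = []
    for sub, five, three in product(SUBSTITUTIONS, BASES, BASES):
        ref, alt = sub.split(">")
        types.append(f"{five}[{ref}>{alt}]{three}")
    return types


def classify(five, ref, alt, three):
    """Map a substitution seen on either strand to its pyrimidine-referenced type.

    If the reference base is a purine, the event is reverse-complemented,
    which also swaps the 5' and 3' neighbours.
    """
    if ref in "AG":
        five, ref, alt, three = (
            three.translate(COMPLEMENT),
            ref.translate(COMPLEMENT),
            alt.translate(COMPLEMENT),
            five.translate(COMPLEMENT),
        )
    return f"{five}[{ref}>{alt}]{three}"


if __name__ == "__main__":
    print(len(mutation_types()))
    # A G>A change in the context 5'-CGT-3' is a C>T change at ACG on the other strand.
    print(classify("C", "G", "A", "T"))
```

Under this convention, a G>A substitution in the context 5′-CGT-3′ is counted as a C>T substitution at an NpCpG trinucleotide.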
VOLUME 47 | NUMBER 12 | DECEMBER 2015
NATURE GENETICS
Figure 1 A model for the accumulation of somatic mutations in cancers. (a) Cell lineages
are shown leading from fertilized egg to cancer cell, in five different individuals with cancer;
A, B, C, D and E. Orange, embryonic and fetal cell divisions; blue, postnatal divisions
of normal cells; brown, cell divisions after neoplastic change. (b) Accumulation of somatic
mutations due to clock-like and non-clock-like mutational signatures in the same five
patients. The correlation between age and the number of somatic mutations due to a clock-like
mutational process operating in normal postnatal cells is detectable using the mutations
found in cancers, with the rate relatively unaffected if the number of mutations acquired
during the embryonic and fetal phase and the neoplastic phase is limited. Note that this figure
is provided as a simple illustration of the activity of clock-like mutational processes, and it is
not intended to be a realistic representation of actual cancer samples. In reality, the numbers of cellular
divisions will be dependent on the tissue type and the numbers of neoplastic mutations may be many folds of magnitude higher.
Panel a axis: time. Panel b axes: biological age (x) and observed somatic mutations (y).
Mutation classes shown in panel b: all signatures (all somatic mutations); clock-like signature: total;
clock-like signature: postnatal normal; clock-like signature: neoplastic; clock-like signature: embryonic and fetal;
non-clock-like signatures.

Our previous analysis extracted mutational signatures separately
from each cancer type and then quantified the mutations contributed
by these signatures to each case of that cancer type. Many mutational
signatures are found in multiple different cancer types, and a central
scientific question to address is how the contributions of such signatures compare across cancer types. However, a particular mutational
signature found in multiple cancer types will be contaminated to differing extents by other signatures and by noise in each of the different
cancer types. Hence, our previous approach did not allow accurate
quantification of mutation rates for direct comparisons between cancer types. We have, therefore, reformulated the approach to derive a
single consensus version of each signature, and we used these consensus signatures to estimate the number of mutations contributed
to each cancer sample across all cancer types (Online Methods). Our
refined approach was applied to a larger data set of 7,329,860 somatic
mutations from 10,250 cancer genomes (Supplementary Data Sets 1
and 2) derived from diverse epithelial, mesenchymal, glial, hematopoietic and lymphoid cells that collectively constitute an extensive,
albeit incomplete sampling of normal cell types in the human body.
This analysis has then allowed us to estimate the contributions of
mutations to individual cancer cases across cancer types and hence
enabled comparison of the clock-like mutation rates that reflect mutations in normal tissues.

RESULTS
Applying our refined approach to 10,250 cancer samples resulted
in delineation of the patterns of 33 distinct mutational signatures.
We were able to perform validation for 29 of these 33 mutational
signatures using our established methodology for validating mutational signatures4. This new analysis confirmed the patterns of the
21 previously identified mutational signatures4, demonstrating the
robustness of the computational approach. Additionally, examining
this substantially larger data set allowed us to disentangle the patterns
of another eight distinct mutational signatures. A curated list of the
validated mutational signatures and the cancer types in which they are
present can be found at our Catalogue of Somatic Mutations in Cancer
(COSMIC) signatures website (see URLs). Note that signatures 25, 29
and 30 are not part of the analysis presented here because the relevant
samples were either cancer cell lines or lacked information about age
of diagnosis. Further, the list of mutational signatures on our website
does not include signatures corresponding to sequencing artifacts
and signatures for which validation has not been performed. We have,
however, included these mutational signatures in the current analysis,
and their patterns are shown in Supplementary Figure 1.

To identify mutational signatures showing clock-like behavior,
we first combined the mutations and samples from all cancer types.
Of the 33 signatures examined, signatures 1 and 5 showed a correlation between numbers of mutations and age of diagnosis, and,
for both signatures, the numbers of mutations increased with age
(signature 1, Spearman rank correlation = 0.34, false discovery rate
(FDR) corrected for all 33 signatures (q value) = 4.7 × 10^−162;
signature 5, Spearman rank correlation = 0.13, q value = 2.1 × 10^−46;
combining the numbers of mutations attributed to signatures 1 and
5 resulted in Spearman rank correlation = 0.37 and P value = 8.2
× 10^−254). No other mutational signature
exhibited a statistically significant correlation
(q value < 0.05) between the number of
mutations and age of cancer diagnosis.
The total number of somatic mutations in
each sample (Fig. 1) also exhibited a correlation with age of diagnosis across all samples
(Spearman rank correlation = 0.37 and
P value = 3.1 × 10^−215). However, after subtracting the numbers of mutations in signatures
1 and 5, which in aggregate only accounted
for 23% of the total number of mutations,
no correlation was found (P value = 0.21),
indicating that the correlation for all mutations is predominantly explained by mutations
belonging to signatures 1 and 5. C>T
mutations at NpCpG trinucleotides (often
termed CpG dinucleotides) constitute the
major component of signature 1, and their
numbers also showed correlation with age
(P value = 1.0 × 10^−189) (Fig. 2). Subtracting
the numbers of C>T mutations at NpCpG
sites from the numbers attributed to signature
1 left a residual correlation with age of cancer
diagnosis (P value = 1.4 × 10^−19), indicating
that, in addition to C>T mutations at NpCpG
sites, other components of this signature also
behave in a clock-like manner.

Twenty-six of 36 cancer types individually
showed correlations with age (P value < 0.05)
for signature 1 and/or signature 5 mutations
(Fig. 2, Table 1 and Supplementary Fig. 2).
Mutations associated with signature 1 were
correlated with age of diagnosis in 17 of
the cancer types, and mutations associated
with signature 5 were correlated with age of
diagnosis in 12 of the cancer types. In three
cancer types (breast cancer, low-grade glioma
and glioblastoma), the mutational burdens of
both signatures correlated with age of cancer
diagnosis. Although some cancer types exhibited
negative correlations, in all such cases the
correlations were statistically not significant
(Table 1). As in the analysis of all samples, no
other mutational signature showed a correlation with age of diagnosis
in any individual cancer type, although there was some correlation
with the total number of mutations and the number of C>T mutations
at NpCpG sites (Supplementary Data Sets 3–5).

Figure 2 Patterns of mutational signatures 1
and 5. The signatures are displayed according
to the 96-substitution classification defined
by substitution class and sequence context
immediately 5′ and 3′ to the mutated base. The
probability bars for the six substitution classes
(C>A, C>G, C>T, T>A, T>C and T>G)
are displayed in different colors. Mutation types
are shown on the x axes, and the y axes show
the percentage of mutations in the signature
attributed to each mutation type. Signatures
are displayed on the basis of the trinucleotide
frequencies of the whole human genome.

Table 1 Rates of somatic substitution accumulation for clock-like mutational signatures

                                        Number of   Signature 1              Signature 5
Cancer type                             samples     Slope    P value         Slope    P value
Acute lymphoid leukemia (ALL)           141         6.45     0.24            8.55     5.80 × 10^−4
Acute myeloid leukemia (AML)            202         0.80     0.77            2.89     0.02
Adrenocortical carcinoma                92          2.56     0.89            3.94     0.78
Bladder cancer                          238         8.07     2.54 × 10^−3    11.87    0.82
Brain, adult lower-grade glioma         465         10.02    1.00 × 10^−5    12.70    7.00 × 10^−5
Breast cancer                           1,170       3.71     1.00 × 10^−5    5.31     1.00 × 10^−5
Cervical cancer                         198         14.14    3.70 × 10^−3    6.57     0.73
Chronic lymphocytic leukemia (CLL)      131         −1.45    0.50            5.52     0.07
Colorectal cancer                       559         23.43    1.00 × 10^−5    −3.97    0.80
Esophageal cancer                       329         19.66    9.00 × 10^−4    −0.42    0.94
Glioblastoma multiforme                 332         19.85    3.40 × 10^−4    3.44     0.70
Head and neck cancer                    591         10.20    5.60 × 10^−4    2.20     0.97
Kidney, chromophobe                     65          3.18     0.27            5.16     0.03
Kidney, renal clear cell carcinoma      468         0.26     0.93            22.75    1.00 × 10^−5
Kidney, renal papillary cell carcinoma  169         −0.29    0.88            31.86    8.00 × 10^−5
Liver cancer                            290         −1.93    0.44            7.81     0.02
Lung, adenocarcinoma                    795         6.30     0.03            0.00     1.00
Lung, small cell carcinoma              69          0.6      0.99            5.58     0.81
Lung, squamous cell carcinoma           176         6.00     0.03            8.22     0.91
Lymphoma, B cell                        24          0.90     0.34            5.46     0.05
Medulloblastoma                         100         16.16    1.00 × 10^−5    3.06     2.60 × 10^−4
Melanoma                                514         3.25     2.13 × 10^−3    0.00     1.00
Multiple myeloma                        69          3.11     1.94 × 10^−3    0.17     0.93
Nasopharyngeal carcinoma                55          2.62     0.87            −4.44    0.71
Neuroblastoma                           231         −0.23    0.89            25.80    1.00 × 10^−5
Ovarian cancer                          466         4.01     2.50 × 10^−4    2.42     0.84
Pancreatic cancer                       231         14.73    0.04            7.47     0.69
Paraganglioma                           179         1.85     0.08            2.49     0.06
Pilocytic astrocytoma                   101         0.65     0.01            1.05     0.10
Prostate cancer                         520         5.62     0.41            8.31     0.02
Stomach cancer                          472         23.73    2.30 × 10^−4    6.04     0.56
Thyroid cancer                          404         0.66     0.33            6.39     1.00 × 10^−5
Urothelial carcinoma                    26          −4.85    0.94            −15.75   0.75
Uterine carcinoma                       241         4.28     0.76            9.68     0.06
Uterine carcinosarcoma                  57          4.51     0.82            5.53     0.84
Uveal melanoma                          80          1.97     0.55            2.26     0.77

Somatic substitutions per gigabase per year for signatures 1 and 5 for all examined cancer types, including P values
and the number of samples examined in each cancer type. Rates of mutation accumulation and P values for all mutational
signatures in all cancer types are provided in Supplementary Data Set 3.

We then compared the signature 1 and signature 5 mutation rates
between different tissue types. Signature 1 mutation rates showed
substantial variation, being high in stomach cancer (23.7 mutations/Gb/year),
colorectal cancer (23.4), glioblastoma multiforme (19.8),
esophagus cancer (19.6), medulloblastoma (16.1) and pancreas cancer
(14.7) in comparison to ovary cancer (4.0 mutations/Gb/year), breast
cancer (3.7), melanoma (3.2), myeloma (3.1) and pilocytic astrocytoma
(0.65) (Fig. 3 and Supplementary Fig. 3). In breast, the rates were
similar for estrogen receptor–positive (3.9 mutations/Gb/year) and
estrogen receptor–negative (3.1) cancers (Supplementary Fig. 4).

On the basis of similarities of mutational signature, the mutational
process underlying signature 1 is likely to be deamination of 5-methylcytosine
at CpG dinucleotides leading to T•G mismatches, which
are not repaired before DNA replication7. It seems unlikely that the
observed variation in signature 1 mutation rate between cell types is
simply due to differences in the extent of CpG methylation, as methylation
levels at these dinucleotides are similar in most cell types3,8,
although it could be due to differences in rates of cytosine deamination
and/or thymine excision at T•G mismatches by thymine DNA
glycosylase or mismatch repair.

It is notable, however, that many cancer types with high signature 1
mutation rates are derived from normal epithelia with high
turnover, for example, stomach and colorectum (P value = 0.0033;
Supplementary Fig. 5, Supplementary Table 1 and Supplementary
Data Set 6). Because DNA replication without previous repair will convert
T•G mismatches arising from deamination of 5-methylcytosine
into C>T mutations, it is plausible that cell types with high mitotic
rates exhibit higher mutation rates as a result of this mutational
process. If correct, this interpretation indicates that the signature 1
mutation rate can serve as a clock registering the number of mitoses
a cell has experienced during the lineage of cell divisions from the
fertilized egg.

The signature 5 mutation rate also showed substantial variation
between cancer types. It was high in papillary cell kidney cancer
(31.8 mutations/Gb/year), neuroblastoma (25.8) and clear cell kidney
cancer (22.7) in comparison to breast cancer (5.3 mutations/Gb/year),
chromophobe kidney cancer (5.1), medulloblastoma (3.0) and acute
myeloid leukemia (2.8).

Figure 3 Correlations between ages of cancer diagnosis and mutations attributed to signatures 1 and 5. The y axes show the numbers of somatic
substitutions per gigabase attributed to either signature 1 or signature 5, and the x axes show ages of cancer diagnosis. Each panel corresponds
to a cancer type, and panels are sorted in decreasing order by the estimated slope for signature 1. Each dot represents the median number of
somatic mutations for all cancers from individuals of a given age. Red and green lines show best estimates for the slopes (that is, somatic mutations
accumulated with time) of signatures 1 and 5, respectively; 95% confidence intervals for the slopes are shown in lighter green and lighter red shading.
Note that, for several cancer types, slopes extend far beyond the available data points; this representation is not intended to be a prediction, but rather
it is shown for consistent presentation across all panels in the figure. Slopes and P values are also provided in Table 1. Panels showing mutational
burdens in individual samples in each of the cancer types are provided in Supplementary Figure 3. Furthermore, the slopes for each cancer type are
depicted in Supplementary Figures 8–43. ALL, acute lymphoid leukemia; AML, acute myeloid leukemia.

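
The rates quoted above are slopes of mutations per gigabase against age of diagnosis. As a minimal sketch of how such a slope can be obtained, the code below takes the median mutation burden at each age, as plotted in Figure 3, and fits a straight line by ordinary least squares. The fitting procedure actually used is described in the Online Methods and may differ; ordinary least squares, the function names and the toy numbers are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def median_by_age(samples):
    """Collapse (age, mutations_per_gb) samples to one median value per age."""
    by_age = defaultdict(list)
    for age, mutations_per_gb in samples:
        by_age[age].append(mutations_per_gb)
    return sorted((age, median(values)) for age, values in by_age.items())


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical samples: (age of diagnosis in years, mutations per gigabase).
    toy_samples = [(40, 400), (40, 420), (40, 380),
                   (50, 500), (60, 610), (60, 590)]
    slope, intercept = fit_line(median_by_age(toy_samples))
    print(slope, intercept)  # slope is in mutations/Gb/year
```

The slope returned by `fit_line` has the same units as the rates in Table 1, mutations per gigabase per year.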
The mutational process underlying signature 5 is not well understood. This signature primarily features C>T and T>C transitions.
Such mutations can be explained by replication of deaminated cytosine (uracil, which is read as thymine) and adenine (hypoxanthine,
which is read as guanine, resulting in A>G•T>C transition). However,
in addition, the T>C mutations identified exhibit transcriptional
strand bias, potentially indicating that some of these mutations arise
from adducts subject to transcription-coupled repair9. The signature
5 mutation rate is high in clear cell and papillary kidney cancers,
which are thought to originate from kidney proximal tubular epithelium, which absorbs metabolites, but low in chromophobe kidney
tumors, which may arise from cells of the cortical collecting duct10.
This observation raises the possibility that continuous exposure to a
ubiquitous metabolic mutagen, which is actively reabsorbed in the
kidney proximal tubule resulting in elevated exposure in these cells,
may underlie signature 5.
In some tumor types, a correlation with age of diagnosis was not
observed, despite substantial numbers of signature 5 mutations (for
example, head and neck cancer, colorectal carcinoma and lung squamous cell carcinoma; Fig. 3), and the absence of correlation is thus
unlikely to be due to limitations of statistical power. One possible
explanation is that the mutational process underlying signature 5 is
substantially activated by other factors during life or as part of the
neoplastic phenotype in some tumor classes, thus obscuring the correlation between signature 5 mutations generated by the clock-like
process and age of diagnosis.
Across tumor types, signature 1 and 5 mutation rates do not closely
correlate with each other (Spearman rank correlation = −0.08 and
P value = 0.63). For example, in medulloblastoma, the signature 1 rate
is 16.1 mutations/Gb/year and the signature 5 rate is 3.0, whereas, in
papillary kidney cancer, the rates for signatures 1 and 5 are −0.3 and
31.9, respectively (Fig. 3 and Table 1). Thus, the biological determinants of the mutation rates of the two processes may be different,
and cell proliferation rate may not be a major factor for signature 5
in contrast to its influence on signature 1.
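
The comparison above relies on the Spearman rank correlation between the two sets of per-cancer-type rates. The sketch below shows one way to compute it, as the Pearson correlation of average ranks, using only six of the slope pairs from Table 1 as example input. Because it uses a small, hand-picked subset, its result is not expected to reproduce the value of −0.08 reported across all cancer types; the function names are illustrative assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Slopes (mutations/Gb/year) for six cancer types, copied from Table 1.
SUBSET = {
    "Stomach cancer": (23.73, 6.04),
    "Colorectal cancer": (23.43, -3.97),
    "Medulloblastoma": (16.16, 3.06),
    "Breast cancer": (3.71, 5.31),
    "Kidney, renal clear cell carcinoma": (0.26, 22.75),
    "Kidney, renal papillary cell carcinoma": (-0.29, 31.86),
}

if __name__ == "__main__":
    signature_1 = [rates[0] for rates in SUBSET.values()]
    signature_5 = [rates[1] for rates in SUBSET.values()]
    print(spearman(signature_1, signature_5))
```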
DISCUSSION
Peering through the ‘cracked lens’ of cancer genomes may obscure or
distort the estimates of clock-like mutation rates for the normal cells
that are progenitors of the cancers. The data originate from dozens
of laboratories, multiple sequencing platforms and many mutation-calling algorithms. They include subclonal mutations, which occur
after the last dominant clonal expansion, to different extents and are
from samples with varying amounts of normal tissue contamination.
The numbers of signature 1 and 5 mutations have been estimated from
mutational catalogs to which multiple other mutational processes
have often contributed and may still be affected by the presence of
these processes, despite extraction by our method. The simple, pragmatic classification of cancer types used is likely, in many instances,
to hide greater complexity of biological subclass, and each subclass
could derive from a distinct type of non-neoplastic cell characterized by different signature 1 and 5 mutation rates. The mutation rate
estimates are based on age of cancer diagnosis as a surrogate for the
age at which the driver mutation initiated the last clonal expansion,
and several years may intervene between these two points in time
(and the length of this interval may differ between tumor types). As
shown in the simulations, if substantial numbers of signature 1 and 5
mutations occur after neoplastic transformation, they could obscure
clock-like processes and affect the estimated mutation rates. Finally,
the profiles of signatures 1 and 5 may be further refined in future, and
this may also affect estimates of mutation rate.
Signatures 1 and 5 demonstrate variability in the numbers of mutations per megabase, even for samples of the same cancer type and
age of diagnosis (Supplementary Data Set 2). Although some of this
variability may be attributable to limitations of the data and analysis
described above, it is also plausible that some reflects biological variation. For example, the rates of the clock-like mutational processes may
vary between individuals depending on environment or lifestyle and
inherited predisposition, and the ancestor cells of some tumors may
acquire mutator phenotypes for signatures 1 or 5 decades before the last
clonal expansion. Future studies will be needed to evaluate the effects
of these factors on the rates of clock-like mutational processes.
Remarkably, despite these multiple muddying influences, the clock-like nature of signatures 1 and/or 5 is visible in most cancer types. The
proposition that these clock-like mutations derive from normal cells
is supported by the observation that the profile of signatures 1 and 5
combined is very similar to the somatic mutational patterns observed
in the small set of non-neoplastic human somatic cells sequenced thus
far11. Moreover, the combination of signatures 1 and 5 also recapitulates the pattern of de novo mutations found in the human germ line
(data from refs. 12–14), and this de novo germline pattern cannot be
parsimoniously generated by other combinations of known mutational signatures (Supplementary Fig. 6).
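
One way to make a statement of this kind concrete is to ask which mixture of two signature profiles best matches a target mutational pattern. The sketch below is a hedged illustration, not the analysis behind Supplementary Figure 6: it assumes cosine similarity as the measure of resemblance, searches mixing weights on a simple grid, and uses made-up four-category profiles in place of real 96-category signatures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mix(profile_a, profile_b, weight_a):
    """Weighted combination of two profiles (weight_a between 0 and 1)."""
    return [weight_a * x + (1 - weight_a) * y
            for x, y in zip(profile_a, profile_b)]


def best_mixture(profile_a, profile_b, target, steps=100):
    """Grid search for the mixing weight that maximizes similarity to target.

    Returns (similarity, weight_a).
    """
    return max(
        (cosine_similarity(mix(profile_a, profile_b, w / steps), target), w / steps)
        for w in range(steps + 1)
    )


if __name__ == "__main__":
    # Hypothetical toy profiles over four mutation categories.
    profile_a = [0.7, 0.1, 0.1, 0.1]
    profile_b = [0.1, 0.2, 0.3, 0.4]
    target = mix(profile_a, profile_b, 0.3)
    print(best_mixture(profile_a, profile_b, target))
```

A grid search is used here only because it is easy to follow; with real profiles, a constrained least-squares fit would be a more usual choice.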
The results therefore provide the first survey of clock-like somatic
mutational processes over a broad range of normal human cell types
and quantification of the mutation rates exhibited by these mutational
processes. They indicate that there are two clock-like mutational signatures, that both signature 1 and signature 5 mutation rates differ
widely between cell types, that the biological factors that determine
these rates are different for signatures 1 and 5, that cell proliferation
rate may be one of the dominant factors influencing the mutation rate
of signature 1 and that signature 5 may be activated by non-clock-like influences and/or as part of the neoplastic process. Despite the
ubiquity of both signatures in normal somatic cells and their likely
presence in the germ line, generating the sequence variation underlying human phenotypes in health and disease, we have hardly any
understanding of the biological processes underlying at least one
of them, signature 5. The signature 1 and 5 mutation rates will be
refined over the next several years by the direct deployment of large-scale sequencing of single normal cells and will provide the basis
for future exploration of the range of mutational processes and their
rates in human cells affected by mutagenic exposures, in precancerous
neoplastic cells and in cells involved in disease processes other than
cancer in which mutation rates may be affected.
URLs. COSMIC Signatures of Mutational Processes in Human Cancer
website, http://cancer.sanger.ac.uk/cosmic/signatures; MathWorks
Mutational Signature Framework, http://www.mathworks.com/matlabcentral/fileexchange/38724.
METHODS
Methods and any associated references are available in the online
version of the paper.
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper.
ACKNOWLEDGMENTS
We would like to thank M.E. Hurles and R. Durbin for early discussions about the
analyses performed. We would like to thank The Cancer Genome Atlas (TCGA),
the International Cancer Genome Consortium (ICGC) and the authors of all
previous studies cited in Supplementary Data Set 1 for providing free access to
their somatic mutational data. This work was supported by the Wellcome Trust
(grant 098051). S.N.-Z. is a Wellcome-Beit Prize Fellow and is supported through
a Wellcome Trust Intermediate Fellowship (grant WT100183MA). P.J.C. is
personally funded through a Wellcome Trust Senior Clinical Research Fellowship
(grant WT088340MA). J.E.S. is supported by an MRC grant to the Laboratory of
Molecular Biology (MC_U105178808). L.B.A. is supported through a J. Robert
Oppenheimer Fellowship at Los Alamos National Laboratory. P.H.J. is supported by
the Wellcome Trust, an MRC Grant-in-Aid and Cancer Research UK (programme
grant C609/A17257). This research used resources provided by the Los Alamos
National Laboratory Institutional Computing Program, which is supported by
the US Department of Energy National Nuclear Security Administration under
contract DE-AC52-06NA25396. Research performed at Los Alamos National
Laboratory was carried out under the auspices of the National Nuclear Security
Administration of the US Department of Energy.
AUTHOR CONTRIBUTIONS
L.B.A. and M.R.S. conceived the overall approach and wrote the manuscript.
L.B.A., P.H.J., S.N.-Z. and M.R.S. carried out signature and/or statistical analyses
with assistance from D.C.W., J.E.S. and P.J.C.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details are available in the online
version of the paper.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
1. Stratton, M.R., Campbell, P.J. & Futreal, P.A. The cancer genome. Nature 458,
719–724 (2009).
2. Alexandrov, L.B. & Stratton, M.R. Mutational signatures: the patterns of
somatic mutations hidden in cancer genomes. Curr. Opin. Genet. Dev. 24, 52–60
(2014).
3. Helleday, T., Eshtad, S. & Nik-Zainal, S. Mechanisms underlying
mutational signatures in human cancers. Nat. Rev. Genet. 15, 585–598
(2014).
4. Alexandrov, L.B. et al. Signatures of mutational processes in human cancer. Nature
500, 415–421 (2013).
5. Nik-Zainal, S. et al. Mutational processes molding the genomes of 21 breast
cancers. Cell 149, 979–993 (2012).
6. Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Campbell, P.J. & Stratton, M.R.
Deciphering signatures of mutational processes operative in human cancer.
Cell Rep. 3, 246–259 (2013).
7. Bell, S.P. & Dutta, A. DNA replication in eukaryotic cells. Annu. Rev. Biochem. 71,
333–374 (2002).
8. Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol.
14, R115 (2013).
9. Fousteri, M. & Mullenders, L.H. Transcription-coupled nucleotide excision repair in
mammalian cells: molecular mechanisms and biological effects. Cell Res. 18,
73–84 (2008).
10. Davis, C.F. et al. The somatic genomic landscape of chromophobe renal cell
carcinoma. Cancer Cell 26, 319–330 (2014).
11. Welch, J.S. et al. The origin and evolution of mutations in acute myeloid leukemia.
Cell 150, 264–278 (2012).
12. Kong, A. et al. Rate of de novo mutations and the importance of father’s age to
disease risk. Nature 488, 471–475 (2012).
13. Michaelson, J.J. et al. Whole-genome sequencing in autism identifies hot spots for
de novo germline mutation. Cell 151, 1431–1442 (2012).
14. Conrad, D.F. et al. Variation in genome-wide mutation rates within and between
human families. Nat. Genet. 43, 712–714 (2011).
ONLINE METHODS
Curation of freely available cancer samples. No data were generated for
this study. Rather, data curation was performed to annotate freely available
cancer genomes. Somatic mutations from 10,250 genome pairs (consisting of
a cancer genome and the genome of a matched normal tissue) were curated.
Of the 10,250 matched normal pairs, 607 had their whole genome sequenced,
whereas 9,643 underwent whole-exome sequencing. Data were retrieved
from three sources: (i) the data portal of The Cancer Genome Atlas (TCGA),
(ii) the data portal of the International Cancer Genome Consortium (ICGC)
and (iii) previously published data. Information for each sample is provided in
Supplementary Data Set 1. The somatic mutations for all samples are freely
available and can be retrieved on the basis of the information provided in
Supplementary Data Set 1.
Filtering mutations, generating mutational catalogs and displaying signatures. This study relies on previously sequenced cancer genomes and on the
bioinformatics analyses that were subsequently used to identify cancer-specific somatic mutations. The data were filtered before analysis as previously described in ref. 4,
and mutational catalogs were generated using Ensembl Core APIs for human
genome build GRCh37.
The prevalence of somatic mutations in each sample was estimated on
the basis of a haploid human genome after filtering as previously described
in ref. 4.
Mutational signatures are displayed according to the observed trinucleotide
frequency of the human genome.
Refined approach for deciphering mutational signatures. The mutational
catalogs of all samples were examined following two steps. Initially, de novo
extraction was performed to derive the set of novel consensus mutational
signatures. Briefly, mutational signatures were deciphered independently
for each of the 36 cancer types using our MATLAB framework4. The
computational framework for deciphering mutational signatures is freely
available for download from MathWorks. The algorithm deciphers the
minimal set of mutational signatures that optimally explains the proportion of each mutation type found in each catalog and then estimates the
contribution of each signature to each mutational catalog. Mutational
signatures were extracted separately for genomes and exomes. Mutational
signatures extracted from exomes were normalized to the trinucleotide
frequency of the human genome. All mutational signatures were clustered using unsupervised agglomerative hierarchical clustering, and a
threshold was selected to identify the set of consensus mutational signatures. Misclustering of signatures was avoided as previously described in
ref. 4. Overall, we identified 33 consensus mutational signatures. Signatures
1 through 28 (note that signature 25 is not found in this data set) were
validated, and these signatures thus most likely reflect biological processes.
Signatures R1 through R3 were previously found in ref. 4 and attributed to
sequencing artifacts. We were not able to perform validation for signatures
U2 through U4 because we did not have access to the respective biological
samples or BAM files.
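As an illustration of the clustering step, the following Python sketch groups extracted signatures by agglomerative clustering and stops merging once the closest clusters are farther apart than a threshold. The cosine distance, the average linkage, the threshold of 0.05 and the four-channel toy vectors are assumptions made for this example; the text does not specify these choices, and the published framework is written in MATLAB.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise cosine distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cluster_signatures(vectors, threshold):
    """Merge the closest pair of clusters until all pairs exceed `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 4-channel "signatures" extracted from different cancer types.
extracted = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.09, 0.11, 0.12, 0.68],
]
consensus_groups = cluster_signatures(extracted, threshold=0.05)
print(consensus_groups)
```

Each resulting group of indices would correspond to one candidate consensus signature, for example the average of its members.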
The de novo extraction was used to identify the set of consensus mutational
signatures across the 10,250 samples examined. This first step of extracting
mutational signatures and generating consensus patterns follows the previously introduced approach in ref. 4. However, our previous methodology did
not use consensus mutational signatures to evaluate their contributions in
each sample, thus not allowing accurate comparison of mutation rates between
different cancer types. To address this limitation, we refined our approach by
introducing another step of analysis, which is focused on accurately estimating the numbers of mutations associated with each consensus signature in
each sample. We usually refer to this number of somatic mutations as either
the ‘contribution’ of a mutational signature or the ‘exposure’ to a mutational
signature. Calculating the contributions of all mutational signatures was performed by estimating the number of mutations associated with the consensus
patterns of the signatures of all operative mutational processes in each cancer sample. This approach allows direct comparison between cancer types
because identical signatures were used to estimate the contributions in each
cancer type. More specifically, all consensus signatures were examined as a set
P containing 33 vectors

$$P = \left\{ \begin{bmatrix} p_{1}^{1} \\ \vdots \\ p_{1}^{96} \end{bmatrix}, \begin{bmatrix} p_{2}^{1} \\ \vdots \\ p_{2}^{96} \end{bmatrix}, \ldots, \begin{bmatrix} p_{32}^{1} \\ \vdots \\ p_{32}^{96} \end{bmatrix}, \begin{bmatrix} p_{33}^{1} \\ \vdots \\ p_{33}^{96} \end{bmatrix} \right\}$$

where each of the vectors is a discrete probability density function reflecting a consensus mutational signature. The 96 non-negative components of
each vector correspond to mutation types (substitutions and their immediate sequence context) of the signatures. The contributions of the signatures
were estimated independently for each of the 10,250 samples with a subset of
consensus mutational signatures. For each sample, the estimation algorithm
consists of finding the minimum of the Frobenius norm of a constrained linear
function (see below for constraints) for a set of vectors $\vec{S}_{1 \ldots q}$, $q \le 33$, belonging
to the subset Q, where $Q \subseteq P$:

$$\min \left\lVert \vec{M} - \sum_{i=1}^{q} \left( \vec{S}_i \times E_i \right) \right\rVert_F^2 \qquad (1)$$

Q is determined on the basis of the known operative mutational processes in
the cancer type of the examined sample from the signature extraction process
described above. For example, for any neuroblastoma sample, Q will contain
signatures 1, 5 and 18 because these are the only known signatures of mutational processes
operative in neuroblastoma (Supplementary Data Set 2). In
equation (1), $\vec{S}_i$ and $\vec{M}$ represent vectors with 96 non-negative components
reflecting, respectively, a consensus mutational signature and the mutational
catalog of a sample. Hence, $\vec{S}_i \in \mathbb{R}_{+}^{96}$ and $\vec{M} \in \mathbb{N}_{0}^{96}$. Further, both
vectors have known numerical values either from the de novo extraction
($\vec{S}_i$) or from generating the original mutational catalog of the sample ($\vec{M}$).
In contrast, $E_i$ corresponds to an unknown scalar reflecting the
number of mutations contributed by signature $\vec{S}_i$ in the mutational catalog $\vec{M}$.
Minimization of equation (1) is performed under several biologically meaningful constraints. The set of vectors in the examined set Q is constrained on
the basis of previously identified biological features of the consensus mutational signatures. For example, consensus signature 6 causes high levels of
indels at mono- and polynucleotide repeats4. Thus, this mutational signature
will be excluded from the set Q when the mutational catalog of an examined
sample has only a few such indels. Similarly, there are signatures associated
with other types of indels, transcriptional strand bias, dinucleotide mutations, hypermutator phenotypes, etc., and these signatures are included in
the set Q only when the sample in question exhibits one or more of these
features. Lists of features associated with mutational signatures can be found
in ref. 4. In addition to sample-specific constraints to the set Q, equation
(1) was universally constrained with regard to the parameter Ei. These
constraints can be mathematically expressed as $0 \le E_i \le \lVert \vec{S}_i \rVert_1$, $i = 1 \ldots q$, and
$\sum_{i=1}^{q} E_i = \lVert \vec{S}_i \rVert_1$. All results for the contributions of all operative signatures
in all samples from the hitherto described approach are provided in
Supplementary Data Set 2, and the original somatic mutations can be found
using Supplementary Data Set 1.
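The estimation step can be sketched in Python as a small non-negative least-squares fit. This is a minimal illustration and not the published MATLAB implementation: the multiplicative-update solver, the function name `estimate_exposures`, the final rescaling so that the exposures sum to the number of mutations in the catalog, and the four-channel toy signatures (real signatures have 96 channels) are all assumptions.

```python
def estimate_exposures(catalog, signatures, n_iter=500, eps=1e-12):
    """Fit catalog ~ sum_i signatures[i] * E[i] with E[i] >= 0.

    Uses multiplicative non-negative least-squares updates, a generic
    stand-in solver (the text does not name the optimizer that was used).
    """
    q = len(signatures)
    total = float(sum(catalog))
    if total == 0.0:
        return [0.0] * q
    exposures = [total / q] * q  # start from an even split
    # Precompute S^T M and S^T S.
    st_m = [sum(s * m for s, m in zip(sig, catalog)) for sig in signatures]
    st_s = [[sum(a * b for a, b in zip(si, sj)) for sj in signatures]
            for si in signatures]
    for _ in range(n_iter):
        denom = [sum(st_s[i][j] * exposures[j] for j in range(q))
                 for i in range(q)]
        exposures = [exposures[i] * st_m[i] / (denom[i] + eps)
                     for i in range(q)]
    # Assumed constraint: exposures sum to the mutations in the sample.
    scale = total / sum(exposures)
    return [e * scale for e in exposures]

# Hypothetical subset Q with two toy signatures (each sums to 1).
subset_q = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Toy catalog built as 60 mutations from the first signature plus 40 from the second.
toy_catalog = [46, 10, 10, 34]
toy_exposures = estimate_exposures(toy_catalog, subset_q)
print([round(e, 3) for e in toy_exposures])
```

Because the toy catalog is an exact mixture of the two signatures, the fit recovers exposures of about 60 and 40 mutations. Restricting the fit to the subset Q, as described above, amounts to passing only the signatures known to operate in the cancer type of the sample.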
Factors influencing signature extraction. We have previously simulated
data to describe a plethora of factors influencing the accuracy of mutational
signatures extraction6. Such factors include the number of available samples, the number of somatic mutations in a sample, the number of mutations
contributed by different mutational signatures and the similarity between
the patterns of the signatures of mutational processes operative in cancer
samples, as well as the computational limitations of our framework.
Nevertheless, in the past 3 years, our framework has proven robust and has
described multiple similar and validated signatures across the spectrum of
human cancer3–5,15–22.
doi:10.1038/ng.3441
Patterns of signatures 1 and 5. In a previous analysis4, we extracted 21 mutational signatures and noted that signatures 1A and 1B correlate with age of
diagnosis for some cancer types. Further, we noted that, because signatures
1A and 1B “are almost mutually exclusive among tumour types they probably
represent the same underlying process, with signature 1B representing less
efficient separation from other signatures in some cancer types” (ref. 4). In
our previous report, we referred to these two signatures as a single signature
termed signature 1A/B. In the current analysis, encompassing ~50% more data
and a refined algorithm, signature 1A is found in more cancer types, including
some in which signature 1B was seen previously. A detailed examination of the
pattern of signature 1B showed that this mutational signature is a linear combination of signatures 1A and 5. More specifically, a combination of signatures
1A and 5 can be used to account for 0.97 of the pattern of signature 1B (where
1.00 is perfect correlation), and no other combination of signatures can be used
to explain signature 1B. Thus, in the current manuscript, signature 1B is no
longer referred to, and, in cancer types from which it was extracted, signatures
1A and 5 have been reintroduced to assess mutation contributions.
Robustness and reproducibility of mutational signatures. In this analysis, we
use an elaborated version of our framework for extracting mutational signatures and apply it to a much larger data set. Comparison between the set of previously extracted mutational signatures4 and the set of mutational signatures
found here shows both stability and reproducibility of mutational signatures.
This reproducibility can be observed by comparing the mutational signatures
on the COSMIC signatures website with the ones from ref. 4. Further, the
similarity can also be quantified using a cosine similarity as previously done in
ref. 6. The cosine similarity between any combination of signatures that were
derived in ref. 4 and also found in this analysis (that is, signatures 1 through
21) is more than 0.97, where a similarity of 1.00 is an exact match.
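The comparison by cosine similarity can be sketched as follows. The helper names, the dictionary layout and the four-channel toy vectors are hypothetical and only illustrate the calculation; real signatures have 96 components.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 is an exact match)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matches(previous, current):
    """Map each previous signature to its most similar current signature."""
    matches = {}
    for name, old in previous.items():
        best = max(current, key=lambda k: cosine_similarity(old, current[k]))
        matches[name] = (best, cosine_similarity(old, current[best]))
    return matches

# Hypothetical signature sets from an earlier and a later analysis.
previous_set = {"A": [0.70, 0.10, 0.10, 0.10], "B": [0.10, 0.10, 0.10, 0.70]}
current_set = {"1": [0.68, 0.12, 0.10, 0.10], "2": [0.09, 0.11, 0.12, 0.68]}

matched = best_matches(previous_set, current_set)
for name, (partner, similarity) in matched.items():
    print(name, partner, round(similarity, 4), similarity > 0.97)
```

In this toy example both earlier signatures are matched to a later signature with a similarity above the 0.97 level quoted in the text.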
Statistical analysis of relationships between age and mutations. Global
analysis was performed for the 33 mutational signatures identified across all
samples in all cancer types. Zero mutations were attributed to all signatures
that were not found in a sample. The heteroscedasticity of the data and the presence of
outliers mandate the use of an appropriate statistical approach23. We leveraged robust linear regression to evaluate linear dependencies between the
numbers of mutations associated with each mutational signature across all
samples examined and the ages of cancer diagnosis of these samples. The
calculated P values from the applied robust regression were corrected for
multiple-hypothesis testing using the Benjamini-Hochberg procedure. Only
signatures 1 and 5 exhibited statistically significant correlation (q value < 0.05)
with age of cancer diagnosis.
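For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal Python sketch of the adjustment is given below. The function name and the four example P values are illustrative assumptions, not values from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical P values for four signatures.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
print(example_q, significant)
```

A signature is then called significant when its q value falls below the chosen threshold, here 0.05 as in the text.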
Each cancer type was examined independently for a linear dependence
between the ages of cancer diagnosis for the curated samples in that cancer
type and the numbers of mutations attributed to each of the signatures of the
operative mutational processes in that cancer type. Because most traditional
or generalized linear regression approaches are very sensitive to outliers23 and
because many of the cancer samples examined were hypermutators (outliers),
we leveraged a robust regression model. The robust regression iteratively
reweights least squares with a bi-square weighting function and overcomes
some (if not most) of the limitations of traditional approaches24–26. Similarly,
we report results using Spearman’s rank correlation coefficient because it is
more robust to data outliers when compared to Pearson’s product-moment
correlation coefficient27. It should be noted that, although samples with missing information about their age of cancer diagnosis were excluded from this
analysis, these samples were used in the de novo extraction of mutational
signatures and subsequent estimation of the signatures’ contributions.
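The effect of bi-square reweighting on a hypermutated outlier can be illustrated with a simplified Python sketch. It is not the Supplementary Software: the tuning constant of 4.685 and the 0.6745 scale factor are conventional defaults assumed here, no leverage correction is applied, and the ages and mutation counts are invented for the example.

```python
import statistics

def weighted_line_fit(x, y, w):
    """Weighted least-squares slope and intercept of y on x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def robust_line_fit(x, y, tuning=4.685, n_iter=50):
    """Iteratively reweighted least squares with Tukey's bisquare weights."""
    weights = [1.0] * len(x)
    slope, intercept = weighted_line_fit(x, y, weights)  # ordinary fit to start
    for _ in range(n_iter):
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        scale = statistics.median(abs(r) for r in resid) / 0.6745
        if scale < 1e-9:  # essentially perfect fit; nothing left to reweight
            break
        cutoff = tuning * scale
        weights = [(1 - (r / cutoff) ** 2) ** 2 if abs(r) < cutoff else 0.0
                   for r in resid]
        slope, intercept = weighted_line_fit(x, y, weights)
    return slope, intercept

# Invented data: about 2 mutations per year, plus one hypermutator at age 75.
ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
burden = [62, 70, 82, 90, 102, 110, 122, 130, 142, 1000]

ols_slope, _ = weighted_line_fit(ages, burden, [1.0] * len(ages))
robust_slope, _ = robust_line_fit(ages, burden)
print(round(ols_slope, 2), round(robust_slope, 2))
```

In this example the ordinary least-squares slope is pulled strongly upward by the single hypermutator, whereas the reweighted fit stays close to the underlying rate of 2 mutations per year.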
Each signature was examined separately in each cancer type in which that
signature was identified. Examination was based on a robust linear regression
model that estimates the slope of the line and whether this slope is significantly
different from a horizontal line with a slope of zero (F test, P value <0.05) as
well as by calculating the Spearman’s rank correlation coefficient. Although
robust linear regression models provide confidence intervals and P values,
we decided to take a more conservative approach and report the results after
bootstrapping the data. Bootstrapping (random sampling with replacement)
was used to derive measures of accuracy: the best fit for the slope and the
slope’s 95% confidence interval. In total, we performed 100,000 bootstrapping
iterations per signature per cancer type (total of ~$2 \times 10^{7}$ iterations). Each of
the iterations for which the robust regression returned a P value <0.05 was
considered statistically significant, whereas iterations with P value ≥0.05 were
considered not statistically significant. The overall P value per signature per
cancer type was calculated as follows:

$$\frac{\text{Number of non-significant iterations} + 1}{100{,}000 + 1}$$

It should be noted that the number of iterations limits the minimum possible P value, in this case, $9.99 \times 10^{-6}$, and P values reported to be equal
to $9.99 \times 10^{-6}$ are most likely lower. The reported P values and confidence
intervals are the ones after applying the bootstrapping procedure. MATLAB
code for calculating the P values across individual cancer types is provided as
Supplementary Software.
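The bootstrap P value defined above can be sketched in Python as follows. This is an illustration and not the Supplementary Software: `p_value_fn` is a hypothetical callback standing in for the robust regression test, and `toy_slope_test` is a deliberately crude placeholder (it only checks the sign of an ordinary least-squares slope) so that the example runs on its own.

```python
import random

def overall_p_value(n_non_significant, n_iterations):
    """(non-significant iterations + 1) / (iterations + 1), as in the text."""
    return (n_non_significant + 1) / (n_iterations + 1)

def bootstrap_p_value(pairs, p_value_fn, n_iterations=1000, alpha=0.05, seed=0):
    """Resample (age, mutations) pairs with replacement and count failures."""
    rng = random.Random(seed)
    n_non_significant = 0
    for _ in range(n_iterations):
        resample = [rng.choice(pairs) for _ in pairs]
        if p_value_fn(resample) >= alpha:
            n_non_significant += 1
    return overall_p_value(n_non_significant, n_iterations)

def toy_slope_test(resample):
    """Placeholder test: 'P value' of 0.0 if the slope is positive, else 1.0."""
    n = len(resample)
    x_bar = sum(x for x, _ in resample) / n
    y_bar = sum(y for _, y in resample) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in resample)
    if sxx == 0.0:
        return 1.0
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in resample)
    return 0.0 if sxy / sxx > 0 else 1.0

# Invented, strictly increasing (age, mutations) pairs.
toy_pairs = [(30, 62), (35, 70), (40, 82), (45, 90), (50, 102),
             (55, 110), (60, 122), (65, 130), (70, 142), (75, 151)]
demo_p = bootstrap_p_value(toy_pairs, toy_slope_test, n_iterations=1000)
print(demo_p, overall_p_value(0, 100000))
```

The second printed value shows why the smallest reportable P value with 100,000 iterations is just under $10^{-5}$: even when every iteration is significant, the numerator is 1.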
The results of estimating the line’s slope by robust regression and the
Spearman’s rank correlation coefficient for each of the signatures in each
of the cancer types can be found in Supplementary Data Set 3. As before,
we have used P value <0.05 to identify statistically significant dependencies.
However, this cutoff is arbitrary, and summarized results using different cutoff
values are shown in Supplementary Figure 2. Examination of individual cancer types is based on the hypothesis that signatures 1 and 5 are the only signatures reflecting the activity of clock-like mutational processes. This hypothesis
was constructed by examining the activity of all signatures across all cancer
types (signature 1, FDR-corrected for all 33 signatures q value = $4.7 \times 10^{-162}$;
signature 5, FDR-corrected for all 33 signatures q value = $2.1 \times 10^{-46}$). Thus, for
our analysis of individual cancer types, the P values reported in the main manuscript have not been corrected for multiple-hypothesis testing. Nevertheless,
for consistency, we have provided P values corrected for multiple-hypothesis
testing using the Benjamini-Hochberg procedure for each cancer type in
Supplementary Data Set 3. It should be noted that using FDR-corrected
P values to evaluate the significance of the analysis does not affect the overall
message of the manuscript.
Lastly, we also evaluated (following the hitherto described approach)
whether there is a linear dependency between the total numbers of somatic
mutations and/or the C>T mutations at CpG dinucleotides and the ages of cancer diagnosis. Similarly, this examination was done separately for each of the
cancer types, and the results from the analysis can be found in Supplementary
Data Set 5.
Evaluating the robustness and limitations of the analysis performed with
simulated data. A myriad of known and unknown processes may be affecting
the analyses performed. Some of these include data generation by different
institutes and laboratories, contamination by subclonal mutations, endogenous
or exogenous factors affecting the rates of signatures 1 and 5, inaccuracies of
the patterns of signatures 1 and 5, mutations generated during the developmental and/or neoplastic phases, limitations of the signature extraction algorithm, small numbers of samples and/or somatic mutations, misannotation
of samples, etc. In principle, quantifying the overall error introduced by even
a subset of these processes is impractical.
To evaluate the robustness and limitations of our analysis, we simulated data
with two types of mixture noise: (i) noise affecting the bona fide somatic mutations associated with a clock-like signature of a mutational process operative
in a sample and (ii) noise affecting the age of cancer diagnosis of a sample. It
was assumed that the mixture of all factors affecting the bona fide number of
mutations associated with a clock-like signature in a sample reflects a mixture of random processes and thus can be approximated by white additive
Gaussian noise. Further, folded normal Gaussian noise (half-normal distribution) with a mean value of 2 years and s.d. of 4 years was added to the age
of cancer diagnosis of a sample. This noise reflects average cancer detection
within 2 years of neoplastic initiation with cancers detected in 84% and 98% of
patients within 6 and 10 years, respectively. The distribution is half-bounded,
as a cancer cannot be detected before it has occurred.
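The two noise components described above can be sketched for a single simulated sample. Several details are assumptions made for this illustration: the noise percentage is taken as the standard deviation of the Gaussian noise relative to the genuine mutation count, the 2-year mean and 4-year s.d. are taken as the parameters of the normal distribution before folding, negative mutation counts are clipped to zero, and the function name `simulate_sample` is hypothetical.

```python
import random

def simulate_sample(true_age, rate_per_year, noise_fraction, rng):
    """Return (observed_age, observed_mutations) for one simulated cancer."""
    genuine = rate_per_year * true_age
    # White additive Gaussian noise; s.d. scaled to the genuine count (assumed).
    noisy_mutations = genuine + rng.gauss(0.0, noise_fraction * genuine)
    # Folded normal delay in years: absolute value of a N(2, 4) draw (assumed).
    delay = abs(rng.gauss(2.0, 4.0))
    # Clipping at zero is an assumption; counts cannot be negative.
    return true_age + delay, max(noisy_mutations, 0.0)

rng = random.Random(42)
# Hypothetical cohort: initiation at age 50, 20 mutations per year, 35% noise.
simulated = [simulate_sample(50.0, 20.0, 0.35, rng) for _ in range(5)]
for observed_age, observed_mutations in simulated:
    print(round(observed_age, 1), round(observed_mutations, 1))
```

Because the delay is folded, the observed age of diagnosis is never earlier than the simulated age of initiation.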
Clock-like mutational signatures were simulated in 100 cancers. The ages
of diagnosis of the cancers were sampled with replacement from the data in
Supplementary Data Set 1, and the mutational rates per year per gigabase
(the slope) were taken from a uniform distribution between the minimum
and maximum statistically significant rates detected by the analysis performed (Supplementary Data Set 3). In total, 17 simulation scenarios were
performed, each with different percentages of white additive Gaussian noise
(Supplementary Fig. 7). The noise to bona fide somatic mutations was varied
between 0% and 200%, where 0% reflects no noise and 200% corresponds to
twice as much noise as compared to genuine somatic mutations. Note that
most cancer genomics papers report sensitivity and specificity rates of more
than 90%, and thus the false positive rates derived from our simulations are
probably overly pessimistic. In all scenarios, the noise added to the ages of
diagnosis followed the hitherto described folded normal distribution. Each
simulation scenario was repeated 1,000 times, and the simulated data were
analyzed to identify clock-like mutational signatures in exactly the same way
as the experimental data used in this study. Any iteration with a statistically
significant P value for a slope (P value <0.05) in which the simulated slope
was within ±10% of the derived slope or within the 95% confidence intervals of
the derived slope was considered a genuine detection and, thus, a true positive
result. In contrast, any other iteration with a statistically significant P value
for a slope (P value <0.05) that did not satisfy the abovementioned conditions
was considered a false positive result.
The results from the 17 scenarios performed showed that, when noise levels
are less than 35%, our analysis is able to find the genuine slopes in ~90% of the
iterations while yielding no more than 0.55% false positives (Supplementary
Fig. 7). Increasing the noise levels does not increase the number of false positives but rather reduces the number of genuinely detected slopes (true positives). Our simulations indicate that the confidence intervals of the majority of
detected slopes include the genuine slope of a clock-like mutational signature,
whereas the approach used yields few false positives.
Displaying age of diagnosis and clock-like mutational signatures. Linear
relationships between the ages of cancer diagnosis and the mutations attributed to mutational signatures are displayed only for signatures 1 and 5, as
no other mutational signature displayed statistically significant correlations
(Supplementary Figs. 8–43 and Supplementary Data Set 3). These linear
relationships are displayed both for the average mutational burden attributed
to a signature (Fig. 3) and for all individual mutational burdens attributed to
a signature (Supplementary Fig. 3). In both cases, the displayed slopes and
their confidence intervals are those derived by the hitherto described analysis
and do not depend on the choice of depiction. For brevity, in Figure 3, linear relationships are displayed for only 27 of the 36 cancer types analyzed.
Nevertheless, the data provided in Supplementary Data Sets 2 and 3 can be
used to display all linear relationships.
15. Behjati, S. et al. Genome sequencing of normal cells reveals developmental lineages and mutational processes. Nature 513, 422–425 (2014).
16. Bolli, N. et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nat. Commun. 5, 2997 (2014).
17. Ju, Y.S. et al. Origins and functional consequences of somatic mitochondrial DNA mutations in human cancer. eLife 3 (2014).
18. Murchison, E.P. et al. Transmissible [corrected] dog cancer genome reveals the origin and history of an ancient cell lineage. Science 343, 437–440 (2014).
19. Nik-Zainal, S. et al. Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer. Nat. Genet. 46, 487–491 (2014).
20. Gerlinger, M. et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat. Genet. 46, 225–233 (2014).
21. Yates, L.R. et al. Subclonal diversification of primary breast cancer revealed by multiregion sequencing. Nat. Med. 21, 751–759 (2015).
22. Wagener, R. et al. Analysis of mutational signatures in exomes from B-cell lymphoma cell lines suggest APOBEC3 family members to be involved in the pathogenesis of primary effusion lymphoma. Leukemia 29, 1612–1615 (2015).
23. Barnett, V. & Lewis, T. Outliers in Statistical Data (Wiley, 1994).
24. Holland, P.W. & Welsch, R.E. Robust regression using iteratively reweighted least-squares. Comm. Stat. Theory Methods A6, 813–827 (1977).
25. Huber, P.J. & Ronchetti, E. Robust Statistics (Wiley, 2009).
26. Street, J., Carroll, R. & Ruppert, D. A note on computing robust regression estimates via iteratively reweighted least squares. Am. Stat. 42, 152–154 (1988).
27. Abdullah, M.B. On a robust correlation coefficient. Statistician 39, 455–460 (1990).