Molecular Genetics and Metabolism 2019
Multigene signatures of responses to chemotherapy derived by biochemically inspired
machine learning
Peter K. Rogan, Ph.D.
Departments of Biochemistry, Oncology, and Computer Science
University of Western Ontario
London, ON N6A 2C1
(519) 661-4255
progan@uwo.ca
Category: Mini-review
1
Abstract
Pharmacogenomic responses to chemotherapy drugs can be modeled by supervised machine
learning of expression and copy number of relevant gene combinations. Such biochemical
evidence can form the basis of derive gene signatures using cell line data, which can subsequently
be examined in patients that have been treated with the same drugs. These gene signatures typically
contain elements of multiple biochemical pathways which together comprise multiple origins of
drug resistance or sensitivity. The signatures can capture variation in these responses to the same
drug among different patients.
Keywords
Chemotherapy Response, Machine Learning, Gene Signatures, Breast Carcinoma, Bladder
Carcinoma, Cancer Cell Lines, Patient Validation, Gene Expression, Copy Number
2
Background
Current pharmacogenetic analysis relates genotypes of various individual genes to their impact
on transport, biotransformation, or disposition of drugs in patients. However, in cancer
chemotherapy, unrelated gene products significantly contribute to the overall cellular and clinical
responses by impacting elements of other biochemical pathways that respond to these agents in
tumours (1). The effects of multiple genes, termed gene signatures, have been used to predict
chemotherapy response in cell lines using differential gene expression (DGE) as well as machine
learning techniques (ML; 2,3). These involve measurements of combinations of expression (GE)
and/or DNA copy number (CN) levels as surrogates for cancer cell growth (4-7; Fig. 1).
Machines learn to classify by means of loss functions. This method evaluates how well a specific
algorithm models the given data. If predictions deviate too much from the actual results in the
training data, the loss function generates a large number. Hinge loss is one type of loss function
that maximizes the impact or weight of training data distant from the threshold that distinguishes
the drug sensitive from resistant classes of patients or cell lines using support vector
machines (SVMs). When the same type of loss among many ML models (each consisting of a
different combination and weights of genes), a lower loss indicates a better predictive model.
DGE gene signatures for drug response have been based on on the average differences in gene
co-expression between sensitive and resistant tumor tissues among a set of patients (9). These
signatures have traditionally selected genes based on the largest overall changes in expression
levels among 2 (or more) groups of patients (for example, complete pathological remission vs
recurrent disease). For a selected gene, the variance among members of the same group can be
large, resulting in overlap in the expression levels between the groups. While DGE maximizes
differences between the mean expression levels of the groups, there is often considerable overlap
in overall expression over the quartiles adjacent the mean, resulting in only a subset of
individuals exhibiting significant differences between classes. By contrast, highly weighted
genes in the best performing ML models can exhibit a lower degree of overlapping expression
between sensitive and resistant categories, due to lower coefficients of variation among these
classes. Another distinction between DGE and ML approaches is that in DGE, while individual
expression values share information with chemotherapy response, many of these genes may
3
reveal similar information, and are often redundant in the signatures themselves. The process of
selecting genes, ie. features, for ML-based models attempts to minimize redundant features. This
reduces a source of noise in the data and mitigates against overfitting of the resultant signature to
a particular dataset of cell lines or tumors.
Direct comparisons of DGE and the ML models, can be challenging because the impact of different
gene combinations is not additive in non-linear models. Correlation of statistical test results with
weights of ML model features may impacted by the order of gene removal when determining
misclassification accuracy corresponding to the weights of individual genes. Equally weighted
genes in the DGE signatures can cause genes with significantly overlapping expression levels to
dilute the contributions of those with discrete distributions, affecting the overall performance of
these signatures in discriminating between the classes of drug response.
The effectiveness of adjuvant chemotherapy agents has been related to changes in the profile of
tumor gene expression (10-15). Genomic signatures for chemotherapy (CT) response using
supervised ML for breast cancer (BRCA) have been developed by discriminating sensitive from
resistant cancer cell lines at the median concentration of drug that inhibits growth of these lines by
half (GI50; 3, 16-19). GI50 is an excellent measure of drug effectiveness and proxy for clinical
outcomes, since it is a holistic approach that incorporates effects on direct targets as well as
comprehensive cellular response, including other biochemical pathways (16). This output is what
is effectively being modeled based on GE and/or CN using biochemically-inspired ML
approaches.
Genes with biological relevance to drug response were identified based on evidence in published
literature and public cancer drug databases. GE and/or CN values were then filtered using available
BRCA and bladder cancer (BLC) cell line microarray or next generation sequencing data by
Multiple Factor Analysis (MFA), a statistical method, similar to principal component analysis, that
quantifies relationships between GI50 and other variables. The genes eligible to be selected as ML
features have GE and/or CN values that either correlate directly or inversely with GI50. One
optimal set of gene features classifies cancer cell lines as either resistant or sensitive to drugs using
their respective GE and/or CN values (4). In general, the median GI50 theshold of cell lines has the
4
highest positive predictive value for distinguishing between these classes (5). ML models select
the GE and CN features and weights with the lowest possible classification error of drug resistance
and sensitivity (20, 21) by backwards feature selection. ML approaches for predicting drug
response have included: (a) Support Vector Machines (SVM; 22), which selects genes that best
discriminate classes along a multidimensional hyperplane consisting of gene features, (b) Random
Forest classifier (RF; 23), a type of decision tree that votes for the most frequently selected subsets
of genes (5, 15) that separate the classes, and (c) a Minimum Redundancy Maximum Relevance
(mRMR; 24) wrapper that incrementally adds features that maximize the mutual information
between gene features and classes, while keeping the redundancy between gene features at a
minimum level (6). CT signatures have also been derived by Simulated Annealing (SA; 25), which
minimizes errors in ML models by iteratively selecting gene combinations that incrementally
improve classification accuracy. The performance of derived signatures are also compared with
other published gene signatures as well as models derived from randomly selected genes - to assess
significance of the null hypothesis (4).
Application of cell line-based gene signatures to CT patients.
ML signatures (2) were used for prognosis of BRCA tumor response to paclitaxel and
gemcitabine, which was more accurate than previous hierarchical clustering of mean differences
in expression (26). Gene signatures were derived by either by backward, forward and complete
feature selection with SVMs (5), then tested by cross-validation with cell lines excluded from the
original signatures. Cell line-derived signatures were validated for paclitaxel, tamoxifen,
methotrexate, 5-fluorouracil, epirubicin, and doxorubicin with a 2000 BRCA patient dataset (3,
27). These SVM and RF signatures were also assessed for their ability to provide prognosis of
response, using thresholds based on the duration of progression free survival (PFS) of patients.
This approach retrospectively identified gene signatures that may have been useful in guiding
selection of specific CT agents for treatment (3). These models exhibited higher accuracy than
traditional signatures based upon DGE. ML signatures for particular chemotherapeutic drugs
performed best on patient datasets where the patients received that specific chemotherapeutic
drug. However, cross-application of ML models for individual drugs in patients with other types
of cancer were found to be less accurate for prognosis (4), as well as in patients receiving drug
combinations (5).
5
The performance of other ML approaches were evaluated for the MFA-derived gene set at different
PFS or time to tumor progression (TTP) thresholds. mRMR feature selection of a paclitaxel-based
SVM classifier based on expression of ABCB11, ABCC1, BAD, BBC3 and BCL2L1 was 79%
accurate in 53 CT patients. By contrast, an RF classifier produced a gene signature (ABCB11,
ABCC1, BAD, BCL2, CYP2C8, CYP3A4, MAP4, MAPT, NR1I2, TUBB1, GBP1, OPRK1) that
gave prognoses of >3 year survival with 82.4% accuracy in 420 hormone therapy (HT) patients.
A similar RF gene signature showed 79.6% accuracy in 504 patients treated with CT and/or HT.
These results suggest that tumor gene expression signatures refined by ML techniques can be
useful for prognosis of PFS and TTP after drug therapies.
Example: Gene signature of cisplatin response.
Cisplatin covalently cross-links adjacent purines in the genome, which elicits increased DNA
damage (28) and anti-oxidant scavenging responses at the highest levels of resistance. There are
differences in expression of G1/S DNA Damage Checkpoint, Base Excision Repair, and
Nucleotide Excision Repair genes that distinguish sensitive and resistant BLC cell lines treated
with cisplatin. MFA for cisplatin on BRCA and BLC cell lines (Fig. 2), shows GI50 to be strongly
correlated with GE of ATP7B, BCL2L1, CDKN2C, GSTP1, MSH2, MAP3K1, MT1A, MT2A, MT4,
NFKB2, SLC22A5, SLC22A7, SLC22A11, SLC22A12, SLC22A15, SLC31A2, SNAI1, and TLR4,
and gene copy number of CFLAR, FOS, and NFKB1. GE of GSTO1, , MAPK13, MT3, PPKAA2,
PRKCA, PRKCB, SLC22A10, SLC22A13 and TP63 and CN of MAPK3 and SLC22A20 were also
correlated, but at lower significance. Results were consistent with published studies (29-31),
however many candidate genes previously associated with cisplatin response were not correlated
based on GI50. The ML-based gene signature was 95% accurate in classifying 41 cell lines with
the median GI50 as the threshold distinguishing sensitivity from resistance.
An SVM signature containing DNA repair genes: CHEK2, NEIL1, PNKP, POLD1, POLD2,
POLD3, POLE, POLR2H, PSMA2, PSMC2, and RFC2 expression was used to predict prognosis
to segregate a set of 30 BLC patients with advanced disease (NCBI GEO database: dataset
GSE5287) (5). Long term survival was clustered according to good (Cluster 1), intermediate
6
(Cluster 2), and poor (Cluster 2.1). These results were consistent for an independent set of BLC
patients (GEO: GSE31684; Fig. 3).
Previously, the threshold that distinguish drug resistance and sensitivity was the median GI50 value,
which consistently has among the highest positive predictive value in different patient datasets
(5,6). However, at different GI50 thresholds, signatures are obtained that can preferentially
distinguish the genes contributing to the highest vs. lowest levels of drug resistance. GI50thresholded ML models were derived by minimizing either misclassification or log-loss function
to evaluate the distribution of selected genes and model accuracy. Log-loss penalizes false
classifications, whose value ranges from zero (or completely accurate) to 1 (or completely
inaccurate). The overall distribution of genes across various GI50 thresholds exhibited both
similarities and differences with gene signatures derived by minimizing classification errors at the
median GI50 threshold by backwards feature selection. However, varying these thresholds
produces imbalances between the numbers of sensitive and resistant cell lines, which can affect
the performance of ML models at extreme GI50 thresholds (32, 33). An important question is
whether the genes contributing to drug responses are consistent among different cell lines, each
with their own unique GI50 values. Different ML gene signatures were obtained by shifting the
GI50 threshold, which changed the labels of resistant and sensitive cell lines. After feature
selection, the compositions of the corresponding gene signatures for each threshold were
compared. Finally, ensemble averaging of all of these optimized SVMs, each derived for different
GI50 thresholds, was used to create a single aggregated, threshold-independent signature with fewer
independent features (i.e., a composite gene signature). Aggregated, threshold-independent models
generated for each of the platin drugs at different GI50 thresholds, ie. by ensemble machine
learning, classified bladder cancer patients with similar accuracy (50-63% disease free; 48-73%
recurrent). Although the compositions of the component GI50-thresholded signatures emphasize
different genes and pathways, their overall performance for discriminating sensitivity from
resistance (between 12-24 months post-treatment) were similar across the set of patient data.
Kinase genes (MAPK3 and MAP3K1) and apoptotic family members (BCL2, BCL2L1) were the
most consistently represented in different cis-platin signatures at multiple GI50 thresholds,; error7
prone and base-excision DNA repair genes were also present, but were less frequent (Fig. 4). The
kinase genes were more concentrated in signatures associated with increased sensitivity to the
drug, whereas BCL2 and BCL2L1 were more ubiquitous and found at all threshold levels. The
error prone polymerases POLD1 and POLQ were more frequently detected in gene signatures with
lower sensitivity thresholds, while the flap endonuclease FEN1 tended to be present in signatures
that distinguished high resistance levels. By contrast, thresholded gene signatures for carboplatinrelated genes commonly contained the apoptotic family member AKT1, transcription regulation
genes ETS2 and TP53, as well as cell growth factors VEGFB and VEGFC, although the latter were
less common at lower sensitivity thresholds. Common oxaliplatin-related genes included the
transporters SLCO1B1 and GRTP1 (but not SLC47A1), transcription-related genes NFE2L2,
PARP15 and CLCN6, and metabolism-related genes. These analyses showed certain pathways that
contributed to higher levels of resistance, whereas others are prevalent at lower levels of
sensitivity.
Rational expansion of gene candidates in signatures
The best ML signatures correctly identify the CT responses of the majority of cell lines and patients
according to GI50 threshold; however, some individuals remain misclassified. Aside from
limitations in the computational approach, it is also reasonable to consider limitations on existing
knowledge regarding the complement of genes associated with drug responses, i.e. the published
literature is an incomplete source of information about all of the gene products that affect CT
response.
We have described ML approaches that derive multigene signatures that predict drug response
containing gene features that are weighted to optimally distinguish sensitive from resistant
classes. As a consequence certain input features will be eliminated or emphasized in performing
this classification, but these do not reveal previously unknown genes that contribute to the
response. Extending signatures to include other features from the same or related pathways is a
foundational strategy that is based on established biochemical mechanisms and relationships
revealed from systems biology. Many of these relationships have not been elucidated, however
the principles of these interactions are well established. Feedback regulation involves the use of
a reaction product to regulate its own or a another reaction. Interacting or multi-subunit protein
8
complexes would also be subject to this type of constraint. These types of regulation tend to
affect activity other elements in the same pathways, however these processes are not exclusive to
any particular pathway or set of interactions. They comprise ubiquitous network motifs in all
kinds of molecular interaction networks, such as for example metabolic networks, regulatory
modules or signaling networks (34). Negative feedback, which counteracts external
perturbations, can cause oscillating behavior, but also has a stabilizing effect. Negative feedback
may endow cells with robustness to internal and external perturbations and play a major role in
maintaining homeostasis (35-37). Positive feedback is also critical in cellular decision processes,
by producing ultrasensitivity and prolonged responses to a transient external signal (37-39).
While positive and negative feedback loops are ubiquitous in biochemical signaling pathways,
additional effect called quasi-bistability can contribute to observed responses to different stimuli.
Quasi-bistability allows monostable systems to maintain two distinct states upon receipt of a
transient input, which may be related to positive feedback loops (40).
The discovery of other genes whose GE and/or CN contribute to chemotherapy response can be
guided by interactions with the gene products present in existing signatures. Gene products within
the same or linked biochemical pathways may contribute to drug response through direct or
indirect feedback, and through epistatic relationships to existing genes in these signatures.
Expansion of the set of candidate signature gene products is therefore based on their established
proximity in biochemical pathways, direct interactions, and regulatory relationships (41) to gene
products in the original signatures. Although software has been developed and applied to automate
this process, the extent to which particular pathways, interactions or forms of regulation are
preferentially incorporated into the expanded signatures has not yet been established. Another
limitation is that the resources used for pathway expansion of existing signatures do not currently
account for steady-state vs. dynamic difference between interactions based indirectly on gene
expression changes and actual biochemical regulation of the drug targets and metabolites.
Neighboring genes may be related either through direct interaction or by substrate-product
dependency within that same or adjacent biochemical pathways, with adjacency defined as the
number of pathway nodes separated from the original signature gene. This involves referencing
comprehensive, updated digital source of biochemical reaction and interaction pathways, for
example, from Pathway Commons (http://www.pathwaycommons.org/), which includes the
9
Reactome, ConsensusPathDB, KEGG databases. To discover either previously unknown genes, or
replace existing genes and create higher accuracy signatures, well-established functional
relationships between original and novel genes are leveraged. This integrated resource enables
efficient computational searching of hierarchical and other relationships between interacting gene
products and reaction dependences of initial signature genes. Candidates are selected based on
known relationships (regulatory, associations) with gene products included in gene signatures, and
systematically, based on network proximity to those in the current signature. By systematically
traversing the pathway graph for each signature gene, candidates are selected for another round of
machine learning based on GE and/or CN features that exhibit significant direct or inverse
correlations to GI50 values (r > 0.8). Neighbors for each signature gene are present at each level of
the biochemical and interaction network. Inclusion of neighboring genes can be influenced by
more distant node interactions that interact with gene products encoded in the original signatures.
Pathway expansion can produce large numbers of potential gene candidates that are correlated
with GI50 (Table 1). To minimize false discovery of novel signature genes, we have limited
permissible distances between original and derivative genes to a depth of up to 3 nodes.
Analysis of extended cisplatin signatures based on manual testing and inclusion of genes from a
single adjacent pathway node demonstrated that two new gene neighbors improved accuracy of
prognosis of remission, and one gene improved prognosis of recurrent disease. Candidate
immediate neighboring genes to expand cisplatin signatures (Table 1) revealed genes in
neighboring biochemical pathways that, when introduced, marginally improved accuracy (2-4%,
depending on the gene) of Cis1, the best performing published signature (7).
Expanded signatures for many drugs are still in the process being analyzed, however results for
tyrosine and serine kinase inhibitors (sunitinib, sorafenib, lapatinib, erlotinib, imatinib and
gefitinib) are currently the most mature. Initial ML-based classifiers were originally based on
backwards feature selection of response genes from published literature on these drugs. Except
for Sunitinib, the performance of all of these ML models appears to have been improved by
extension of the biochemical network to include pathway neighbors. In the best performing
datasets, overall classification accuracies for prognosis of patient response were 73% for erlotinib,
75% for lapatinib, 85% for sorafenib, 83% for sunitinib, 66% for imatinib, and 53% for gefitinib
(however the latter dataset consisted of only 11 patients, of which sensitivity was correctly
10
predicted in 4 of 4 patients and resistance in 2 of 7). Interestingly, the genes in the extended
signatures replace most, but not all, of the literature-based classifiers. The improved signatures
contain genes that are 1 or 2 nodes from the original gene from which it was associated. Generally,
they are each composed of unique sets of genes (there is little overlap between the compositions
of signatures for different kinase inhibitors).
Many additional genes are likely to be found with strong correlations to GI50 levels or other
response measures. These additional features (in either the same signature or in the number of high
performing signatures) could increase susceptibility of ML models to overfitting of the training
data, which would falsely inflate their accuracy. Several strategies can be implemented to reduce
the number of gene features, thereby minimizing the likelihood of model overfitting to specific
datasets, including: (a) limiting genes to pathways that are represented predominantly in the
original signatures with highest performance at extreme responses (most or least chemo-resistant),
(b) filtering epistatic genes that merge gene features present in the same pathway based on
evidence that they both exert the overlapping regulatory effects on a single pathway product or at
the same node in a pathway, and (c) ensemble averaging of multiple gene signatures with
equivalent performance can define consensus ML models with fewer features. The number of
distant extant nodes in pathway networks can also be pruned to mitigate overfitting of genomic
data. Features can also be prioritized by limiting the new genes to pathways that make the most
significant contributions to CT resistance. These strategies could improve reproducibility and
prognostic accuracy compared to the original gene signatures using new data sources.
The revised, expanded signatures are then derived by ML training on cancer cell line GE and/or
CN data at specified GI50 thresholds. Pathway node partners of genes in the existing gene
signatures are retained if they either reduce misclassification rates or improve classification of the
cell lines. The resultant signatures including network partner genes are compared with the existing
signatures and evaluated with patient genomic data to assess whether prognostic accuracy is also
improved for different clinical response duration thresholds. Results are limited solely to genes
that significantly contribute by MFA analysis to chemotherapy response. The relative contribution
of each gene to the chemotherapy response is determined by computing misclassification or log
loss after feature deletion). Robust gene signatures can be determined for different GI50 thresholds
11
with hinge loss functions that weight GE/CN values of cell lines with outlier GI50 values more
highly relative to those with GI50 values close to the mean threshold.
Gene expression signature expansion does not have to be restricted to polyadenylated mRNAs that
encode proteins. A complex taxonomy of the RNA universe includes many other non-coding (nc)
species, such as miRNA, RNA Polymerase III-derived transcripts (tRNA, small nuclear RNA, snoand piwi-RNA, and Alu, small nuclear RNAs), enhancer RNAs, long non-coding (lnc) RNAs,
circular (circ) RNAs, and RNA polymerase I transcripts, for example rRNAs, which comprise the
most predominant nucleic acids in cells. It is feasible to computationally derive expression
signatures that are prognostic for drug response without understanding their basis. For example,
non-coding (nc) RNAs can be used to derive accurate gene signatures for predicting BLCA and
LUAD patient response to platinum therapies based on expression data from The Cancer Genome
Atlas (42).
Nevertheless, inclusion of lncRNA and circRNA evidence in gene signatures will likely require
that these species be related extrinsic properties of drug response over a range of measured
phenotypes. Transcripts with recurrent mutation hotspots in essential coding domains that would
result in localized exon skipping, cryptic splice site activation and/or intron inclusion detected by
analysis of RNASeq would be reasonable candidates to consider for such signatures.
It is notable that different ncRNA species are identified, sequenced and measured by different
protocols. ncRNAs are still not well covered in microarray platforms but are evident by analysis
of RNASeq libraries. This introduces challenges into deriving metaRNA signatures comprising
multiple species, since globally-normalized expression levels are necessary for the signatures to
be reproducible among different datasets. Nevertheless, conventional RNASeq libraries contain
cDNA sequences from polyadenylated circRNA and lncRNAs, and are likely be generated
concurrently with mRNAs. circRNAs are byproducts of mRNA splicing resulting from excision
of non-contiguous donor and acceptor joins, usually by alternative splicing. Although highly
expressed, a unified explanation for the functions of most circRNAs has not been found. The
sequence content of short-read RNASeq cDNA libraries from mutated transcripts of circRNA
and lncRNA could be indistinguishable from those generated by genomically-encoded
mutations. We have found that mutations at splice sites can affect exon definition, resulting in
12
intron inclusion, or exon skipping, cryptic isoforms or combinations of these (43). It seems likely
that some intronic reads originate from circRNAs, which themselves may be derived from
different genomic variants. Common features in ML-based gene signatures could be consolidated
from multiple distinct, rare point mutations that produce the same altered circRNA isoforms.
While gene signatures can be derived by inclusion of lncRNA and circRNA evidence, their
utility remains to be established. Relationships to extrinsic markers of tumor or cell line response
to a drug are needed, and these should be robust over a range of phenotypes. Strong candidates
would include genes with recurrent mutation hotspots in particular essential coding domains that
would result in localized exon skipping, cryptic splice site activation and/or intron inclusion
detected by analysis of RNASeq.
Prognostic utility of cell line-derived signatures for patient CT responses.
Gene signatures based on tumor GE and/or CN data have been used to determine if ML-based
gene signatures for sensitivity are related to patient response. In the absence of GI50, EC50, IC50
ground truth measures of drug response, primary study endpoints of patient clinical trials are used
as surrogate measures for chemotherapy response thresholds. These may vary among available
datasets and can include PFS, TTP, time to treatment failure (TTF) or proportionate complete
response (CR) (44, 45). Gene signatures of invasive breast cancer performed well at the 4 year
PFS threshold in 68.7% patients and 3 year CR in 74% patients (5).
There are limitations on
interpreting drug signatures based on sample size, heterogeneous pathology, biases in patient
ascertainment or study design. Studies with small numbers of enrolled patients should be reported
as categorical results without statistical significance. Analyses of CT response have shown that the
duration of clinical follow-up metadata has been adequate for larger studies (eg. METABRIC and
the Cancer Genome Atlas). While the precise timing of administration of chemotherapy postdiagnosis may confound results, it is difficult to assess the impact of this, though this may explain
some signatures with limited prognostic value (<60% accuracy).
Approaches that are free of cell line endpoints (such as GI50) can be used set thresholds using on
patient-derived outcomes (4). For example, TTP can be used to distinguish resistance from
sensitivity phenotypes. Patients will be categorized as either sensitive or resistant based on tumors
13
status at specific time thresholds until either death or relapse. Varying these clinically-relevant
time thresholds will identify the gene signatures model(s) that optimally discriminate these
categories, with disease-free patients exceeding the threshold classified as sensitive. Patients
treated with surgery alone or radiation can serve as negative controls (4), as their outcomes are not
expected to be related to a particular CT signature. If prognosis of the signature produces similar
proportions of sensitive vs. resistant patients in the CT vs control groups, the signature would not
be related to the tumour GE and/or CN profile. The signature accuracy may also reflect selection
for proliferation of tumour sub-clones, consistent with resistance arising in a subset of cells with
accumulated genetic changes that confer resistance (44).
Datasets contain different numbers of patients with both available meta- and genomic data for a
single drug treatment or a combination therapy. For multiple, independent patient datasets treated
with the same CT, a signature is considered validated in different sets of patients if the accuracies
are comparable. The minimum number of patients required to validate a gene signature depends
on the balance between the census of sensitive vs. resistant individuals. Smaller patient datasets
should only be analyzed by categorical analysis, which may reveal only trends in the data. With
77 or more patients, sensitivity and resistance can be distinguished at p < 0.01 significance with
80% power, assuming each category to be equally prevalent. However, resistance to CT was 1.8
fold more common in the METABRIC breast cancer cohort (3). With this level of imbalance, 84
patients are required (54 and 30 patients, respectively). Our paclitaxel gene signature (4, 26) was
also based on an imbalanced outcome data, with approximately 4.2 patients developing recurrent
disease for each CR. To achieve significance, 124 patients for the signature were required (100
recurrent and 24 CR).
The analysis of expressed genetic polymorphisms in RNA sequencing data from tumors could
also potentially identify clonal events that define subpopulations of cells with different drug
response phenotypes (45). If these subpopulations were identified prior to or early in treatment,
it might assist in timely selection of appropriate therapies. Scuh bioinformatic analysis of deep
sequencing or single cell next generation sequencing data is likely to provide valuable insight
into whether the output of the prognostic signatures reflects panmyctic or heteromyctic cell
populations. These results could be important in understanding whether therapies operate on only
14
a subset or all cells. Where karyotypes are often mosaic and polyclonal, it is conceivable
correlation of genotypes with signature predictions in different cells from the same tumor might
reveal the cellular origins of drug resistant phenotypes.
Summary. Gene signatures that assist decisions to treat cancer with chemotherapy have been
clinically approved by regulators after analysis of large patient cohorts (46-50). These tests have
been endorsed as mainstream medical practice and for government reimbursement (51). The
objective of incorporating such studies into chemotherapy management has a number of
advantages for patients including the elimination of redundant genetic information, their
transferability and prognostic value, and the benefits of modeling global cellular responses that
incorporate different mechanisms of resistance in a single signature. Genomic signatures could
be used by oncologists by avoiding use of drugs with unfavorable signatures or by substituting
other drugs expected to be effective for a patient, either individually or in combination. We
envision panels of CT signatures for multiple drugs evaluated from the complete transcriptome
data of a tumor. Response profiles may be capable of providing a set of prognostic drug
responses for each patient. Such a strategy could assist in decisions to treat individuals with
approved secondary CT rather than standard first line adjuvant therapies.
Acknowledgements
PKR is supported by The Natural Sciences and Engineering Research Council of Canada
(NSERC) [RGPIN-2015-06290], Canadian Foundation for Innovation, Canada Research Chairs,
and CytoGnomix. Compute Canada and Shared Hierarchical Academic Research Computing
Network (SHARCNET) provided high performance computing and storage facilities.
15
Table 1. Correlated genes eligible for
cisplatin extended gene signature
Minimum
correlation
with GI50
≥ 0%
≥ 70%
≥ 80%
≥ 90%
≥ 95%
Number of Correlated Genes by
Node*:
1
2
3
0
2451 10780
340
14
377
2035
82
14
241
1291
50
10
120
621
23
5
61
292
11
4
*Nodes correspond to the number of biochemical
pathway steps separating a gene from the initial gene
signature [Node 0; same pathway], which include:
ATP7B, BARD1, BCL2, CDKN2C, ERCC2, FOS,
MAP3K1, MAPK13, NFKB1, PNKP, POLQ, PRKCA,
SLC22A5 and SNAI1.
16
Literature citations
1. Park, N.I., Rogan, P.K., Tarnowski, H.E. and Knoll, J.H., (2012) Structural and genic
characterization of stable genomic regions in breast cancer: Relevance to
chemotherapy. Molecular oncology 6:347-59.
2. Maglietta, R., D’Addabbo, A., Piepoli, A., Perri, F., Liuni, S., Pesole, G. and Ancona, N.
(2007). Selection of relevant genes in cancer diagnosis based on their prediction
accuracy. Artif. Intell. Med., 40(1):29–44.
3. Theodoridis S., Koutroumbas K. (2009). Pattern Recognition, 4th Edition, Elsevier.
4. Dorman S, Baranova K, Knoll J, Urquhart, B, Marciani G, Carcangiu M-L, and Rogan PK.
(2016) Genomic signatures for Paclitaxel and Gemcitabine resistance in breast cancer derived
by machine learning. Mol. Oncology, 10: 85-100.
5. Mucaki E, Baranova K, Quang HP, Rezaeian I, Angelov D, Ilie L, Ngom A, Rueda L, Rogan
PK. (2017) Predicting Outcomes of Hormone and Chemotherapy in the Molecular Taxonomy
of Breast Cancer International Consortium (METABRIC) Study by Biochemically-inspired
Machine Learning. F1000Research, 5:2124.
6. Zhao, J.Z., Mucaki, E.J. and Rogan, P.K. (2018) Predicting ionizing radiation exposure using
biochemically-inspired genomic machine learning. F1000Research 7:233.
7. Mucaki, E.J., Zhao, J.Z., Lizotte, D.J. and Rogan, P.K. (2019). Predicting responses to platin
chemotherapy agents with biochemically-inspired machine learning. Signal transduction and
targeted therapy, 4(1), 1.
8. Heiser, L.M., Sadanandam, A., Kuo, W.L., Benz, S.C., Goldstein, T.C., Ng, S., Gibb, W.J.,
Wang, N.J., Ziyad, S., Tong, F. and Bayani, N. (2012) Subtype and pathway specific
responses to anticancer compounds in breast cancer. Proc. Natl. Acad. Sci. U. S. A. 109, 27242729.
9. Daemen, A., Griffith, O.L., Heiser, L.M., Wang, N.J., Enache, O.M., Sanborn, Z., Pepin, F.,
Durinck, S., Korkola, J.E., Griffith, M. and Hur, J.S. (2013) Modeling precision treatment of
breast cancer. Genome Biol. 14, R110.
10. Hurst CD, Knowles MA (2014) Molecular subtyping of invasive bladder cancer: time to
divide and rule? Cancer Cell 25(2):135–6.
11. Choi, W., Porten, S., Kim, S., Willis, D., Plimack, E.R., Hoffman-Censits, J., Roth, B.,
Cheng, T., Tran, M., Lee, I.L. and Melquist, J. (2014) Identification of distinct Basal and
17
luminal subtypes of muscle-invasive bladder cancer with different sensitivities to frontline
chemotherapy. Cancer Cell 25(2):152–65.
12. Seiler, R., Choi, W., Lam, L.L., Erho, N., Buerki, C., Davicioni, E., Thalmann, G.N.,
McConkey, D.J. and Black, P.C (2015) Association of p53-ness with chemo-resistance in
urothelial cancers treated with neoadjuvant gemcitabine plus cisplatin. J Clin
Oncol 33(15):4512.
13. Lee, J.S., Leem, S.H., Lee, S.Y., Kim, S.C., Park, E.S., Kim, S.B., Kim, S.K., Kim, Y.J.,
Kim, W.J. and Chu, I.S. (2010) Expression signature of E2F1 and its associated genes predict
superficial to invasive progression of bladder tumors. J Clin Oncol. 28(16):2660-7.
14. Sjödahl, G., Lauss, M., Lövgren, K., Chebil, G., Gudjonsson, S., Veerla, S., Patschan, O.,
Aine, M., Fernö, M., Ringnér, M. and Månsson, W. A molecular taxonomy for urothelial
carcinoma. Clin Cancer Res (2012) 18(12):3377-86.
15. Blaveri, E., Simko, J.P., Korkola, J.E., Brewer, J.L., Baehner, F., Mehta, K., DeVries, S.,
Koppie, T., Pejavar, S., Carroll, P. and Waldman, F.M. Bladder cancer outcome and subtype
classification by gene expression. Clin Cancer Res (2005) 11(11):4044-55.
16. Shoemaker R.H. (2006) The NCI60 human tumour cell line anticancer drug screen. Nat. Rev.
Cancer 6, 813-823.
17. Neve, R.M., Chin, K., Fridlyand, J., Yeh, J., Baehner, F.L., Fevr, T., Clark, L., Bayani, N.,
Coppe, J.P., Tong, F. and Speed, T. (2006) A collection of breast cancer cell lines for the
study of functionally distinct cancer subtypes. Cancer Cell 10, 515-527.
18. Prat, A., Karginova, O., Parker, J.S., Fan, C., He, X., Bixby, L., Harrell, J.C., Roman, E.,
Adamo, B., Troester, M. and Perou, C.M. (2013). Characterization of cell lines derived from
breast cancers and normal mammary tissues for the study of the intrinsic molecular
subtypes. Breast Cancer Res. Treat 142, 237-255.
19. Hafner, M., Niepel, M., Chung, M. and Sorger, P.K.(2016) Growth rate inhibition metrics
correct for confounders in measuring sensitivity to cancer drugs. Nature Methods 13: 521-7.
20. Fawcett T (2006). An Introduction to ROC Analysis, Patt Recogn Letters, 861-874.
21. Sing, T., Sander, O., Beerenwinkel, N. and Lengauer, T. (2005). ROCR: visualizing
classifier performance in R. Bioinform. 21:3940.
22. Abe S. (2010). Support Vector Machines for Pattern Classification, Springer.
23. Breiman L. (2001). Random Forests. Journal of Machine Learning, 45(1):5-32.
18
24. Ding, C., and Peng, H.C., (2003). Minimum redundancy feature selection from microarray
gene expression data, . J Bioinform Comput Biol. 3(2): 185–205.
25. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. (2009). The
WEKA Data Mining Software; SIGKDD Explorations, 11(1):10-18.
26. Hatzis, C., Pusztai, L., Valero, V., Booser, D.J., Esserman, L., Lluch, A., Vidaurre, T.,
Holmes, F., Souchon, E., Wang, H. and Martin, M. (2011). A genomic predictor of response
and survival following taxane- anthracycline chemotherapy for invasive breast
cancer. JAMA 305:1873-1881.
27. Curtis, C., Shah, S.P., Chin, S.F., Turashvili, G., Rueda, O.M., Dunning, M.J., Speed, D.,
Lynch, A.G., Samarajiwa, S., Yuan, Y. and Gräf, S. (2012). The genomic and transcriptomic
architecture of 2,000 breast tumours reveals novel subgroups. Nature. 486:346-352
28. Galluzzi, L., Senovilla, L., Vitale, I., Michels, J., Martins, I., Kepp, O., Castedo, M. and
Kroemer, G., (2012) Molecular mechanism of cisplatin resistance. Oncogene 31:1869-83.
29. Grossman, H.B., Natale, R.B., Tangen, C.M., Speights, V.O., Vogelzang, N.J., Trump, D.L.,
White, R.W.D., Sarosdy, M.F., Wood Jr, D.P., Raghavan, D. and Crawford, E.D. (2003)
Neoadjuvant chemotherapy plus cystectomy compared with cystectomy alone for locally
advanced bladder cancer. N. Engl J. Med 349:859-866.
30. Meeks, J.J., Bellmunt, J., Bochner, B.H., Clarke, N.W., Daneshmand, S., Galsky, M.D.,
Hahn, N.M., Lerner, S.P., Mason, M., Powles, T. and Sternberg, C.N. (2012) A systematic
review of neoadjuvant and adjuvant chemotherapy for muscle-invasive bladder cancer. Eur
Urol 62(3):523–33.
31. Bellmunt, J., von der Maase, H., Mead, G.M., Skoneczna, I., De Santis, M., Daugaard, G.,
Boehle, A., Chevreau, C., Paz-Ares, L., Laufman, L.R. and Winquist, E. (2012) Randomized
phase III study comparing paclitaxel/ cisplatin/gemcitabine and gemcitabine/cisplatin in
patients with locally advanced or metastatic urothelial cancer without prior systemic therapy:
EORTC Intergroup Study 30987. J Clin Oncol 30(10):1107–13.
32. Batuwita R., Palade V. (2012a). Adjusted Geometric-Mean: A Novel Performance Measure
for Imbalanced Bioinformatic Datasets Learning. J Bioinf Comp Biol, 10(4):1250003.
33. Batuwita R., Palade V. (2012b). Class imbalance Learning Methods for Support Vector
Machines. In Imbalanced Learning: Foundations, Algorithms and Applications, Wiley.
19
34 Alon U. (2006) An introduction to systems biology - design principles of biological circuits.
Math and Comput Biol Series. London: Chapman & Hall/CRC.
35 Thomas R. (1981) On the relation between the logical structure of systems and their ability to
generate multiple steady states or sustained oscillations. In: Della-Dora J, Demongeot J,
Lacolle B, editors. Numerical methods in the study of critical phenomena. Springer Series in
Synergetics, vol. 9. Berlin, Heidelberg: Springer. p. 180–93.
36. Gouzé J-L. (1998) Positive and negative circuits in dynamical systems. J Biol Syst. 6(21):11–
15.
37. Freeman M. (2000) Feedback control of intracellular signalling in development. Nature 408:
313–19.
38. Alon U. (2007) Network motifs: theory and experimental approaches. Nat Rev Genet. 8:
450–61.
39. Savageau MA. (1974) Comparison of classical and autogenous systems of regulation in
inducible operons. Nature 252(5484): 546–49.
40. Jensch A, Thomaseth C and Radde NE. (2017) Sampling-based Bayesian approaches reveal
the importance of quasi-bistable behavior in cellular decision processes on the example of the
MAPK signaling pathway in PC-12 cell lines. BMC Systems Biology 11: 11.
41. van den Akker, E., Verbruggen, B., Heijmans, B., Beekman, M., Kok, J., Slagboom, E. and
Reinders, M. (2011) Integrating Protein-Protein Interaction Networks with Gene-Gene CoExpression Networks Improves Gene Signatures for Classifying Breast Cancer Metastasis. J.
Integr. Bioinform. 8(2): 188.
42. Zhu Y, Zhao Y, Dong S, Liu L, Tai L, and Xu Y. (2019). Systematic identification of
dysregulated lncRNAs associated with platinum-based chemotherapy response across 11
cancer types. Genomics (https://doi.org/10.1016/j.ygeno.2019.07.007)
43. Shirley BC, Mucaki EJ and Rogan PK. (2019) Pan-cancer repository of validated natural
and cryptic mRNA splicing mutations. F1000Research 7:1908 44. Mullighan, C.G.,
Phillips, L.A., Su, X., Ma, J., Miller, C.B., Shurtleff, S.A. and Downing, J.R. (2008) Genomic
analysis of the clonal origins of relapsed acute lymphoblastic leukemia. Science 322: 1377–
1380
45. Yizhak K, Aguet F, Kim J, Hess JM, KÜBLER K, Grimbsby J, Frazer R, Zhang
H, Haradhvala NJ, Rosebrock D, Livitz D, Li X, Arich-Landkof E, Shoresh
20
N, STEWART C, Segre AV, Branton PA, Polak P, Ardlie KG, Getz G (2019) RNA
sequence analysis reveals macroscopic somatic clonal expansion across normal tissues.
Science 364: eaaw0726.
46. Saad ED and Katz A. Progression-free survival and time to progression as primary end points
in advanced breast cancer: often used, sometimes loosely defined, Annals of Oncology, 20[3]:
460–464.
47. Lloyd KL, Cree IA, Savage RS. (2015) Prediction of resistance to chemotherapy in ovarian
cancer: a systematic review. BMC Cancer. 15: 117.
48. Ein-Dor, L., Zuk, O. and Domany, E. (2006) Thousands of samples are needed to generate a
robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. 103: 5923-5928.
49. Nilsson, R., Björkegren, J. and Tegnér, J. (2009) On reliable discovery of molecular
signatures. BMC Bioinform. 10: 38.
50. Smith SC, Baras AS, Lee JK and Theodorescu D. (2009) The COXEN Principle:
Translating Signatures of In vitro Chemosensitivity into Tools for Clinical Outcome
Prediction and Drug Discovery in Cancer. Cancer Res. 70: 1753-1758.
51. Varmus H. (2016) The transformation of oncology. Science 352: 123.
21
Figure legends
Figure 1. Biochemically-inspired gene signature workflow. Genes biologically relevant to drug
response were collected through examination of the published literature and public databases.
Association of GE and/or CN with GI50 was tested in drug-treated 49 breast and 18 bladder
carcinoma cell lines. Multiple Factor Analysis (MFA) shows either positive or inversely correlated
relationships between the GI50 and GE or CN as coincident or opposing vectors. These genes were
then used to build models for the drug using the following machine learning techniques: Support
Vector Machine Learning, Random Forest Decision Trees, Minimum Redundancy Maximum
Relevance wrapper for feature selection, and Simulated Annealing. Those models were
subsequently used on patient data to predict drug response.
Figure 2. Schematic of cisplatin sensitivity and resistance genes. Genes are indicated according
to their role in drug metabolism. Red boxes indicate genes with a positive correlation between
gene expression or copy number to GI50 and IC50
using Multiple Factor Analysis. Blue
demonstrates a negative correlation.
Figure 3. Clustering cisplatin-treated patients based on expression of damage response
genes in signature.
Of 146 genes involved in G1/S DNA damage checkpoints (N=61), base excision repair (N=35),
and nucleotide excision repair (N=50), 11 showed significant differences in gene expression
between 4 sensitive and 6 resistant bladder cell lines9. These 11 genes were selected for
unsupervised clustering of expression in treated patients in NCBI GEO datasets, GSE5287 and
GSE31684. (A) indicates differences in survival for patients in GSE31684 with poor (2.1),
intermediate (1), and good (2) prognostic outcomes (percent survival) in a Kaplan-Meier plot
based on expression of these genes. (B) shows the heatmap for these clusters by patient
(GSE31684; n=144; red=high, blue=low expression). In this case, the differences between the
outcome categories were not statistically significant.
Figure 4. Ensemble of gene signatures for cisplatin at different GI50 thresholds. Each column
represents the contributions of genes (number of genes by pathway: 4 apoptosis, 2 cell growth, 9
22
DNA repair, 2 metabolism, 3 metal binding, 3 signal transduction, 7 transcription and 9 transport)
at different GI50 levels. Font size corresponds to frequency of gene in different gene signature
models.
23
Figure 1
24
Figure 2
25
A.
C lu s te r 1
P e r c e n t s u r v iv a l
100
C lu s te r 2
C lu s te r 2 .1
50
0
0
50
100
150
M o n th s
B.
Figure 3.
26
200
Figure 4
27