Leveraging Big Data To Transform Target
The advances of genomics, sequencing, and high throughput technologies have led to the creation of large volumes of
diverse datasets for drug discovery. Analyzing these datasets to better understand disease and discover new drugs is
becoming more common. Recent open data initiatives in basic and clinical research have dramatically increased the types
of data available to the public. The past few years have witnessed successful use of big data in many sectors across the
whole drug discovery pipeline. In this review, we will highlight the state of the art in leveraging big data to identify new
targets, drug indications, and drug response biomarkers in this era of precision medicine.
In 2013, the European Bioinformatics Institute hosted 15 petabytes of data in its shared file systems.1 This increased to 25 petabytes in 2014, which is equal to the hard drive space of over 12,000 current-day typical personal laptops (each with a 2 terabyte drive). These data were distributed in over 120,000 datasets available for searching and analysis in 2014. As voluminous as these data sound, these numbers simply reflect the complexity and growth of the data from one single institute.

This growth in the digitalization of biomedical research is due to the advances and decreasing costs of genomics and sequencing, and the increasing use of high throughput technologies in the research enterprise. Large volumes of biomedical data are being produced every day, and many of these data are now becoming publicly available, owing to open data initiatives. Although the field of biomedical informatics is facing challenges in the storage and management of these datasets, this field is also embracing more exciting opportunities in the discovery of new knowledge from these data.2 Big datasets are now not only routinely analyzed to inform discovery and validate hypotheses, but also frequently repurposed to ask new biomedical questions. However, researchers are facing so many datasets that it is sometimes difficult to choose the appropriate one for their studies. In this review, we will first describe the data types commonly used in drug discovery and then list publicly available datasets. We will highlight some remarkable datasets that led to the discovery of new targets, drugs, or drug response biomarkers.

WHAT BIG DATA ARE AVAILABLE FOR DRUG DISCOVERY?
Drug discovery often starts with the classification and understanding of disease processes, followed by target identification and lead compound discovery. One trend of disease classification in drug discovery is moving from a symptom-based disease classification system to a system of precision medicine based on molecular states.3,4 Building a new classification of diseases requires molecular characterization of all diseases. In addition, an ideal level of disease understanding would characterize all levels of molecular changes, from DNA to RNA to protein, as well as the effects of environmental factors.

Each level of molecular change can be characterized by the analysis of relevant data points. Table 1 lists the data types frequently used in drug discovery and their current relevant technologies. At the DNA level, single-nucleotide polymorphisms (SNPs) that occur specifically in the disease population are one type of DNA sequence variation widely used to characterize disease. Copy number variations (CNVs) reflect relatively large regions of genome alterations, which may also be associated with disease. Both SNPs and CNVs can be identified from genome-wide association studies (GWASs) and whole genome sequencing approaches. Mutations, particularly somatic mutations, are widely examined using next generation sequencing to find driver genes in cancer that confer a selective growth advantage on cells.

At the RNA level, gene expression (primarily mRNA) is arguably the most widely used feature for disease characterization. It has been used extensively to understand disease mechanism owing to the development of microarray technology. The recent development of RNA-Seq presents merits in the expanded coverage of transcripts and in the detection of low-abundance transcripts.5 Protein expression is another critical feature used to characterize disease. Large-scale quantification of protein
1 Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, California, USA. Correspondence: Bin Chen (bin.chen@ucsf.edu), Atul J Butte (atul.butte@ucsf.edu)
Received 12 November 2015; accepted 2 December 2015; advance online publication 11 December 2015. doi:10.1002/cpt.318
Table 1 (continued on next page)

Gene expression *****
  Description: Mostly expression of mRNA but also includes expression of other transcripts
  Technologies:
    Microarray: most widely used
    RNA-Seq: can detect novel transcripts, low abundant transcripts and isoforms
    Fluorescent in situ hybridization: can detect transcript abundance and spatial location in cells for a small number of genes
    RT-PCR: frequently used to confirm expression for a small number of genes

Protein expression ***
  Description: Can be expression of multiple isoforms or variations due to posttranslational modifications
  Technologies:
    Western blot: widely used to quantify protein expression for a small number of proteins
    ELISA: widely used to detect and quantitatively measure a protein in samples
    Immunohistochemistry: can detect intracellular localization for a small number of proteins
    Reverse phase protein array: can detect expression for a few hundred proteins
    Mass spectrometry: can detect expression for a wide range of proteins

Protein-protein interaction ****
  Description: Physical interactions between two or more proteins
  Technologies:
    Two-hybrid screening: low-tech; high false-positive rate
    Mass spectrometry

Protein-DNA interaction ***
  Description: Binding of a protein to a molecule of DNA
  Technologies:
    ChIP-seq: combines chromatin immunoprecipitation with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins

Gene silencing **
  Description: Effect of loss of gene function
  Technologies:
    RNAi: established method; knocks gene down at mRNA or non-coding RNA level; can have transient effect (siRNA) or long-term effect (shRNA)
    CRISPR-Cas9: new method; modifies gene (via knockout/knockin) at the DNA level; causes permanent and heritable changes in the genome

Gene overexpression *
  Description: Effect of gain of gene function
  Technologies:
    cDNAs/ORFs: provide clones of sequence
expression has recently become possible because of emerging high throughput technologies, such as reverse-phase protein arrays and mass spectrometry, although their coverage and quality remain limited. The interactions among DNA, RNA, and protein can be captured by ChIP-Seq, mass spectrometry, and other techniques; however, most of those interactions have been captured only in cell lines or other in vitro models. More recently, next generation sequencing approaches have allowed sequencing of environmental factors, such as microbial cells in the human body. Today, a snapshot of the molecular changes in disease can be quickly modeled by the variety of datasets collected using multiple techniques. The recent development of single cell sequencing adds another layer of molecular changes. The number of layers dramatically increases as we consider the dynamic process of disease progression. Moreover, in addition to disease samples from patients, diverse preclinical models (e.g., cell lines, animal models) could be molecularly characterized in order to understand disease and validate hypotheses.

On the drug side, molecular changes in disease models perturbed by chemical or genetic agents can be captured to understand disease and drug mechanism. Gene function and gene regulatory networks can be studied via genome-wide functional screens, such as RNAi and clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9.6 In addition, cellular responses to thousands of chemical compounds in a large number of disease models can be quickly detected by high throughput screening. Patient response upon drug intervention can now be tracked and analyzed owing to the availability of electronic medical records (EMRs) and clinical trials. In addition to the molecular and clinical data, free-text data presented in the literature are also useful in drug discovery.

WHAT BIG DATA SOURCES ARE PUBLICLY AVAILABLE?
No single laboratory, institute, or consortium is able to produce data that fully capture all the layers of complex disease systems. In addition, understanding of these systems relies on a large number of samples, such that statistical power can be reached. Integrative analysis of multiple layers of data points from different sources is thus essential to understand disease and discover new drugs. Hence, it is of utmost importance that the data be open to the public, such that every piece of information can be easily connected.

Many important reference datasets have recently been created and released, and can be used for drug discovery.7 Notable examples are listed in Table 2. Arguably, public datasets can be used to inform every step of preclinical drug discovery. Clinical datasets are becoming increasingly open as well.8 Figure 1 shows a list of public datasets that can be leveraged to identify new targets, drug indications, and drug response biomarkers. Not only have public datasets been widely used as a source of reference, but they have also been intensively analyzed to ask new questions, discover new findings, or even validate hypotheses. In this review, we selectively cover some outstanding cases from the past few years in which discoveries were made primarily through the analysis of big data and validated rigorously through experimental approaches.

LEVERAGE BIG DATA TO IDENTIFY NEW TARGETS FOR PRECLINICAL STUDIES
Using big data to select targets for preclinical studies often starts with the identification of molecular changes between disease samples and healthy samples. The molecular changes are reflected in gene expression changes, genetic variations, or other features, and are then used to inform target discovery. Figure 2 illustrates three common big data approaches that use different molecular features to discover targets, and basic experimental approaches to validate targets. We will first discuss these three approaches and then suggest that public datasets can be used to validate targets before time-consuming experiments.

Target discovery using gene expression data
Among molecular features, gene expression is the most widely used feature and has been extensively explored to inform target selection. As an example, Grieb et al.9 found that MTBP was significantly elevated in breast cancer samples compared with normal breast tissues by examining mRNA expression of 844 breast cancer samples from The Cancer Genome Atlas (TCGA). Analysis of survival data revealed that increased MTBP levels are
significantly linked with poor patient survival. They further stratified patients into clinically relevant subgroups (estrogen-receptor positive, HER2 positive, and triple negative breast cancer (TNBC) tumors) and observed that MTBP expression is higher in the triple negative tumor subgroup than in the other two subgroups. Further knockdown of MTBP significantly impaired TNBC tumor growth in vivo. In another example, analysis of clear cell renal cell carcinoma samples from TCGA indicated that TPL2 overexpression was significantly related to the presence of metastases and poor outcome in clear cell renal cell carcinoma.10 Silencing of TPL2 inhibited cell proliferation, clonogenicity, anoikis resistance, migration, and invasion capabilities and inhibited orthotopic xenograft growth and lung metastasis, demonstrating the significant role of TPL2 in disease progression. In addition, public gene expression databases, such as TCGA, are regularly used as a source of reference to confirm gene expression. One example is the confirmation of HMMR in the study of glioblastoma.11

The targets in these previous examples were first proposed by the authors and were then confirmed by the analysis of public gene expression data. By contrast, targets can also be directly discovered through the primary analysis of gene expression data. Without any specific targets in mind, Hsu et al.12 sought druggable kinases that are oncogenic in TNBC. By analyzing gene expression data from CCLE and the National Cancer Institute-60 panel of cancer cell lines, and gene expression profiles of breast tumor initiating cells, they found 13 kinases with higher mRNA expression in TNBC cell lines than in non-TNBC cell lines. Subsequent protein expression validation reduced the candidate list to eight kinases, which were further correlated to TNBC clinical subtype samples in TCGA. Among these eight kinases, three kinases (PKC-α, CDK6, and MET) with high expression were associated with shorter overall survival in patients with TNBC, suggesting their potential as prognostic markers and therapeutic targets. In the subsequent functional validation, two-drug combinations targeting these three kinases inhibited TNBC cell proliferation and tumorigenic potential, and a combination of PKC-α and MET inhibitors attenuated tumor growth in vivo.

Analyzing the samples from a single data source may limit the broader application of the findings because of biological and technical bias. Meta-analysis that is aimed at detecting consistent changes across multiple data sources may increase statistical power and further mitigate the bias. The availability of public datasets enables researchers to perform meta-analysis of microarray datasets for many diseases. Our colleagues Kodama et al.13 proposed a meta-analysis approach: a gene expression-based GWAS that searches for genes repeatedly implicated in multiple experiments. They carried out an expression-based GWAS for type 2 diabetes by using 1,175 samples collected from 130 independent microarray experiments and identified the immune-cell receptor CD44 as the top candidate. They further validated that CD44 deficiency ameliorated adipose tissue inflammation and
Figure 1 Public datasets can be leveraged to identify new targets, drug indications, and drug response biomarkers.
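The idea of searching for consistent expression changes across multiple data sources, as discussed in this section, can be illustrated with a minimal sketch. It is not the method of any study cited here: it uses simple vote counting on log fold changes, and the gene names, the toy values, the threshold, and the function names are all hypothetical. Expression values are assumed to be log2-scaled, and at least one dataset is assumed to be supplied.

```python
from statistics import mean


def log_fold_change(disease, normal):
    """Difference of mean expression (disease minus normal); values assumed log2-scaled."""
    return mean(disease) - mean(normal)


def consistently_up(datasets, min_lfc=1.0):
    """Return genes overexpressed (log fold change >= min_lfc) in every dataset.

    Only genes measured in all datasets are considered. This is simple vote
    counting (an assumption for illustration), not a full statistical meta-analysis.
    """
    shared_genes = set.intersection(*(set(d) for d in datasets))
    hits = []
    for gene in sorted(shared_genes):
        changes = [
            log_fold_change(d[gene]["disease"], d[gene]["normal"])
            for d in datasets
        ]
        if all(change >= min_lfc for change in changes):
            hits.append(gene)
    return hits


# Hypothetical toy data: two independent datasets with log2 expression values.
toy_datasets = [
    {
        "GENE_A": {"disease": [8.1, 8.4, 7.9], "normal": [6.0, 6.2, 5.9]},
        "GENE_B": {"disease": [5.0, 5.2, 4.9], "normal": [5.1, 5.0, 5.2]},
    },
    {
        "GENE_A": {"disease": [9.0, 9.3], "normal": [7.1, 7.0]},
        "GENE_B": {"disease": [6.5, 6.7], "normal": [4.0, 4.2]},
    },
]

print(consistently_up(toy_datasets))
```

In this toy input, only GENE_A is elevated in both datasets; GENE_B is elevated in one dataset only and is therefore dropped. A real analysis would also need normalization across platforms, a statistical test per dataset, and correction for multiple testing.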
[Figure: panel labels only. Gene expression data (disease samples vs. normal samples; relative fluorescence per gene). Somatic mutation data (DNA sequencing of disease samples and normal samples; genetically altered genes). Genetic association data (SNP array of patients vs. non-patients; risk SNP rs73014012 and risk gene; axes: position on chromosome vs. -log10(p-value); disease risk genes). Validation (protein expression using western blot and immunohistochemistry; functional validation in vitro and in vivo: cell viability after loss of gene function in vitro, tumor growth after loss of gene function in vivo).]
insulin resistance and anti-CD44 treatment decreased blood glucose levels and adipose macrophage infiltration. In another example, our colleagues Chen et al.14 analyzed 13 independent non-small cell lung cancer (non-SCLC) gene expression datasets consisting of 2,026 lung samples collected from Gene Expression Omnibus (GEO). They identified 11 genes that were consistently overexpressed across all the samples, including the protein kinase PTK7. Immunostaining revealed that PTK7 was highly expressed in primary adenocarcinoma patient samples. They verified that RNA interference-mediated attenuation of PTK7 decreased cell viability and increased apoptosis in a subset of adenocarcinoma cell lines and that loss of PTK7 impaired tumor growth in xenotransplantation assays, suggesting its potential as a novel therapeutic target in non-SCLC.

Target discovery using somatic mutation data
Many complex diseases are caused by alterations of DNA sequences. Targeting genetic alterations is thus an ideal approach to find therapeutic solutions. Recent advances in DNA sequencing technologies have enabled large-scale characterization of disease samples. Analyzing molecular data of these samples plays an essential role in identifying alterations responsible for disease. TCGA is one notable example that molecularly characterized >10,000 tumor samples across more than 30 cancers using multiple technologies.15 The large-scale analysis of tumor samples suggested that an average of 33 to 66 genes harbor somatic mutations that could alter the function of their protein targets and that 140 genes can promote tumorigenesis.16 Most human cancers are caused by two to eight sequential alterations that lead to a selective growth advantage of the cell in which they reside.16 These alterations have been widely explored as therapeutic targets. Representative examples include EGFR amplification in lung cancer,17 BRAF mutation in melanoma,18 and ALK translocations in lung cancer.19

A cancer that possesses a genomic alteration may be treated by a drug that targets this alteration, even though this drug was not originally discovered for this tumor type. For instance, KIT was discovered as a target for chronic myelogenous leukemia and was later discovered as a target in gastrointestinal stromal tumors, leading to the repositioning of the KIT inhibitor, imatinib, for treating patients with KIT-positive gastrointestinal stromal tumors.20 Rubio-Perez et al.21 recently collected and analyzed somatic mutations, copy-number alterations, fusion genes, and RNA-Seq expression data of 4,068 tumors in 16 cancer types in TCGA and collected somatic mutations for 2,724 additional tumors. They identified 459 mutational driver genes and 38
or viability. Essential genes in 72 breast, pancreatic, and ovarian cancer cell lines were inferred using a lentiviral shRNA library targeting 16,000 genes.33 Essential genes in a few human cancer cell lines were also characterized recently using the bacterial CRISPR system.34 Target function can even be inferred through the measurement of gene expression changes upon genetic perturbation (data available in the Library of Integrated Network-based Cellular Signatures).

Outstanding challenges
First, measurements made from disease samples may have poor quality. Recent studies indicated that a large number of tumor samples are impure because of admixed immune cells and stromal cells.35 Second, large technical and biological variation exists among samples. Third, the quality of reagents, especially antibodies, varies widely.36 The misuse of antibodies may directly lead to the failure of experiments. Last, although datasets from high throughput experiments are useful either as a reference tool to detect expression or as a tool to infer biological function, they occasionally give false signals, resulting in the misclassification of potentially good targets.

LEVERAGE BIG DATA TO IDENTIFY NEW DRUG INDICATIONS FOR PRECLINICAL STUDIES
Since discovering a new chemical entity is a very long and complicated process, we will mainly discuss the reuse of existing drugs (referred to as drug repositioning), which offers a relatively short approval process and a straightforward path to clinical translation. Computational approaches for drug repositioning have been reviewed previously.37,38 Figure 3 illustrates three common big data approaches that use different features to discover new drug indications, and basic experimental approaches to validate them. We will first discuss these three approaches and then discuss the discovery of new drug combinations. Finally, we will argue that public datasets can be used to validate drug indications before time-consuming experiments.

Indication discovery using drug-target data
Targeting an individual alteration using either a small or a large molecule remains the main paradigm in drug discovery. This approach has led to the discovery of many successful drugs, such as trastuzumab (HER2 in breast cancer), crizotinib (ALK in non-SCLC), and dabrafenib (BRAF in melanoma). When a new
in 50 breast cancer cell lines68 and 90 drugs in 51 stable cancer cell lines.69

Biomarker discovery using genomic data from clinical samples
Biomarkers can also be detected by comparing genomic profiles of clinical samples. Outstanding examples include the finding of EGFR mutations as a predictor of sensitivity to gefitinib,70 and a 12-gene colon cancer recurrence score as a predictor of recurrence in patients with stage II and III colon cancer treated with fluorouracil and leucovorin.71 O’Connell et al.71 performed quantitative reverse transcription polymerase chain reaction of 375 genes in four independent cohorts consisting of 1,851 patients with stage II or III colon cancer. These patients were treated either with surgery alone or with surgery plus fluorouracil/leucovorin, and their recurrence-free interval at three years was observed. Of 375 genes, 48 genes were significantly associated with risk of recurrence and 66 genes were significantly associated with fluorouracil/leucovorin benefit. From these genes, seven genes were selected based on their biology and the strength of association with outcomes. Expression of these seven genes was normalized against five reference genes, leading to the development of a recurrence score used to predict the risk of recurrence. The recurrence score was subsequently validated in independent clinical studies.72

Outstanding challenges
Lack of effective biomarkers may lead to the failure of clinical trials, whereas biomarkers are only detected or confirmed through clinical trials. The complexity and large variation of clinical trials may cause some important biomarkers to be missed in the original study. This issue can be mitigated through an integrative analysis of clinical trials across multiple studies. Unfortunately, a large number of trials are still not available to the public. Open clinical trial data are necessary in order to identify more effective biomarkers for current therapies or even rescue failed drugs by identifying the right patient populations.

PERSPECTIVES
One belief of the current drug discovery paradigm is that thoroughly understanding molecular changes of diseases will ultimately lead to the discovery of new therapeutics. In order to