Leveraging Big Data to Transform Target
Selection and Drug Discovery

B Chen1 and AJ Butte1
Advances in genomics, sequencing, and high throughput technologies have led to the creation of large volumes of
diverse datasets for drug discovery. Analyzing these datasets to better understand disease and discover new drugs is
becoming more common. Recent open data initiatives in basic and clinical research have dramatically increased the types
of data available to the public. The past few years have witnessed successful use of big data in many sectors across the
whole drug discovery pipeline. In this review, we will highlight the state of the art in leveraging big data to identify new
targets, drug indications, and drug response biomarkers in this era of precision medicine.
In 2013, the European Bioinformatics Institute hosted 15 petabytes of data in their shared file systems.1 This increased to 25 petabytes in 2014, which is equal to the hard drive space of over 12,000 current-day typical personal laptops (each with a 2 terabyte drive). These data were distributed in over 120,000 datasets available for searching and analysis in 2014. As voluminous as these data sound, these numbers simply reflect the complexity and growth of the data from one single institute.
This growth in the digitalization of biomedical research is due to the advances and decreasing costs of genomics and sequencing, and to the increasing use of high throughput technologies in the research enterprise. Large volumes of biomedical data are being produced every day, and many of these data are now becoming publicly available, owing to open data initiatives. Although the field of biomedical informatics is facing challenges in the storage and management of these datasets, it is also embracing exciting opportunities in the discovery of new knowledge from these data.2 Big datasets are now not only routinely analyzed to inform discovery and validate hypotheses, but also frequently repurposed to ask new biomedical questions. However, researchers are facing so many datasets that it is sometimes difficult to choose the appropriate one for their studies. In this review, we will first describe the data types commonly used in drug discovery and then list publicly available datasets. We will highlight some remarkable datasets that led to the discovery of new targets, drugs, or drug response biomarkers.

WHAT BIG DATA ARE AVAILABLE FOR DRUG DISCOVERY?
Drug discovery often starts with the classification and understanding of disease processes, followed by target identification and lead compound discovery. One trend of disease classification in drug discovery is moving from a symptom-based disease classification system to a system of precision medicine based on molecular states.3,4 Building a new classification of diseases requires molecular characterization of all diseases. In addition, an ideal level of disease understanding would characterize all levels of molecular changes, from DNA to RNA to protein, as well as the effects of environmental factors.
Each level of molecular change can be characterized by the analysis of relevant data points. Table 1 lists the data types frequently used in drug discovery and their current relevant technologies. At the DNA level, single-nucleotide polymorphisms (SNPs) that occur specifically in the disease population are one type of DNA sequence variation widely used to characterize disease. Copy number variations (CNVs) reflect relatively large regions of genome alterations, which may also be associated with disease. Both SNPs and CNVs can be identified from genome-wide association studies (GWASs) and whole genome sequencing approaches. Mutations, particularly somatic mutations, are widely examined using next generation sequencing to find driver genes in cancer that confer a selective growth advantage of cells.
At the RNA level, gene expression (primarily mRNA) is arguably the most widely used feature for disease characterization. It has been used extensively to understand disease mechanism owing to the development of microarray technology. The recent development of RNA-Seq offers expanded coverage of transcripts and detection of low-abundance transcripts.5 Protein expression is another critical feature used to characterize disease. Large-scale quantification of protein
1 Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, California, USA. Correspondence: Bin Chen (bin.chen@ucsf.edu), Atul J Butte (atul.butte@ucsf.edu)
Received 12 November 2015; accepted 2 December 2015; advance online publication 11 December 2015. doi:10.1002/cpt.318
CLINICAL PHARMACOLOGY & THERAPEUTICS | VOLUME 99 NUMBER 3 | MARCH 2016 285

Table 1 Common data types for drug discovery

| Data type | Description | Common techniques | Public availability (a) |
| --- | --- | --- | --- |
| SNP | A single nucleotide variation in a genetic sequence | SNP array: most widely used. Whole genome sequencing. | **** |
| CNV | Variation of the number of copies of a particular gene in the genetic sequence | SNP array: most widely used; less sample DNA required; high probe density and coverage. Comparative genome hybridization: high sensitivity and specificity; low spatial resolution. Whole genome sequencing: can detect smaller CNVs and novel types (e.g., inversions). | **** |
| Mutation | A permanent change of the nucleotide sequence of the DNA; mostly somatic mutation that occurs in any of the cells except the germ cells | Whole exome sequencing: most widely used. Whole genome sequencing: more expensive and more coverage. | **** |
| Gene expression | Mostly expression of mRNA but also includes expression of other transcripts | Microarray: most widely used. RNA-Seq: can detect novel transcripts, low abundant transcripts and isoforms. Fluorescent in situ hybridization: can detect transcript abundance and spatial location in cells for a small number of genes. RT-PCR: frequently used to confirm expression for a small number of genes. | ***** |
| Protein expression | Can be expression of multiple isoforms or variations due to posttranslational modifications | Western blot: widely used to quantify protein expression for a small number of proteins. ELISA: widely used to detect and quantitatively measure a protein in samples. Immunohistochemistry: can detect intracellular localization for a small number of proteins. Reverse phase protein array: can detect expression for a few hundred proteins. Mass spectrometry: can detect expression for a wide range of proteins. | *** |
| Protein-protein interaction | Physical interactions between two or more proteins | Two-hybrid screening: low-tech; high false-positive rate. Mass spectrometry. | **** |
| Protein-DNA interaction | Binding of a protein to a molecule of DNA | ChIP-seq: combines chromatin immunoprecipitation with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. | *** |
| Gene silencing | Effect of loss of gene function | RNAi: established method; knocks gene down at mRNA or non-coding RNA level; can have transient effect (siRNA) or long-term effect (shRNA). CRISPR-Cas9: new method; modifies gene (via knockout/knockin) at the DNA level; causes permanent and heritable changes in the genome. | ** |
| Gene overexpression | Effect of gain of gene function | cDNAs/ORFs: provide clones of sequence. | * |
| Drug efficacy | Effect of drug treatment; primarily represented as IC50/EC50/GI50 in vitro | HTS: rapidly assess the activity of a large number of compounds in biochemical assays or cell-based assays. MTT assay: often used to confirm activity for a small number of compounds. | *** |
| Drug-target interaction | Physical interaction between a drug and a protein target | Affinity chromatography with mass spectrometry: most sensitive and unbiased method. SPR. | *** |
| EMR/EHR | Patient response upon interventions | Digitalization | * |

CNV, copy number variation; CRISPR, clustered regularly interspaced short palindromic repeats; ELISA, enzyme-linked immunosorbent assay; EMR/EHR, electronic medical/health records; HTS, high throughput screening; MTT, methylthiazol tetrazolium; RT-PCR, real-time polymerase chain reaction; SNP, single-nucleotide polymorphism; SPR, surface plasmon resonance.
(a) The number of asterisks indicates the degree of public availability. For example, ***** shows that researchers can easily access this type of data via public portals.
expression has recently become possible because of emerging high throughput technologies, such as reverse-phase protein arrays and mass spectrometry, although their coverage and quality remain limited. The interactions among DNA, RNA, and protein can be captured by ChIP-Seq, mass spectrometry, and other techniques; however, most of those interactions have been captured only in cell lines or other in vitro models. More recently, next generation sequencing approaches have allowed sequencing of environmental factors, such as microbial cells in the human body. Today, a snapshot of the molecular changes in disease can be quickly modeled by the variety of datasets collected using multiple techniques. The recent development of single cell sequencing adds another layer of molecular changes. The number of layers dramatically increases as we consider the dynamic process of disease progression. Moreover, in addition to disease samples from patients, diverse preclinical models (e.g., cell lines, animal models) can be molecularly characterized in order to understand disease and validate hypotheses.
On the drug side, molecular changes in disease models perturbed by chemical or genetic agents can be captured to understand disease and drug mechanism. Gene function and gene regulatory networks can be studied via genome-wide functional screens, such as RNAi and clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9.6 In addition, cellular responses to thousands of chemical compounds in a large number of disease models can be quickly detected by high throughput screening. Patient response upon drug intervention can now be tracked and analyzed owing to the availability of electronic medical records (EMRs) and clinical trials. In addition to the molecular and clinical data, free-text data presented in the literature are also useful in drug discovery.

WHAT BIG DATA SOURCES ARE PUBLICLY AVAILABLE?
No single laboratory, institute, or consortium is able to produce the data fully capturing all the layers of the complex disease systems. In addition, understanding of these systems relies on a large number of samples, such that statistical power can be reached. Integrative analysis of multiple layers of data points from different sources is thus essential to understand disease and discover new drugs. Hence, it is of utmost importance that the data should be open to the public, such that every piece of information can be easily connected.
Many important reference datasets have recently been created and released, and can be used for drug discovery.7 Notable examples are listed in Table 2. Arguably, public datasets can be used to inform every step of preclinical drug discovery. Clinical datasets are becoming increasingly open as well.8 Figure 1 shows a list of public datasets that can be leveraged to identify new targets, drug indications, and drug response biomarkers. Not only have public datasets been widely used as a source of reference, but they have also been intensively analyzed to ask new questions, discover new findings, or even validate hypotheses. In this review, we selectively discuss some outstanding cases from the past few years in which discoveries were made primarily through the analysis of big data and validated rigorously through experimental approaches.

LEVERAGE BIG DATA TO IDENTIFY NEW TARGETS FOR PRECLINICAL STUDIES
Using big data to select targets for preclinical studies often starts with the identification of molecular changes between disease samples and healthy samples. These molecular changes are reflected in gene expression changes, genetic variation, or other features, and are then used to inform target discovery. Figure 2 illustrates three common big data approaches that use different molecular features to discover targets, and basic experimental approaches to validate targets. We will first discuss these three approaches and then suggest that public datasets can be used to validate targets before time-consuming experiments.

Target discovery using gene expression data
Among molecular features, gene expression is the most widely used feature and has been extensively explored to inform target selection. As an example, Grieb et al.9 found that MTBP was significantly elevated in breast cancer samples compared with normal breast tissues by examining mRNA expression of 844 breast cancer samples from The Cancer Genome Atlas (TCGA). Analysis of survival data revealed that increased MTBP levels are

Table 2 Common public databases for drug discovery

| Database | Description (as of October 2015) | URL |
| --- | --- | --- |
| dbSNP | SNPs for a wide range of organisms, including >150M human reference SNPs. | http://www.ncbi.nlm.nih.gov/snp |
| dbVar | Genomic structural variations (primarily CNVs) generated mostly by published studies of various organisms, including >2.1M human CNVs. | http://www.ncbi.nlm.nih.gov/dbvar |
| COSMIC | Primarily somatic mutations from expert curation and genome-wide screening, including >3.5M coding mutations. | http://cancer.sanger.ac.uk/cosmic |
| 1000 Genomes Project | Genomes of a large number of people to provide a comprehensive resource on human genetic variation, including >2.5K samples. | http://www.1000genomes.org |
| TCGA | Genomics and functional genomics data repository for >30 cancers across >10K samples. Primary data types include mutation, copy number, mRNA, and protein expression. | https://tcga-data.nci.nih.gov/tcga |
| GEO | Functional genomics data repository hosted by NCBI, including >1.6M samples. | http://www.ncbi.nlm.nih.gov/geo |
| ArrayExpress | Functional genomics data repository hosted by EBI, including >1.8M samples. | https://www.ebi.ac.uk/arrayexpress |
| GTEx | Transcriptomic profiles of normal tissues, including >7K samples across >45 tissue types. | http://www.gtexportal.org |
| CCLE | Genetic and pharmacologic characterization of >1,000 cancer cell lines. | http://www.broadinstitute.org/ccle |
| Human Protein Atlas | Expression of >17K unique proteins in cell lines, normal, and cancer tissues. | http://www.proteinatlas.org |
| Human Proteome Map | Expression of >30K proteins in normal tissues. | http://humanproteomemap.org |
| StringDB | Protein-protein interactions for >9M proteins from >2K organisms. | http://string-db.org |
| ENCODE | Protein-DNA interactions, including >1.4K ChIP-Seq experiments across 200 cell lines. | http://genome.ucsc.edu/ENCODE |
| Project Achilles | Genetic vulnerabilities across >100 genomically characterized cancer cell lines by genome-wide genetic perturbation reagents (shRNAs or Cas9/sgRNAs), including >11.2K genes. | http://www.broadinstitute.org/achilles |
| LINCS | Cellular responses upon the treatment of chemical/genetic perturbagen, including >1M gene expression profiles representing >5,000 compounds and >3,500 genes (shRNA and overexpression) in >15 cell lines. | http://lincscloud.org |
| Genomics of Drug Sensitivity in Cancer project | Drug sensitivity data of 140 drugs in >700 cancer cell lines. | http://www.cancerrxgene.org |
| ChEMBL | Bioactivities for drug-like small molecules, including >10K targets, >1.7M distinct compounds, and >13.5M activities. | https://www.ebi.ac.uk/chembl |
| PubChem | Chemical compounds and bioassay experiments, including >60M unique chemical compounds and >1.1M assays. | http://pubchem.ncbi.nlm.nih.gov |
| CMap | >6,000 drug gene expression profiles representing 1,309 compounds tested in 3 main cell lines. | http://www.broadinstitute.org/cmap |
| CTRP | Links genetic, lineage, and other cellular features of cancer cell lines to small-molecule sensitivity, including 860 cell lines and 461 compounds. | http://www.broadinstitute.org/ctrp.v2.2 |
| ImmPort | Clinical assessments in immunology along with molecular profiles, including 143 clinical studies/trials and 799 experiments on >22.4K subjects. | https://immport.niaid.nih.gov |
| ClinicalTrials.gov | Registry and results database of publicly and privately supported clinical studies, including >201.7K studies. | https://clinicaltrials.gov |
| PharmGKB | Genetic variations on drug response, including >3K diseases, >27K genes, and >3K drugs. | https://www.pharmgkb.org |

CCLE, Cancer Cell Line Encyclopedia; CMap, Connectivity Map; CNVs, copy number variants; COSMIC, catalog of somatic mutations in cancer; CTRP, Cancer Therapeutics Response Portal; dbSNP, Single Nucleotide Polymorphism Database; dbVar, database of genomic structural variation; EBI, European Bioinformatics Institute; ENCODE, Encyclopedia of DNA Elements; GEO, Gene Expression Omnibus; GTEx, Genotype-Tissue Expression; ImmPort, Immunology Database and Analysis Portal; LINCS, Library of Integrated Network-based Cellular Signatures; NCBI, National Center for Biotechnology Information; SNPs, single-nucleotide polymorphisms; TCGA, The Cancer Genome Atlas.
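The tumor-versus-normal comparison that underlies expression-based target discovery can be illustrated with a per-gene two-sample statistic. The following is only a minimal sketch, assuming expression values are already normalized and log-transformed; the gene names, the values, and the `min_abs_t` cutoff are hypothetical, and Welch's t-statistic is used here as a simple stand-in rather than the method of any study cited in this review.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(disease, normal):
    """Welch's t-statistic for one gene: disease versus normal expression."""
    se = sqrt(variance(disease) / len(disease) + variance(normal) / len(normal))
    return (mean(disease) - mean(normal)) / se


def rank_genes(expression, min_abs_t=2.0):
    """Return (gene, t) pairs with |t| >= min_abs_t, most elevated first.

    `expression` maps gene -> (disease_values, normal_values). The cutoff
    is an arbitrary illustrative choice, not a significance threshold.
    """
    scored = [(gene, welch_t(d, n)) for gene, (d, n) in expression.items()]
    kept = [(g, t) for g, t in scored if abs(t) >= min_abs_t]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical log2 expression values, invented for illustration only.
toy_expression = {
    "GENE_UP": ([8.1, 8.4, 7.9, 8.6], [5.0, 5.3, 4.8, 5.1]),
    "GENE_FLAT": ([6.0, 6.2, 5.9, 6.1], [6.1, 5.9, 6.0, 6.2]),
    "GENE_DOWN": ([3.0, 3.2, 2.9, 3.1], [6.0, 6.3, 5.8, 6.1]),
}

ranked = rank_genes(toy_expression)
for gene, t in ranked:
    print(f"{gene}\tt = {t:.1f}")
```

In practice, a genome-wide analysis would also correct for multiple testing and batch effects, which this sketch deliberately omits.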
significantly linked with poor patient survival. They further stratified patients into clinically relevant subgroups (estrogen-receptor positive, HER2 positive, and triple negative breast cancer (TNBC) tumors) and observed that MTBP is expressed at higher levels in the triple negative tumor subgroup than in the other two subgroups. Further knockdown of MTBP significantly impaired TNBC tumor growth in vivo. In another example, analysis of clear cell renal cell carcinoma samples from TCGA indicated that TPL2 overexpression was significantly related to the presence of metastases and poor outcome in clear cell renal cell carcinoma.10 Silencing of TPL2 inhibited cell proliferation, clonogenicity, anoikis resistance, migration, and invasion capabilities and inhibited orthotopic xenograft growth and lung metastasis, demonstrating the significant role of TPL2 in disease progression. In addition, public gene expression databases, such as TCGA, have regularly been used as a source of reference to confirm gene expression. One example is the confirmation of HMMR in the study of glioblastoma.11
The targets in these previous examples were first proposed by the authors and were then confirmed by the analysis of public gene expression data. By contrast, targets can also be directly discovered through the primary analysis of gene expression data. Without any specific targets in mind, Hsu et al.12 sought druggable kinases that are oncogenic in TNBC. By analyzing gene expression data from CCLE and the National Cancer Institute-60 panel of cancer cell lines, and gene expression profiles of breast tumor-initiating cells, they found 13 kinases with higher mRNA expression in TNBC cell lines than in non-TNBC cell lines. Subsequent protein expression validation reduced the candidate list to eight kinases, which were further correlated to TNBC clinical subtype samples in TCGA. Among these eight kinases, three kinases (PKC-α, CDK6, and MET) with high expression were associated with shorter overall survival in patients with TNBC, suggesting their potential as prognostic markers and therapeutic targets. In the subsequent functional validation, two-drug combinations targeting these three kinases inhibited TNBC cell proliferation and tumorigenic potential, and a combination of PKC-α and MET inhibitors attenuated tumor growth in vivo.
Analyzing the samples from a single data source may limit the broader application of the findings because of biological and technical bias. Meta-analysis that is aimed at detecting consistent changes across multiple data sources may increase statistical power and further mitigate the bias. The availability of public datasets enables researchers to perform meta-analysis of microarray datasets for many diseases. Our colleagues Kodama et al.13 proposed a meta-analysis approach: a gene expression-based GWAS that searches for genes repeatedly implicated in multiple experiments. They carried out an expression-based GWAS for type 2 diabetes by using 1,175 samples collected from 130 independent microarray experiments and identified the immune-cell receptor CD44 as the top candidate. They further validated that CD44 deficiency ameliorated adipose tissue inflammation and
Figure 1 Public datasets can be leveraged to identify new targets, drug indications, and drug response biomarkers.

Figure 2 An illustration of big data approaches to identifying new targets. Left panel, target discovery using big data: gene expression data (expression profiling of disease and normal samples yields differentially expressed genes), somatic mutation data (DNA sequencing of disease and normal samples yields genetically altered genes), and genetic association data (SNP arrays of patients and non-patients yield disease risk genes, plotted as -log10(p-value) against position on chromosome). Right panel, experimental validation: expression validation in vitro and in vivo (protein expression using western blot, mRNA expression using RT-PCR, protein expression and location using immunohistochemistry) and functional validation in vitro and in vivo (cell viability after loss of gene function in vitro, tumor growth after loss of gene function in vivo).
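The idea of searching for genes that are repeatedly implicated across independent experiments can be illustrated with simple vote counting. The sketch below is an assumption-laden simplification and not the expression-based GWAS procedure of Kodama et al.: it takes, for each dataset, the set of genes already called differentially expressed, and keeps genes that recur in at least `min_count` datasets. The gene names, the hit lists, and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def recurrent_genes(experiments, min_count):
    """Genes flagged in at least `min_count` independent experiments.

    `experiments` is a list of sets, one per dataset, each holding the
    genes called differentially expressed in that dataset.
    """
    counts = Counter(gene for hits in experiments for gene in hits)
    recurrent = [(g, c) for g, c in counts.items() if c >= min_count]
    # Most frequently implicated first; ties broken alphabetically.
    return sorted(recurrent, key=lambda gc: (-gc[1], gc[0]))


# Hypothetical per-dataset hit lists, invented for illustration only.
toy_experiments = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_A", "GENE_C"},
    {"GENE_A", "GENE_D"},
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_A", "GENE_C"},
]

consistent = recurrent_genes(toy_experiments, min_count=4)
print(consistent)
```

Vote counting ignores effect sizes and sample sizes, so a real meta-analysis would typically combine per-study statistics instead; the sketch only conveys why consistency across sources reduces dependence on any single dataset.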
insulin resistance and anti-CD44 treatment decreased blood glucose levels and adipose macrophage infiltration. In another example, our colleagues Chen et al.14 analyzed 13 independent non-small cell lung cancer (non-SCLC) gene expression datasets consisting of 2,026 lung samples collected from Gene Expression Omnibus (GEO). They identified 11 genes that were consistently overexpressed across all the samples, including the protein kinase PTK7. Immunostaining revealed that PTK7 was highly expressed in primary adenocarcinoma patient samples. They verified that RNA interference-mediated attenuation of PTK7 decreased cell viability and increased apoptosis in a subset of adenocarcinoma cell lines and that loss of PTK7 impaired tumor growth in xenotransplantation assays, suggesting its potential as a novel therapeutic target in non-SCLC.

Target discovery using somatic mutation data
Many complex diseases are caused by alterations of DNA sequences. Targeting genetic alterations is thus an ideal approach to find therapeutic solutions. Recent advances in DNA sequencing technologies enabled large-scale characterization of disease samples. Analyzing molecular data of these samples plays an essential role in identifying alterations responsible for disease. TCGA is one notable example that molecularly characterized >10,000 tumor samples from over 30 cancers using multiple technologies.15 The large-scale analysis of tumor samples suggested that an average of 33 to 66 genes harbor somatic mutations that could alter the function of their protein products and that 140 genes can promote tumorigenesis.16 Most human cancers are caused by two to eight sequential alterations that lead to a selective growth advantage of the cells in which they reside.16 These alterations have been widely explored as therapeutic targets. Representative examples include EGFR amplification in lung cancer,17 BRAF mutation in melanoma,18 and ALK translocations in lung cancer.19
A cancer that possesses a genomic alteration may be treated by a drug that targets this alteration, even though this drug was not originally discovered for this tumor type. For instance, KIT was discovered as a target for chronic myelogenous leukemia and later it was discovered as a target in gastrointestinal stromal tumors, leading to the repositioning of the KIT inhibitor, Imatinib, for treating patients with KIT-positive gastrointestinal stromal tumors.20 Rubio-Perez et al.21 recently collected and analyzed somatic mutations, copy-number alterations, fusion genes, and RNA-Seq expression data of 4,068 tumors in 16 cancer types in TCGA and collected somatic mutations for 2,724 additional tumors. They identified 459 mutational driver genes and 38

drivers acting via copy-number alterations or fusions. After mapping these driver genes to drug databases, including ChEMBL and ClinicalTrials.gov, they found that up to 73.3% of patients could benefit from agents in clinical stages. This in silico analysis showed the potential of targeting genomic alterations for individual tumors, yet experimental validation is needed before wide clinical application. The recent launch of the National Cancer Institute-Molecular Analysis for Therapy Choice program, which aims to identify targets and therapeutics for individual patients solely based on mutations, demonstrates broad interest in this approach to target selection.
Analysis of genomic features from a wide range of cancers revealed that a large fraction of driver genes are either undruggable or are tumor suppressors, which usually cannot be targeted by drugs.16 Targeting their downstream or upstream-dependent components may bypass this problem. For example, inactivating mutations of the tumor suppressors BRCA1 or BRCA2 lead to activation of a downstream pathway required to repair DNA damage. Poly ADP-ribose polymerase, a family of proteins involved in DNA repair, was subsequently developed as a therapeutic target for those with absence of BRCA function.22 In addition to BRCA, defects in the DNA-damage response, a complex network of proteins required for cell-cycle checkpoint and DNA repair, have been associated with tumorigenesis, yet are undruggable. Squatrito et al.23 assessed genes encoding key components of the DNA-damage response from the glioma samples in TCGA and found that 3.2% of these samples showed somatic mutations in ATR, ATM, or CHEK1 and 36% of these samples presented genomic loss of at least one copy of ATR, ATM, CHEK1, or CHEK2, suggesting tumor suppressor activity of the ATM/Chk2/p53 pathway. Further experiments confirmed that the loss of ATM/Chk2/p53 pathway components accelerates tumor development. Hence, it would be interesting to target the components involved in this pathway.

Target discovery using genetic association data
Recent GWASs have identified common DNA sequence variants that contribute to many human diseases. An increasing number of studies demonstrate that genes with disease-associated alleles may be promising drug targets, as shown by the list of targets validated by genetics.24 In one example, the analysis of patients with familial hypercholesterolemia revealed that mutations in the low-density lipoprotein receptor gene cause high levels of low-density lipoprotein cholesterol and an increased risk of heart disease, leading to the subsequent discovery of the statin class of HMG-CoA reductase inhibitors. In another example, rare gain-of-function mutations in the PCSK9 gene were found in families with high low-density lipoprotein levels and an increased incidence of coronary heart disease, and subsequent functional studies and clinical trials revealed that the loss of function of PCSK9 significantly reduced low-density lipoprotein cholesterol levels.
By evaluating 10 million SNPs, Okada et al.25 recently performed a GWAS meta-analysis in a total of >100,000 subjects of European and Asian ancestries comprising 29,880 rheumatoid arthritis cases and 73,758 controls. They discovered 42 novel rheumatoid arthritis risk loci, bringing the total to 101 rheumatoid arthritis risk loci. These loci were connected to 98 genes. They demonstrated that the gene list expanded from those 98 genes via protein-protein interaction networks significantly overlaps with the targets of the drugs approved for rheumatoid arthritis. This suggested that other genes among those 98 might be therapeutic targets. Nelson et al.26 performed a large-scale evaluation of genetic support in target selection. They collected 16,459 gene-medical subject heading pairs consisting of 2,531 traits and 7,253 genes associated with traits from public genetic databases, and collected 19,085 target-medical subject heading pairs from drug databases. The significant enrichment of known targets in the list of variant genes suggested that selecting genetically supported targets could increase the success rate in clinical development.
Because GWASs are often not able to identify the causal relation between variant and disease, combining genetic analysis with other types of evidence may increase the likelihood of selecting a good target. We recently integrated gene expression with disease-associated SNPs and therapeutic target datasets across a diverse set of 56 diseases in 12 disease categories.27 We systematically evaluated how successfully differentially expressed genes, disease-associated SNPs, or the combination of both could recover known disease targets. We observed that the combination of differentially expressed genes and SNPs has more predictive power than each feature alone. This suggested that linking differentially expressed genes with SNPs improves the accuracy of prioritizing candidate targets.

Leveraging public datasets for target validation
The de novo analysis of data discussed above can be used to produce a list of candidate targets. In order to prioritize targets for time-consuming experimental validation, one needs to first assess their novelty and commercialization potential.28 In addition, a good target in a preclinical study should satisfy the following criteria: (1) it should be druggable; (2) it should be expressed only in the abnormal cells of clinical samples and not, or barely, expressed in normal cells; and (3) the modulation of the target has the potential to reverse disease phenotype.
Public datasets can be leveraged to assess these criteria. Druggability can be assessed through an integrative analysis of protein functional class, homology to targets of approved drugs, three-dimensional structure, and the existence of published active small molecules.29 We may search a candidate target's mRNA expression in cell lines (data from CCLE and ref. 30), patients (data from TCGA and GEO), and normal tissues (data from GTEx). We may also search its protein expression in cell lines (data from The Human Protein Atlas), patient tissues (data from The Human Protein Atlas and TCGA), and normal tissues (data from The Human Protein Atlas and the Human Proteome Map31). We may further infer its function through recent high throughput experiments. For example, Cowley et al.32 used a genome-scale, lentivirally delivered shRNA library to perform massively parallel pooled shRNA screens in 216 cancer cell lines and identified genes that are essential for cell proliferation and/

Figure 3 An illustration of big data approaches to identifying new drug indications.
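The enrichment argument in the genetic association discussion above (known drug targets being over-represented among genetically supported genes) corresponds to a one-sided hypergeometric test. The sketch below is a generic illustration under stated assumptions, not the analysis performed by Nelson et al. or Okada et al.; the function name and the counts in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(universe, supported, known_targets, overlap):
    """One-sided hypergeometric P(X >= overlap).

    universe:       total number of genes considered
    supported:      genes with genetic support for the trait
    known_targets:  genes that are targets of approved drugs
    overlap:        genes that fall in both sets
    """
    upper = min(supported, known_targets)
    total = comb(universe, known_targets)
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(
        comb(supported, k) * comb(universe - supported, known_targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, invented for illustration only: 20 genes in total,
# 5 genetically supported, 4 known targets, 3 of which are supported.
p = enrichment_p_value(universe=20, supported=5, known_targets=4, overlap=3)
print(f"p = {p:.4f}")
```

A small p-value here would indicate that the observed overlap is unlikely under random selection of targets, which is the sense in which genetic support is said to be enriched among successful targets.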
or viability. Essential genes in 72 breast, pancreatic, and ovarian cancer cell lines were inferred using a lentiviral shRNA library targeting 16,000 genes.33 Essential genes in a few human cancer cell lines were also characterized recently using the bacterial CRISPR system.34 Target function can even be inferred through the measurement of gene expression changes upon genetic perturbation (data available in Library of Integrated Network-based Cellular Signatures).

Outstanding challenges
First, measurements made from disease samples may have poor quality. Recent studies indicated that a large number of tumor samples are impure because they contain a mixture of immune cells and stromal cells.35 Second, large technical and biological variation exists among samples. Third, the quality of reagents, especially antibodies, varies widely.36 The misuse of antibodies may directly lead to the failure of experiments. Last, although datasets from high throughput experiments are useful either as a reference tool to detect expression or as a tool to infer biological function, they occasionally give false signals, resulting in the misclassification of potentially good targets.

LEVERAGE BIG DATA TO IDENTIFY NEW DRUG INDICATIONS FOR PRECLINICAL STUDIES
Since discovering a new chemical entity is a very long and complicated process, we will mainly discuss the reuse of existing drugs (referred to as drug repositioning), which offers a relatively short approval process and straightforward path to clinical translation. Computational approaches for drug repositioning have been reviewed previously.37,38 Figure 3 illustrates three common big data approaches that use different features to discover new drug indications, and basic experimental approaches to validate them. We will first discuss these three approaches and then discuss the discovery of new drug combinations. Finally, we will argue that public datasets can be used to validate drug indications before time-consuming experiments.

Indication discovery using drug-target data
Targeting an individual alteration using either a small or a large molecule remains the main paradigm in drug discovery. This approach has led to the discovery of many successful drugs, such as trastuzumab (HER2 in breast cancer), crizotinib (ALK in non-SCLC), and dabrafenib (BRAF in melanoma). When a new

target is proposed for a disease, existing drugs that interfere with this target can be searched from the literature or drug-target databases (e.g., DrugBank39 and ChEMBL) and their potential new usage is further validated by experiments. This approach is commonly practiced. If there is no drug available for this target, structure-based design, such as homology modeling, can be used to infer new drug hits.

Indication discovery using gene expression data
Another common approach is to look for inverse drug-disease relationships by comparing disease molecular features and drug molecular features, such as gene expression. This approach starts with the creation of a disease gene expression signature by comparing disease samples and normal tissue samples, followed by querying drug-gene expression databases, such as Connectivity Map (CMap) and Library of Integrated Network-based Cellular Signatures. For example, our colleagues Dudley et al.40 and Sirota et al.41 performed large-scale analysis of gene expression profiles across over 100 diseases using microarray data from GEO and mapped disease signatures to over 100 drug signatures in CMap. Using this systems approach, they repurposed the anticonvulsant topiramate for the treatment of inflammatory bowel disease and the antiulcer drug cimetidine for the treatment of lung adenocarcinoma. Our colleagues Jahchan et al.42 used a similar systematic drug-repositioning bioinformatics approach to query a large compendium of gene expression profiles using a SCLC expression signature derived from GEO. They predicted antidepressant drugs for the treatment of SCLC and validated that this group of drugs potently induces apoptosis in both chemotherapy-naive and chemotherapy-resistant SCLC cells in culture, in mouse and human SCLC tumors transplanted into immunocompromised mice, and in endogenous tumors from a mouse model for human SCLC. This finding even led to the launch of a clinical trial (NCT01719861).

Van Noort et al.43 systematically assessed how well the known disease-drug indications were recapitulated by the expression-based inverse correlation of disease-drug relations for 40 individual diseases. They found that colorectal cancer is one of the diseases in which known disease-drug indications could be well recapitulated. This finding, together with the unmet clinical need in the treatment of metastasized colorectal cancer, led them to look for drugs that inhibit metastasis in colorectal cancer. Instead of a signature built by comparing disease samples and normal samples, they built a gene signature of metastatic potential by comparing nonmetastatic tumors vs. metastatic primary tumors. By querying the CMap V2 using this signature, they predicted three novel compounds against colorectal cancer: citalopram, troglitazone, and enilconazole, and verified these compounds by in vitro assays of clonogenic survival, proliferation, and migration and in a subcutaneous mouse model.

Although drugs in these previous examples were validated in preclinical models, the question of whether the disease gene expression was really reversed in disease models remains unanswered. A recent study in a mouse model of dyslipidemia found that treatments that restore gene expression patterns to their norm are associated with the successful restoration of physiological markers to their baselines, providing a sound basis for this computational approach.44

Other studies have used slightly different approaches. For example, instead of building a universal signature for one disease, Zerbini et al.45 considered the variation of individual patients. They built a disease signature for individual patients with clear cell renal cell carcinoma and predicted drugs for individual patients. Pentamidine, one of the common drugs shared by all the patients, showed its efficacy in vitro and in the 786-O human clear cell renal cell carcinoma xenograft mouse model. Brum et al.46 profiled gene expression in human mesenchymal stromal cells differentiating toward osteoblasts and created a signature of significantly regulated genes. They found that the signature of parbendazole matches the expression changes observed for osteogenic human mesenchymal stromal cells, suggesting that parbendazole could stimulate osteoblast differentiation. They further validated that parbendazole induced osteogenic differentiation through a combination of cytoskeletal changes.

Indication discovery using other sources
Many other molecular and clinical features, including side effects, genetic variation, and chemical structure, have been leveraged for drug repositioning. We highlight some exciting findings here and refer other findings to our recent review on the trend of computational drug repositioning.38 Our colleagues Paik et al.47 extracted clinical features from over 13 years of EMRs, including >9.4 M laboratory tests of >530,000 patients, in addition to diverse genomics features. With these features, they computed drug-drug similarity and disease-disease similarity. Based on the assumption that similar diseases can be treated with similar drugs, they inferred 3,891 new indications that were previously not known to be associated. Among those new indications, terbutaline sulfate was indicated as a potential drug for amyotrophic lateral sclerosis treatment and was further validated in an in vivo zebrafish model of amyotrophic lateral sclerosis.

Iorio et al.48 built a drug-drug similarity matrix using the gene expression data from CMap and verified an unexpected similarity between CDK2 inhibitors and topoisomerase inhibitors. They also found that a Rho-kinase inhibitor might be reused as an enhancer of cellular autophagy, potentially applicable to several neurodegenerative disorders. This work was further extended in a recent study in which glipizide and splitomicin were found to perturb microtubule function through a semisupervised approach.49

Discovery of new drug combinations
As many diseases are driven by complex molecular and environmental interactions, targeting a single component may not be sufficient to disrupt these complex interactions; thus, there is increasing interest in targeting multiple molecules using combined drugs or multitarget inhibitors. Using big data to predict drug synergy is appealing, yet challenging. In a recent community-based open challenge for drug synergy predictions, among the 31 submitted methods, only three methods performed significantly better than random chance.50 Nevertheless, a few interesting combinations have been found through a big data mining approach. Mitrofanova
et al.51 assumed that if a drug could downregulate the activated target genes and upregulate the repressed targets of a master regulator (e.g., a key transcription factor), then the drug could reverse the activity of the master regulator. Using the drug signatures derived from genetically engineered mouse models, they identified drugs to reverse the master regulator pair, FOXM1/CENPF, which is essential for prostate tumor malignancy. They further extended the concept that effective drug combinations should induce a more significant reversal of master regulator-specific regulon expression, compared to the individual drugs. The combination of rapamycin + PD0325901 was predicted to have the strongest reversion of the FOXM1/CENPF activity, both with respect to the total number of targets affected by both drugs and the number of unique targets affected by each drug. Their synergistic effect was validated in mouse and human prostate cancer models. Sun et al.52 demonstrated that using genomic and network characteristics could yield good performance in predicting synergistic drugs for cancer. They confirmed 63.6% of their predictions for breast cancer through experimental validation and literature search, and identified that the combination of erlotinib and sorafenib has strong synergy and low toxicity in a zebrafish MCF7 xenograft model.

Leveraging public datasets to validate new drug indications
Public datasets can be leveraged to validate drug hits and understand drug mechanisms. For example, drug efficacy and toxicity in vitro or in vivo may be searched from the drug-sensitivity databases (e.g., CCLE, ChEMBL, canSAR53) and toxicity databases (e.g., CTD54), respectively. Drug efficacy can be inferred from EMRs as well. Xu et al.55 recently demonstrated the usage of EMRs in the validation of drug-disease pairs through a case study of metformin associated with reduced cancer mortality. Our colleagues Khatri et al.56 validated the beneficial effect of atorvastatin on graft survival by retrospective analysis of EMRs of a single-center cohort of 2,515 renal transplant patients followed for up to 22 years.

To understand drug mechanisms, the models,57,58 which were built by leveraging public datasets, can be used. Woo et al.58 recently built a computational model called DEMAND to infer drug targets in a disease model (e.g., cell line) by using drug-gene expression profiles and a regulatory network of the disease model. Their model recovered the established proteins involved in the mechanism of action for 70% of the tested compounds and revealed altretamine, an anticancer drug, as an inhibitor of GPX4 lipid repair activity.

Outstanding challenges
First, selecting appropriate preclinical models from a large number of available models is often challenging during the validation stage, as some validation models may not be reliable per se, or the molecular features of some models may be quite different from those used for the prediction.59 We recently identified that half of the hepatocellular carcinoma cell lines are not significantly correlated to the hepatocellular carcinoma tumors from TCGA using gene expression features.60 Domcke et al.61 identified a few rarely used ovarian cancer cell lines that more closely resembled ovarian tumors than commonly used cell lines by analyzing a variety of genomic features. In addition to choosing the appropriate preclinical models, moving preclinical findings into the clinic is challenging. One drug or one drug combination validated successfully in preclinical models may fail to translate into the clinic because of the concerns of high toxicity, high cost, low bioavailability, or many other factors. Our recent follow-up analysis of the previous work on drug combinations in clinical trials62 revealed that a drug is more likely to be combined with existing therapies and a brand name drug is rarely combined with another brand name drug (unpublished), suggesting the necessity of considering the characteristics of clinical trials during preclinical studies.

LEVERAGE BIG DATA TO IDENTIFY DRUG RESPONSE BIOMARKERS IN THE ERA OF PRECISION MEDICINE
Because drugs are mostly discovered based on disease molecular features, it is natural that they should be applied to those patients possessing these molecular features. A number of existing drugs have been proven to be effective only for a group of patients with specific molecular features: for example, trastuzumab for patients with HER2-positive breast cancer. Identifying molecular features (or biomarkers) for predicting drug response is critical for identifying the right patient populations for any drug under investigation.63 Figure 4 shows two big data approaches to identify biomarkers for predicting drug response, and experimental approaches to validate biomarkers.

Figure 4 An illustration of big data approaches to identifying new drug response biomarkers.

Biomarker discovery using genomic and pharmacogenomics data from preclinical samples
The recent large-scale generation of pharmacogenomics data in preclinical disease models (especially cell lines) and molecular characterization of these models enable researchers to identify biomarkers for predicting drug response. By integrating pharmacological profiles for 24 anticancer drugs across 479 cell lines with the gene expression, copy number, and mutation data of these cell lines, Barretina et al.64 identified genetic, lineage, and gene-expression-based biomarkers of drug sensitivity. They highlighted a few cases: plasma cell lineage for IGF1 receptor inhibitors, AHR expression for MEK inhibitors, and SLFN11 expression for topoisomerase inhibitors. Kim et al.65 identified three distinct target/response-indicator pairings including NLRP3 mutation/inflammasome activation for FLIP addiction, co-occurring KRAS and LKB1 mutation for COPI addiction, and a seven-gene expression signature for a synthetic indolotriazine. Basu et al.66 quantitatively measured the sensitivity of 242 molecularly characterized cancer cell lines to 354 small molecules and created the Cancer Therapeutics Response Portal that enables researchers to correlate genetic features to sensitivity. Using their portal, they identified that activating mutations in the oncogene β-catenin could predict sensitivity of the BCL-2 family antagonist navitoclax. Their subsequent work expanded the portal to 860 cell lines and 481 compounds including 70 US Food and Drug Administration-approved agents, 100 clinical candidates, and 311 small-molecule probes,67 allowing researchers to identify biomarkers for a larger number of drugs. Several other similar sources include 77 therapeutic compounds in 50 breast cancer cell lines68 and 90 drugs in 51 stable cancer cell lines.69

Biomarker discovery using genomic data from clinical samples
Biomarkers can also be detected by comparing genomic profiles of clinical samples. Outstanding examples include the finding of EGFR mutations as a predictor of sensitivity to gefitinib,70 and a 12-gene colon cancer recurrence score as a predictor of recurrence in patients with stage II and III colon cancer treated with fluorouracil and leucovorin.71 O’Connell et al.71 performed quantitative reverse transcription polymerase chain reaction of 375 genes in four independent cohorts consisting of 1,851 patients with stage II or III colon cancer. These patients were either treated with surgery alone or surgery plus fluorouracil/leucovorin and their recurrence-free interval at three years was observed. Of 375 genes, 48 genes were significantly associated with risk of recurrence and 66 genes were significantly associated with fluorouracil/leucovorin benefit. From these genes, seven genes were selected based on their biology and the strength of association with outcomes. Expression of these seven genes was normalized against five reference genes, leading to the development of a recurrence score used to predict the risk of recurrence. The recurrence score was subsequently validated in independent clinical studies.72

Outstanding challenges
Lack of effective biomarkers may lead to the failure of clinical trials, whereas biomarkers are only detected or confirmed through clinical trials. The complexity and large variation of clinical trials may cause some important biomarkers to be missed in the original study. This issue can be mitigated through an integrative analysis of clinical trials across multiple studies. Unfortunately, a large number of trials are still not available to the public. Open clinical trial data becomes necessary in order to identify more effective biomarkers for current therapies or even rescue failed drugs via identifying the right patient populations.

PERSPECTIVES
One belief of the current drug discovery paradigm is that thoroughly understanding molecular changes of diseases will ultimately lead to the discovery of new therapeutics. In order to
capture molecular changes of disease and changes upon drug interventions, the molecular profiles have to be presented in an accessible format, which we now consider big data. There is no doubt that the profiles we have created will quickly become small sets because of rapid advances in technologies. In the near future, much larger volumes and complex datasets will be created to characterize disease systems: from single cells to organs, from cancer cells to microorganisms, from cell lines to genetically modified mice to individual patients, and from one time point to the longitudinal course of treatment. The incredible number of targets, drugs, and biomarkers discovered by leveraging big datasets in the past years suggests an unprecedented opportunity to leverage them to transform discovery now.

Given the volumes and complexity of datasets for drug discovery, no single person or team could comprehend or use all of them; therefore, it is necessary to reengineer the entire pipeline of drug discovery, where every step is driven by data and rigorous data models. Example steps include the selection of appropriate tissue samples to profile, the selection of appropriate models to validate hypothesis, etc. In addition, high performance computing allows us to generate hypotheses very quickly, but the current experimental settings limit the validation efforts. It is often true that the validation of a drug in preclinical models takes over 10 times longer than the prediction. New sharing economy inspired sources for biomedical research, such as Science Exchange (http://scienceexchange.com) and Assay Depot (http://assaydepot.com) could facilitate running experiments using external sources. More efficient ways are expected to quickly transform big data discoveries into clinical applications.

ACKNOWLEDGMENTS
We thank Drs. Marina Sirota, Mei-Sze Chua, Hyojung Paik, Dvir Aran, and Shann-Ching Chen for critical comments on the manuscript.

SOURCE OF FUNDING
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM079719. The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

CONFLICT OF INTEREST/DISCLOSURE
Atul Butte is a scientific advisor to Assay Depot, and a founder and scientific advisor to NuMedii, Inc. Bin Chen is a consultant to NuMedii, Inc.

© 2015 The Authors Clinical Pharmacology & Therapeutics published by Wiley Periodicals, Inc. on behalf of American Society for Clinical Pharmacology and Therapeutics.
This is an open access article under the terms of the Creative Commons Attribution–NonCommercial–NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non–commercial and no modifications or adaptations are made.

1. EMBL–European Bioinformatics Institute EMBL-EBI Annual Scientific Report 2014. <https://www.embl.de/aboutus/communication_outreach/publications/ebi_ar/ebi_ar_2014.pdf>.
2. Marx, V. Biology: the big challenges of big data. Nature 498, 255–260 (2013).
3. Barabasi, A.L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011).
4. Chen, B. & Butte, A.J. Network medicine in disease analysis and therapeutics. Clin. Pharmacol. Ther. 94, 627–629 (2013).
5. Mantione, K.J. et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med. Sci. Monit. Basic Res. 20, 138–142 (2014).
6. Barrangou, R., Birmingham, A., Wiemann, S., Beijersbergen, R.L., Hornung, V. & Smith, A. Advances in CRISPR-Cas9 genome engineering: lessons learned from RNA interference. Nucleic Acids Res. 43, 3407–3419 (2015).
7. Kannan, L. et al. Public data and open source tools for multi-assay genomic investigation of disease. Brief. Bioinform. (2015); e-pub ahead of print.
8. Doshi, P., Goodman, S.N. & Ioannidis, J.P. Raw data from clinical trials: within reach? Trends Pharmacol. Sci. 34, 645–647 (2013).
9. Grieb, B.C., Chen, X. & Eischen, C.M. MTBP is overexpressed in triple-negative breast cancer and contributes to its growth and survival. Mol. Cancer Res. 12, 1216–1224 (2014).
10. Lee, H.W. et al. Tpl2 kinase impacts tumor growth and metastasis of clear cell renal cell carcinoma. Mol. Cancer Res. 11, 1375–1386 (2013).
11. Tilghman, J. et al. HMMR maintains the stemness and tumorigenicity of glioblastoma stem-like cells. Cancer Res. 74, 3168–3179 (2014).
12. Hsu, Y.H. et al. Definition of PKC-α, CDK6, and MET as therapeutic targets in triple-negative breast cancer. Cancer Res. 74, 4822–4835 (2014).
13. Kodama, K. et al. Expression-based genome-wide association study links the receptor CD44 in adipose tissue with type 2 diabetes. Proc. Natl. Acad. Sci. USA 109, 7049–7054 (2012).
14. Chen, R. et al. A meta-analysis of lung cancer gene expression identifies PTK7 as a survival gene in lung adenocarcinoma. Cancer Res. 74, 2892–2902 (2014).
15. International Cancer Genome Consortium et al. International network of cancer genome projects. Nature 464, 993–998 (2010).
16. Vogelstein, B., Papadopoulos, N., Velculescu, V.E., Zhou, S., Diaz, L.A. Jr. & Kinzler, K.W. Cancer genome landscapes. Science 339, 1546–1558 (2013).
17. Sharma, S.V., Bell, D.W., Settleman, J. & Haber, D.A. Epidermal growth factor receptor mutations in lung cancer. Nat. Rev. Cancer 7, 169–181 (2007).
18. Chapman, P.B. et al. Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N. Engl. J. Med. 364, 2507–2516 (2011).
19. Kwak, E.L. et al. Anaplastic lymphoma kinase inhibition in non-small-cell lung cancer. N. Engl. J. Med. 363, 1693–1703 (2010).
20. de Silva, C.M. & Reid, R. Gastrointestinal stromal tumors (GIST): C-kit mutations, CD117 expression, differential diagnosis and targeted cancer therapy with imatinib. Pathol. Oncol. Res. 9, 13–19 (2003).
21. Rubio-Perez, C. et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell 27, 382–396 (2015).
22. Farmer, H. et al. Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature 434, 917–921 (2005).
23. Squatrito, M., Brennan, C.W., Helmy, K., Huse, J.T., Petrini, J.H. & Holland, E.C. Loss of ATM/Chk2/p53 pathway components accelerates tumor development and contributes to radiation resistance in gliomas. Cancer Cell 18, 619–629 (2010).
24. Plenge, R.M., Scolnick, E.M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).
25. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2014).
26. Nelson, M.R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47, 856–860 (2015).
27. Fan-Minogue, H., Chen, B., Sikora-Wohlfeld, W., Sirota, M. & Butte, A.J. A systematic assessment of linking gene expression with genetic variants for prioritizing candidate targets. Pac. Symp. Biocomput. 383–394 (2015).
28. Knowles, J. & Gromo, G. A guide to drug discovery: target selection in drug discovery. Nat. Rev. Drug Discov. 2, 63–69 (2003).
29. Patel, M.N., Halling-Brown, M.D., Tym, J.E., Workman, P. & Al-Lazikani, B. Objective assessment of cancer genes for drug discovery. Nat. Rev. Drug Discov. 12, 35–50 (2013).
30. Klijn, C. et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat. Biotechnol. 33, 306–312 (2015).
31. Kim, M.S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014).
32. Cowley, G.S. et al. Parallel genome-scale loss of function screens in 216 cancer cell lines for the identification of context-specific genetic dependencies. Sci. Data 1, 140035 (2014).
33. Marcotte, R. et al. Essential gene profiles in breast, pancreatic, and ovarian cancer cells. Cancer Discov. 2, 172–189 (2012).
34. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015).
35. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 2612 (2013).
36. Baker, M. Reproducibility crisis: blame it on the antibodies. Nature 521, 274–276 (2015).
37. Dudley, J.T., Deshpande, T. & Butte, A.J. Exploiting drug-disease relationships for computational drug repositioning. Brief. Bioinform. 12, 303–311 (2011).
38. Li, J., Zheng, S., Chen, B., Butte, A.J., Swamidass, S.J. & Lu, Z. A survey of current trends in computational drug repositioning. Brief. Bioinform. (2015); e-pub ahead of print.
39. Knox, C. et al. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. 39(Database issue), D1035–D1041 (2011).
40. Dudley, J.T. et al. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci. Transl. Med. 3, 96ra76 (2011).
41. Sirota, M. et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77 (2011).
42. Jahchan, N.S. et al. A drug repositioning approach identifies tricyclic antidepressants as inhibitors of small cell lung cancer and other neuroendocrine tumors. Cancer Discov. 3, 1364–1377 (2013).
43. van Noort, V. et al. Novel drug candidates for the treatment of metastatic colorectal cancer through global inverse gene-expression profiling. Cancer Res. 74, 5690–5699 (2014).
44. Wagner, A. et al. Drugs that reverse disease transcriptomic signatures are more effective in a mouse model of dyslipidemia. Mol. Syst. Biol. 11, 791 (2015).
45. Zerbini, L.F. et al. Computational repositioning and preclinical validation of pentamidine for renal cell cancer. Mol. Cancer Ther. 13, 1929–1941 (2014).
46. Brum, A.M. et al. Connectivity Map-based discovery of parbendazole reveals targetable human osteogenic pathway. Proc. Natl. Acad. Sci. USA 112, 12711–12716 (2015).
47. Paik, H. et al. Repurpose terbutaline sulfate for amyotrophic lateral sclerosis using electronic medical records. Sci. Rep. 5, 8580 (2015).
48. Iorio, F. et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc. Natl. Acad. Sci. USA 107, 14621–14626 (2010).
49. Iorio, F. et al. A semi-supervised approach for refining transcriptional signatures of drug response and repositioning predictions. PLoS One 10, e0139446 (2015).
50. Bansal, M. et al. A community computational challenge to predict the activity of pairs of compounds. Nat. Biotechnol. 32, 1213–1222 (2014).
51. Mitrofanova, A., Aytes, A., Zou, M., Shen, M.M., Abate-Shen, C. & Califano, A. Predicting drug response in human prostate cancer from preclinical analysis of in vivo mouse models. Cell Rep. 12, 2060–2071 (2015).
52. Sun, Y. et al. Combining genomic and network characteristics for extended capability in predicting synergistic drugs for cancer. Nat. Commun. 6, 8481 (2015).
53. Halling-Brown, M.D., Bulusu, K.C., Patel, M., Tym, J.E. & Al-Lazikani, B. canSAR: an integrated cancer public translational research and drug discovery resource. Nucleic Acids Res. 40(Database issue), D947–D956 (2012).
54. Davis, A.P. et al. The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015. Nucleic Acids Res. 43(Database issue), D914–D920 (2015).
55. Xu, H. et al. Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality. J. Am. Med. Inform. Assoc. 22, 179–191 (2015).
56. Khatri, P. et al. A common rejection module (CRM) for acute rejection across multiple organs identifies novel therapeutics for organ transplantation. J. Exp. Med. 210, 2205–2221 (2013).
57. Keiser, M.J. et al. Predicting new molecular targets for known drugs. Nature 462, 175–181 (2009).
58. Woo, J.H. et al. Elucidating compound mechanism of action by network perturbation analysis. Cell 162, 441–451 (2015).
59. Day, C.P., Merlino, G. & Van Dyke, T. Preclinical mouse cancer models: a maze of opportunities and challenges. Cell 163, 39–53 (2015).
60. Chen, B., Sirota, M., Fan-Minogue, H., Hadley, D. & Butte, A.J. Relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. BMC Med. Genomics 8 (suppl. 2), S5 (2015).
61. Domcke, S., Sinha, R., Levine, D.A., Sander, C. & Schultz, N. Evaluating cell lines as tumour models by comparison of genomic profiles. Nat. Commun. 4, 2126 (2013).
62. Wu, M., Sirota, M., Butte, A.J. & Chen, B. Characteristics of drug combination therapy in oncology by analyzing clinical trial data on ClinicalTrials.gov. Pac. Symp. Biocomput. 68–79 (2015).
63. Collins, F.S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372, 793–795 (2015).
64. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
65. Kim, H.S. et al. Systematic identification of molecular subtype-selective vulnerabilities in non-small-cell lung cancer. Cell 155, 552–566 (2013).
66. Basu, A. et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151–1161 (2013).
67. Seashore-Ludlow, B. et al. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov. 5, 1210–1223 (2015).
68. Heiser, L.M. et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl. Acad. Sci. USA 109, 2724–2729 (2012).
69. Martins, M.M. et al. Linking tumor mutations to drug responses via a quantitative chemical-genetic interaction map. Cancer Discov. 5, 154–167 (2015).
70. Paez, J.G. et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 304, 1497–1500 (2004).
71. O’Connell, M.J. et al. Relationship between tumor gene expression and recurrence in four independent studies of patients with stage II/III colon cancer treated with surgery alone or surgery plus adjuvant fluorouracil plus leucovorin. J. Clin. Oncol. 28, 3937–3944 (2010).
72. Yothers, G. et al. Validation of the 12-gene colon cancer recurrence score in NSABP C-07 as a predictor of recurrence in patients with stage II and III colon cancer treated with fluorouracil and leucovorin (FU/LV) and FU/LV plus oxaliplatin. J. Clin. Oncol. 31, 4512–4519 (2013).
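
As an illustration of the expression-based inverse matching described under "Indication discovery using gene expression data", the following Python sketch scores how strongly a drug signature opposes a disease signature. It is a toy simplification and not the scoring used by CMap or by any of the cited studies: it assumes each signature is a mapping from gene to log fold change, and it uses a Spearman rank correlation over the shared genes, so that a strongly negative value suggests reversal. All gene names, drug names, values, and function names are hypothetical.

```python
"""Toy sketch: rank drugs by how strongly they reverse a disease signature."""


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signature")
    return cov / (var_x * var_y) ** 0.5


def reversal_score(disease_sig, drug_sig):
    """Spearman correlation over shared genes; negative means reversal."""
    genes = sorted(set(disease_sig) & set(drug_sig))
    if len(genes) < 3:
        raise ValueError("too few shared genes to compare signatures")
    disease_values = [disease_sig[g] for g in genes]
    drug_values = [drug_sig[g] for g in genes]
    return pearson(average_ranks(disease_values), average_ranks(drug_values))


def rank_drugs(disease_sig, drug_sigs):
    """Return (drug, score) pairs, strongest predicted reversal first."""
    scored = [(name, reversal_score(disease_sig, sig))
              for name, sig in drug_sigs.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical log fold changes (disease vs. normal, treated vs. untreated).
disease_signature = {"G1": 2.0, "G2": 1.5, "G3": 0.2, "G4": -1.0, "G5": -2.5}
drug_signatures = {
    "drug_A": {"G1": -1.8, "G2": -1.2, "G3": 0.1, "G4": 0.9, "G5": 2.2},
    "drug_B": {"G1": 1.0, "G2": 0.8, "G3": 0.0, "G4": -0.5, "G5": -1.1},
    "drug_C": {"G1": 0.3, "G2": -0.4, "G3": 0.5, "G4": 0.1, "G5": -0.2},
}

for drug, score in rank_drugs(disease_signature, drug_signatures):
    print(f"{drug}: {score:+.2f}")
```

In this toy data, drug_A opposes the disease signature gene by gene and is ranked first, whereas drug_B mimics it and is ranked last. A real analysis would also need significance estimates (for example, by permutation) and the experimental validation emphasized throughout this review.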


CLINICAL PHARMACOLOGY & THERAPEUTICS | VOLUME 99 NUMBER 3 | MARCH 2016 285


Figure 2 An illustration of big data approaches to identifying new targets.
