COMMENTARY
Biobanks and Electronic Medical Records:
Enabling Cost-Effective Research
Erica Bowton,1* Julie R. Field,1 Sunny Wang,1 Jonathan S. Schildcrout,2 Sara
L. Van Driest,3 Jessica T. Delaney,4 James Cowan,1 Peter Weeke,4 Jonathan D.
Mosley,4 Quinn S. Wells,4 Jason H. Karnes,4 Christian Shaffer,4 Josh F. Peterson,4,5 Joshua C. Denny,4,5 Dan M. Roden,4,6 Jill M. Pulley7
The use of electronic medical record data linked to biological specimens in health care settings is expected to enable cost-effective and rapid genomic analyses. Here, we present a model that highlights potential advantages for genomic discovery and describe the operational infrastructure that facilitated multiple simultaneous discovery efforts.
Traditional studies of drug efficacy and safety address the utility of a specific therapeutic intervention in a defined population. Such study designs present important challenges. Patient accrual can take months to years, and the potential exists for systematic exclusion of clinically complicated but relevant patient groups, such as the elderly, those with comorbid conditions, and those who routinely take multiple drugs. Patient cohorts can be inadequate in size for subgroup analysis, long-term follow-up is often not feasible, and results are limited to diseases for which the participants were originally assessed. Hypothesis-neutral cohorts such as the Framingham Heart Study and Multicenter AIDS Cohort Study (MACS) have overcome these challenges and provided the foundation for critical discoveries that continue to shape health care practice. However, large monetary, time, and infrastructure investments are required to establish and maintain these highly curated, large cohorts in which data collection is focused on hypotheses formulated at the outset.
An alternative to clinical studies with traditional patient cohorts has emerged in the last decade: the pairing of disease-agnostic biobank specimens with electronic medical records (EMRs). Here, we describe the Vanderbilt Electronic Systems for Pharmacogenomic Assessment (VESPA) Project, a large EMR- and biobank-based initiative for translational pharmacogenomic discoveries. We used data from BioVU, Vanderbilt University’s EMR-linked biorepository (which as of April 2014 contains more than 179,000 DNA samples) to perform a preliminary cost and time analysis for this approach and compared these costs and time investments with those of traditional cohort studies.

1Institute for Clinical and Translational Research, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 2Department of Biostatistics, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 3Department of Pediatrics, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 4Department of Medicine, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 5Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 6Department of Clinical Pharmacology, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA. 7Department of Medical Administration, School of Medicine, Vanderbilt University, Nashville, TN 37232, USA.
*Corresponding author. E-mail: erica.bowton@vanderbilt.edu
FASHIONING AN EFFICIENT PIPELINE
A key element to establishing an efficient and effective pipeline was the creation of an organizational structure to facilitate communication and management among research teams. Through VESPA, we developed strategies and methods for initiating, executing, and monitoring studies. Essential to this pipeline was the formation of teams for phenotyping and genetic data analysis. Phenotype teams were physician-led and composed of individuals with clinical and informatics expertise, including specific clinical domain content experts. These experts were responsible for cohort selection, algorithm development and refinement, and manual review when necessary. The genetic data-analysis team, which had expertise in laboratory techniques and genomics technologies, directed genotyping assays and interacted with each of the various phenotype teams. Project managers participated in study design, managed both phenotype development and genotyping throughput, and tracked timelines and milestones; this management tier was crucial for promoting multiple, simultaneous studies at different stages of development or execution.

Table 1. VESPA cohorts and phenotypes
Total number of genotyped subjects: 11,639
Total number of phenotypes analyzed: 28*
Median age: 61.6 years (range, newborn to 100+)
Observer-reported race: ~84% Caucasian, 12% African American
Subject phenotypic data: Majority had medical records with rich phenotypic data (median of 80 diagnosis codes and a median of 7.7 years of follow-up, from the first to last electronic clinical note)
Median cohort size†: 1123 (IQR, 492 to 4158)
Median case cohort size†: 133 (IQR, 84 to 569)
Total case counts‡: Ranged from 6 total cases (cerebrovascular event following clopidogrel therapy) to 1174 total cases (cough attributed to ACE inhibitor exposure)
Genomic data available:
• Genome-wide genotyping data were already available in 2500 subjects
• 9139 subjects were newly genotyped in both GWAS and drug-metabolism platforms
• An additional 693 subjects and 1167 subjects previously underwent candidate SNP genotyping for clopidogrel adverse events or warfarin stable dose, respectively (5, 6)
*Clopidogrel in cardiovascular disease, warfarin stable dose, early repolarization, vancomycin, C. difficile colitis, anthracycline cardiomyopathy, Guillain-Barre Syndrome, heart transplant, kidney transplant, clopidogrel in cerebrovascular disease, statin-related myopathy, heparin-induced thrombocytopenia, cardiovascular events during COX2 inhibition therapy, serious bleeding during warfarin therapy, amiodarone toxicity (lung, thyroid), chronic inflammatory polyneuropathy, rheumatic heart disease, cough during ACE inhibitor therapy, fluoroquinolones and tendonitis/tendon rupture, warfarin stable dose in children, metformin efficacy, metformin and cancer survival, bisphosphonates and atypical fracture/jaw osteonecrosis, Wolff-Parkinson-White, steroid-induced osteonecrosis, shellfish anaphylaxis, aspirin anaphylaxis, and Bell’s Palsy.
†Cases and controls.
‡Additional phenotype counts are shown in table S1.

www.ScienceTranslationalMedicine.org 30 April 2014 Vol 6 Issue 234 234cm3

Table 2. NIH-funded pharmacogenomic versus EMR-biobank studies*
Measure                               Traditional study                      BioVU study
Median cohort size (IQR)              623 (273 to 2095)                      1123 (492 to 4158)
Median reuse of cohort (IQR)          N/A                                    55% (34 to 98%)
Median cost (in U.S. dollars) (IQR)   $1,335,927 ($416,895 to $2,715,895)    $76,674 ($43,173 to $207,769)
Median cost/subject (IQR)             $1419 ($456 to $4672)                  $393 ($382 to $465)
Median years of study (IQR)           3 (2 to 5)                             0.25 (0.17 to 0.56)
Median cost/yr/subject (IQR)          $478 ($134 to $1216)                   $96 ($55 to $194)
*Funding data for traditional human pharmacogenomic studies were obtained by querying NIH RePORTER for all funded M-, R-, U-, P- and Z-type grants that contained the keywords “pharmacogenetic” or “pharmacogenomic” (query performed on 2 November 2012). The resulting grant abstracts were reviewed manually to ensure that they directly supported human pharmacogenomics research and to identify the number of subjects in the proposed study cohort. Excluded were studies with only in vitro or animal-model experiments, those directed solely at technology development, and those for which a defined study-cohort size or clinical trial protocol could not be determined. Dollars awarded and years of the award to date were summed for 115 unique NIH grants.
Cohort size (total cases plus controls), cost, and time-investment data for VESPA phenotypes were recorded internally. For each phenotype, time investment was calculated as the amount of time required to develop and implement phenotype algorithms, extract data, review records, and complete phenotype curation. Total cost of the VESPA study was calculated on the basis of the number of hours invested and the hourly rate of personnel required to complete the phenotyping plus the cost of genotyping the cohort.
The phenotype pipeline consisted of five key components: selection of a study phenotype, study design, phenotype-specific algorithm development, review, and implementation. Study hypotheses were divided into two categories: (i) validation studies, which replicated the association of clinical outcomes (for example, drug-response phenotypes) with previously identified genomic variants, and (ii) discovery studies, which were genome-wide investigations that sought to identify new gene-phenotype associations. A total of 28 phenotypes were selected for study (table S1).
Development of phenotype algorithm. Recent efforts have examined the utility of algorithms for determining phenotypes from EMRs (1–3). We used two approaches to construct phenotype algorithms: (i) fully automated, through the use of phenotype-selection algorithms that achieved high precision, and (ii) semi-automated, using algorithms to select a set of cases for manual review (usually rarer phenotypes). Data sets required to identify cases and controls accurately for each phenotype varied, but most included three data types: ICD-9 codes, medication regimens, and medical test results. Ten of the phenotypes also required the use of advanced informatics methods, such as natural language processing, to extract information stored in unstructured clinical text.
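The three structured data types named above can be combined in a simple rule-based classifier. The following Python sketch is illustrative only: the `PatientRecord` fields, the diagnosis code, the drug name, and the laboratory cutoff are hypothetical placeholders and are not the criteria used by the VESPA phenotype teams.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one de-identified EMR record."""
    icd9_codes: set = field(default_factory=set)      # diagnosis codes
    medications: set = field(default_factory=set)     # drug names, lowercase
    lab_results: dict = field(default_factory=dict)   # test name -> latest value


# Placeholder criteria chosen for illustration (assumptions, not VESPA definitions).
CASE_ICD9 = {"359.4"}
EXPOSURE_DRUGS = {"simvastatin"}
LAB_NAME = "creatine_kinase"
LAB_CUTOFF = 1000.0


def classify(record):
    """Return 'case', 'control', 'manual_review', or 'excluded'."""
    if not record.medications & EXPOSURE_DRUGS:
        return "excluded"                      # never exposed to the drug
    has_code = bool(record.icd9_codes & CASE_ICD9)
    value = record.lab_results.get(LAB_NAME)
    abnormal_lab = value is not None and value >= LAB_CUTOFF
    if has_code and abnormal_lab:
        return "case"                          # both lines of evidence agree
    if not has_code and not abnormal_lab:
        return "control"                       # exposed, no evidence of the outcome
    return "manual_review"                     # conflicting evidence
```

Routing conflicting records to manual review mirrors the semi-automated approach described in the text, in which an algorithm narrows the set of records that a reviewer must read.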
Pharmacogenomic phenotypes, in particular, rely heavily on temporal relationships (for example, administration of simvastatin before or concurrent with the onset of muscle pain). For our phenotype algorithms, we used event-sequence analyses to establish temporal relationships between drugs and phenotypes, which is a substantial challenge in bioinformatics (4). Both our case and control algorithms excluded records that contained specific clinical comorbidities. Algorithms were quality checked for precision by team members and iteratively refined to achieve positive predictive values (PPV) > 90%. For automated algorithms failing to meet this threshold, manual review was coupled with algorithms to validate that the included cases were true positives (5). Although manual review can be time-consuming and impractical for large cohorts, it is warranted when phenotypes are rare, complex, or involve temporal components too difficult to define electronically.
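The temporal constraint and the PPV check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the 30-day grace window after drug discontinuation, and the use of a reviewed sample to estimate PPV are hypothetical choices, not the VESPA event-sequence method.

```python
from datetime import date, timedelta


def event_follows_exposure(drug_start, drug_stop, event_date, grace_days=30):
    """True if the event falls on or after drug start and no later than
    grace_days after the drug was stopped (drug_stop=None means ongoing)."""
    if event_date < drug_start:
        return False
    if drug_stop is None:
        return True
    return event_date <= drug_stop + timedelta(days=grace_days)


def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP), estimated from manually reviewed algorithm-selected cases."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed records")
    return true_positives / reviewed


def needs_manual_review(true_positives, false_positives, threshold=0.90):
    """An algorithm whose PPV does not exceed the threshold is paired with manual review."""
    return positive_predictive_value(true_positives, false_positives) <= threshold
```

For example, an algorithm with 47 confirmed cases among 50 reviewed records has a PPV of 0.94 and would pass the 90% threshold, whereas one with 40 of 50 would be routed to manual review.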
Enabling overlap. A total of 11,639 subjects (Table 1) met phenotyping criteria for at least one of the 28 phenotypes investigated by the VESPA team. Cohorts included subjects with primarily drug-response phenotypes. Seven phenotypes were not explicitly designed as such but were intended to enable future investigation into potential drug-response phenotypes; for example, subjects exposed to immunosuppressant therapy after organ transplantation offer potential examination of a range of outcomes (drug levels, transplant rejection, lipid abnormalities, cancer, or infections). Across all phenotype cases and controls, 90% were reused as either a case or control for at least one other phenotype. This demonstrates the capability offered by EMR-based studies to reuse cases and controls across both rare and common phenotypes, each with different phenotyping processes. Two VESPA replication studies have established the validity of an EMR-based method for identifying pharmacogenomic associations: clopidogrel major adverse cardiac events and warfarin stable dose (5, 6).
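The reuse statistic quoted above (the share of subjects who serve as a case or control in more than one phenotype) can be computed from cohort membership lists. The Python sketch below assumes a hypothetical mapping from phenotype name to subject identifiers; it does not reproduce the VESPA data.

```python
from collections import Counter


def reuse_fraction(cohorts):
    """Fraction of distinct subjects that appear in two or more phenotype cohorts.

    cohorts: dict mapping a phenotype name to an iterable of subject IDs
    (cases and controls combined).
    """
    appearances = Counter()
    for subjects in cohorts.values():
        appearances.update(set(subjects))      # count each subject once per phenotype
    if not appearances:
        return 0.0
    reused = sum(1 for n in appearances.values() if n >= 2)
    return reused / len(appearances)
```

With three toy cohorts `{"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {3, 5}}`, subjects 2 and 3 are reused, so two of the five distinct subjects count toward the fraction.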
COST CALCULATIONS
We compared the estimated monetary cost
and resources required to generate VESPA
cohorts (excluding analysis) to cost estimates drawn from the analysis of data derived from the NIH RePORTER (7) for M-,
R-, U-, P- and Z-type grants that directly
supported discrete pharmacogenomics
studies in humans. Our analysis (Table 2,
legend) revealed striking savings with the
multiplexed VESPA approach (Table 2 and
Fig. 1). The VESPA experience resulted in 28 case-control sets with a median cost per study of $76,674 [interquartile range (IQR), $43,173 to $207,769] and a median cost per genotyped subject of $393 (IQR, $382 to $465). This includes the cost to phenotype cases and controls (personnel resources required to develop algorithms, implement algorithms, extract data, review records, and manage the pipeline) as well as the cost to genotype the cohort (consumables, processing, and quality control).
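The Table 2 legend describes the cost of a VESPA study as personnel hours multiplied by hourly rate, plus the cost of genotyping the cohort. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and any figures passed to them are the caller's own inputs rather than VESPA data.

```python
def study_cost(personnel_hours, hourly_rate, n_genotyped, genotyping_cost_per_subject):
    """Phenotyping labor plus genotyping, in U.S. dollars."""
    phenotyping = personnel_hours * hourly_rate
    genotyping = n_genotyped * genotyping_cost_per_subject
    return phenotyping + genotyping


def cost_per_subject(total_cost, cohort_size):
    """Total study cost divided by the number of subjects in the cohort."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return total_cost / cohort_size
```

As a made-up example, 200 personnel hours at $50 per hour plus 100 subjects genotyped at $300 each gives a total of $40,000, or $400 per subject for a 100-subject cohort.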
The median funding amount for pharmacogenomics-related NIH grants with defined cohort sizes (across their lifetimes) is $1,335,927, with a median cost per genotyped subject of $1419. Notably, the low median cost per VESPA study ($76,674) was enabled by the reuse of subjects as cases and controls across multiple studies; had studies been conducted in isolation with no overlap among cases and controls, the estimated median cost per study would have been $438,473. Further highlighting the efficiency of biobank studies, VESPA studies took a median of 3 months to identify subjects with the target phenotypes, whereas the NIH grants reviewed were awarded for a median period of 3 years. Indeed, traditional consented recruitment models, for example, for common cancers, can take up to 20 years to generate sufficient cohort sizes (8). VESPA studies did not sacrifice cohort size or power as a consequence of reduced cost; in fact, the median cohort size of VESPA phenotypes was 1123, which is almost twice that of NIH-funded pharmacogenomics studies, which had a median cohort size of 623. Compared with a median cost per subject per year of $478 in a traditional cohort study, the median cost per subject per year in a VESPA study was $96.

Fig. 1. Time is money. Comparison of traditional NIH-funded pharmacogenomic studies versus EMR/biobank studies (BioVU). (Left) Median cost of study per subject (U.S. dollars). (Right) Median length of study (years).
CREDIT: V. ALTOUNIAN/SCIENCE TRANSLATIONAL MEDICINE
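The medians reported in Table 2 can be turned into fold differences with simple division. The Python sketch below uses only the published medians; the dictionary layout and the `fold_difference` helper are hypothetical. Note that a ratio of medians is only a rough summary and is not the same as a median of per-study ratios.

```python
# Medians reported in Table 2 (traditional NIH-funded studies vs. BioVU studies).
TRADITIONAL = {"cohort_size": 623, "cost": 1_335_927, "cost_per_subject": 1419,
               "years": 3, "cost_per_subject_year": 478}
BIOVU = {"cohort_size": 1123, "cost": 76_674, "cost_per_subject": 393,
         "years": 0.25, "cost_per_subject_year": 96}


def fold_difference(measure):
    """Ratio of the traditional median to the BioVU median for one measure."""
    return TRADITIONAL[measure] / BIOVU[measure]


if __name__ == "__main__":
    for name in TRADITIONAL:
        print(f"{name}: {fold_difference(name):.1f}x")
```

On these medians, the traditional cost per study is roughly 17 times the BioVU figure, the study length 12 times, and the cost per subject per year about 5 times, while the BioVU median cohort is about 1.8 times larger.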
COST-SAVING INFRASTRUCTURE
There are potential advantages of discovery efforts in an EMR environment, especially when coupled to large genomic resources. First, EMRs contain large patient populations without disease-based exclusions (8). As demonstrated by the Electronic Medical Records and Genomics (eMERGE) network, a U.S. national consortium of existing DNA biorepositories linked to EMRs, these data can be used to rapidly create large, inclusive patient cohorts that foster investigation of variability in physiological traits and disease susceptibility (9–11). Second, the EMR approach offers substantial efficiencies owing to the ability to examine multiple phenotypes by using a single cohort of genotyped samples, an idea first championed on a large scale by the Wellcome Trust Case Control Consortium (12). Third, biobanks enable access not only to cases but also to large numbers of controls, potentially providing additional power when using a design based on multiple controls per case. Fourth, because EMR-based biobank research is coupled to data routinely obtained in clinical care, the efficiencies of reuse suggest that the approach will prove to be cost-effective. In addition, the increasing use of EMRs [incentivized by the U.S. Health Information Technology for Economic and Clinical Health (HITECH) Act] and the increasing number of EMR-linked biobanks worldwide offer cost-effective resources, not only for discovery but also for the replication of genomic associations across nations and ancestries.
BioVU, the Vanderbilt DNA databank, is an example of an EMR-linked biorepository and a component of eMERGE (13, 14). It is important to note that the total costs described here for the VESPA study are marginal costs: they do not include costs associated with the design, set-up, and building of BioVU or with establishing and maintaining the clinical electronic medical record. Thus, the substantial cost savings we observed were facilitated by resources already in place. Development of BioVU, an evolving resource with longitudinal health information, was and is institutionally supported, including investment in EMRs and creation of de-identified images of the EMRs. We highlight the cost savings enabled by BioVU to demonstrate the considerable return on investment afforded by the development of an EMR-based biobank.
As we have demonstrated, EMR-based biobanks can be cost-effective tools for establishing disease or drug associations in a real-world community health care setting. We provide data here showing that an EMR-linked biobank model such as BioVU enables cost and time efficiency in multiple ways: (i) the use of biological samples that have already been collected and would otherwise be discarded; (ii) an economy of scale obtained by central processing of these samples; (iii) reuse of the same sample for multiple studies without incremental collection, extraction, or processing costs; (iv) centralized de-identification and phenotype annotation of the EMR; and (v) reuse of data, based on program requirements for redeposit of genetic data for all studies. This efficiency is reflected in the substantial cost savings over traditional methods and is further amplified by the ability to examine multiple phenotypes by using a single cohort of genotyped samples (12).
Growth in EMR adoption fostered by the HITECH Act provides the foundation to efficiently expand EMR-based research and is not limited to studies within a single medical center. As evidenced by the robust analyses enabled by the eMERGE network (15–17), the utility of EMR-derived data linked to biological specimens is amplified by pooling analyses across networks, leading to an increase in sample sizes and minimization of biases (18). The eMERGE network has demonstrated successful sharing of more than 18 phenotype algorithms across sites, with a median of three external validations per algorithm. Performance of case and control algorithms in development-site evaluations was similar to that in external-site evaluations: Median case PPV was 97% for host evaluations, and median PPV for external-site evaluations was nearly identical at 95.5%, establishing portability of electronic definitions regardless of the EMR system and interoperability (http://phekb.org).
CHALLENGES AND LIMITATIONS
Data reuse. When combining data from multiple studies in a redeposit design such as that of BioVU, a major challenge is merging genotyping data obtained from different genotyping platforms. This presents challenges for genetic analyses, including the selection of variants for analysis and controlling for batch and platform effects. However, these challenges are not unlike those associated with large genome-wide association study (GWAS) meta-analyses (18–20). Indeed, a key analytical approach for VESPA studies has been to use GWASs, similar to the approach of many traditional pharmacogenomic studies that rely on observational cohorts, subject enrollment, or randomized controlled trials. Although the GWAS method has been highly successful in identifying new loci associated with disease susceptibility, it has also been criticized because the effect sizes of the identified loci are often small, and thus, very large cohorts are needed to identify and validate genomic variations. On the other hand, although GWAS for drug-response traits is less well explored, multiple studies support the hypothesis that genetic associations can be identified even with small cohort sizes (21–23). Unlike most disease-susceptibility studies, the effect sizes in pharmacogenomics can be large enough to consider for implementation in clinical care. As such, biobanks may become a crucial tool for facilitating pharmacogenomics research. Although we primarily focus on drug-response phenotypes, the methods described here can be used for a wide range of EMR-derived phenotypes or even to inform phenome-wide analyses (24).
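The variant-selection problem raised at the start of this subsection can be illustrated with a simple set intersection. The Python sketch below shows one generic way to restrict an analysis to variants assayed on every platform; it is an assumption for illustration, not a description of the VESPA analysis, and the platform names and variant identifiers are hypothetical.

```python
def shared_variants(platform_variants):
    """Return the variant IDs that were genotyped on every platform.

    platform_variants: dict mapping a platform name to an iterable of variant IDs.
    """
    variant_sets = [set(ids) for ids in platform_variants.values()]
    if not variant_sets:
        return set()
    return set.intersection(*variant_sets)
```

Intersecting variant lists is only a first step; batch and platform effects would still need to be addressed in the statistical model.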
EMR biases. Despite their numerous benefits related to time and efficiency, EMR-linked biobank approaches have limitations (table S2). One fundamental limitation is the potential loss to follow-up or the absence of clinical information pertaining to a patient after a given point in time. In the specific case of BioVU, de-identification of all subjects formally eliminates the ability to recontact patients. Moreover, the data are collected as a result of a provider’s determination of need based on clinical relevance at the time and may include only those medical encounters within one given medical center. Thus, studies are limited to, and potentially biased by, data that are available in the EMRs. In addition, it can be challenging to accurately identify cases and controls, particularly for complex phenotypes, and exposure misclassification or selection effects can lead to bias in the estimation of an interaction effect (20, 25).
In our studies, cohorts were defined by an exposure to a medication, a procedure, or patient characteristics at an index point in time; determining cases and controls by temporally constrained definitions can limit cohort populations because of the inherent difficulties in establishing temporality and event sequence in EMRs (26). Moreover, EMR-based data do not inherently capture the cost of a procedure or clinical event. However, an EMR system could be expanded and linked to external data sources, including cost and systems-delivery data, enabling such studies and affording additional opportunities for linking to research-derived data.
Politics. The trend of reduced U.S. federal support for research (27) jeopardizes higher-priced scientific explorations, even those that have proven fruitful for science and health. The current funding climate, rising costs of health care R&D, and stricter payer requirements should make resource reuse increasingly important for advancing clinical and translational research as well as for reducing related health care costs.
The financial efficiencies we observed for the EMR approach make it a compelling complement to traditional cohort designs.
SUPPLEMENTARY MATERIALS
www.sciencetranslationalmedicine.org/cgi/content/full/6/234/234cm3/DC1
Acknowledgments
Funding
Author contributions
Table S1. Advantages and disadvantages of the EMR-based biobank approach.
Table S2. Summary of phenotypes.
References (28–40)

REFERENCES AND NOTES
1. R. J. Carroll, A. E. Eyler, J. C. Denny, Naïve electronic health record phenotype identification for rheumatoid arthritis. AMIA Annu. Symp. Proc. 2011, 189–196 (2011).
2. J. C. Denny, J. F. Peterson, N. N. Choma, H. Xu, R. A. Miller, L. Bastarache, N. B. Peterson, Extracting timing and status descriptors for colonoscopy testing from electronic medical records. J. Am. Med. Inform. Assoc. 17, 383–388 (2010).
3. R. J. Carroll, W. K. Thompson, A. E. Eyler, A. M. Mandelin, T. Cai, R. M. Zink, J. A. Pacheco, C. S. Boomershine, T. A. Lasko, H. Xu, E. W. Karlson, R. G. Perez, V. S. Gainer, S. N. Murphy, E. M. Ruderman, R. M. Pope, R. M. Plenge, A. N. Kho, K. P. Liao, J. C. Denny, Portability of an algorithm to identify rheumatoid arthritis in electronic health records. J. Am. Med. Inform. Assoc. 19 (e1), e162–e169 (2012).
4. W. Sun, A. Rumshisky, O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 challenge. J. Am. Med. Inform. Assoc. 20, 806–813 (2013).
5. J. T. Delaney, A. H. Ramirez, E. Bowton, J. M. Pulley, M. A. Basford, J. S. Schildcrout, Y. Shi, R. Zink, M. Oetjens, H. Xu, J. H. Cleator, E. Jahangir, M. D. Ritchie, D. R. Masys, D. M. Roden, D. C. Crawford, J. C. Denny, Predicting clopidogrel response using DNA samples linked to an electronic health record. Clin. Pharmacol. Ther. 91, 257–263 (2012).
6. A. H. Ramirez, Y. Shi, J. S. Schildcrout, J. T. Delaney, H. Xu, M. T. Oetjens, R. L. Zuvich, M. A. Basford, E. Bowton, M. Jiang, P. Speltz, R. Zink, J. Cowan, J. M. Pulley, M. D. Ritchie, D. R. Masys, D. M. Roden, D. C. Crawford, J. C. Denny, Predicting warfarin dosage in European-Americans and African-Americans using DNA samples linked to an electronic health record. Pharmacogenomics 13, 407–418 (2012).
7. RePORT query form. http://projectreporter.nih.gov/reporter.cfm.
8. P. R. Burton, A. L. Hansell, I. Fortier, T. A. Manolio, M. J. Khoury, J. Little, P. Elliott, Size matters: Just how big is BIG?: Quantifying realistic sample size requirements for human genome epidemiology. Int. J. Epidemiol. 38, 263–273 (2009).
9. A. N. Kho, J. A. Pacheco, P. L. Peissig, L. Rasmussen, K. M. Newton, N. Weston, P. K. Crane, J. Pathak, C. G. Chute, S. J. Bielinski, I. J. Kullo, R. Li, T. A. Manolio, R. L. Chisholm, J. C. Denny, Electronic medical records for genetic research: Results of the eMERGE consortium. Sci. Transl. Med. 3, 79re1 (2011).
10. C. A. McCarty, R. L. Chisholm, C. G. Chute, I. J. Kullo, G. P. Jarvik, E. B. Larson, R. Li, D. R. Masys, M. D. Ritchie, D. M. Roden, J. P. Struewing, W. A. Wolf, eMERGE Team, The eMERGE network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genomics 4, 13 (2011).
11. O. Gottesman, H. Kuivaniemi, G. Tromp, W. A. Faucett, R. Li, T. A. Manolio, S. C. Sanderson, J. Kannry, R. Zinberg, M. A. Basford, M. Brilliant, D. J. Carey, R. L. Chisholm, C. G. Chute, J. J. Connolly, D. Crosslin, J. C. Denny, C. J. Gallego, J. L. Haines, H. Hakonarson, J. Harley, G. P. Jarvik, I. Kohane, I. J. Kullo, E. B. Larson, C. McCarty, M. D. Ritchie, D. M. Roden, M. E. Smith, E. P. Böttinger, M. S. Williams, eMERGE Network, The electronic medical records and genomics (eMERGE) network: past, present, and future. Genet. Med. 15, 761–771 (2013).
12. Wellcome Trust Case Control Consortium, Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature 447, 661–678 (2007).
13. D. M. Roden, J. M. Pulley, M. A. Basford, G. R. Bernard, E. W. Clayton, J. R. Balser, D. R. Masys, Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin. Pharmacol. Ther. 84, 362–369 (2008).
14. T. L. McGregor, S. L. Van Driest, K. B. Brothers, E. A. Bowton, L. J. Muglia, D. M. Roden, Inclusion of pediatric samples in an opt-out biorepository linking DNA to de-identified medical records: Pediatric BioVU. Clin. Pharmacol. Ther. 93, 204–211 (2013).
15. M. D. Ritchie, J. C. Denny, R. L. Zuvich, D. C. Crawford, J. S. Schildcrout, L. Bastarache, A. H. Ramirez, J. D. Mosley, J. M. Pulley, M. A. Basford, Y. Bradford, L. V. Rasmussen, J. Pathak, C. G. Chute, I. J. Kullo, C. A. McCarty, R. L. Chisholm, A. N. Kho, C. S. Carlson, E. B. Larson, G. P. Jarvik, N. Sotoodehnia, T. A. Manolio, R. Li, D. R. Masys, J. L. Haines, D. M. Roden, Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group, Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation 127, 1377–1385 (2013).
16. I. J. Kullo, K. Ding, K. Shameer, C. A. McCarty, G. P. Jarvik, J. C. Denny, M. D. Ritchie, Z. Ye, D. R. Crosslin, R. L. Chisholm, T. A. Manolio, C. G. Chute, Complement receptor 1 gene variants are associated with erythrocyte sedimentation rate. Am. J. Hum. Genet. 89, 131–138 (2011).
17. J. C. Denny, M. D. Ritchie, D. C. Crawford, J. S. Schildcrout, A. H. Ramirez, J. M. Pulley, M. A. Basford, D. R. Masys, J. L. Haines, D. M. Roden, Identification of genomic predictors of atrioventricular conduction: Using electronic medical records as a tool for genome science. Circulation 122, 2016–2021 (2010).
18. J. P. A. Ioannidis, T. A. Trikalinos, M. J. Khoury, Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am. J. Epidemiol. 164, 609–614 (2006).
19. E. Evangelou, J. P. A. Ioannidis, Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379–389 (2013).
20. M. I. McCarthy, G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. A. Ioannidis, J. N. Hirschhorn, Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356–369 (2008).
21. G. M. Cooper, J. A. Johnson, T. Y. Langaee, H. Feng, I. B. Stanaway, U. I. Schwarz, M. D. Ritchie, C. M. Stein, D. M. Roden, J. D. Smith, D. L. Veenstra, A. E. Rettie, M. J. Rieder, A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112, 1022–1027 (2008).
22. E. Link, S. Parish, J. Armitage, L. Bowman, S. Heath, F. Matsuda, I. Gut, M. Lathrop, R. Collins, SEARCH Collaborative Group, SLCO1B1 variants and statin-induced myopathy—A genome-wide study. N. Engl. J. Med. 359, 789–799 (2008).
23. S. Mallal, E. Phillips, G. Carosi, J.-M. Molina, C. Workman, J. Tomazic, E. Jägel-Guedes, S. Rugina, O. Kozyrev, J. F. Cid, P. Hay, D. Nolan, S. Hughes, A. Hughes, S. Ryan, N. Fitch, D. Thorborn, A. Benbow, PREDICT-1 Study Team, HLA-B*5701 screening for hypersensitivity to abacavir. N. Engl. J. Med. 358, 568–579 (2008).
24. J. C. Denny, L. Bastarache, M. D. Ritchie, R. J. Carroll, R. Zink, J. D. Mosley, J. R. Field, J. M. Pulley, A. H. Ramirez, E. Bowton, M. A. Basford, D. S. Carrell, P. L. Peissig, A. N. Kho, J. A. Pacheco, L. V. Rasmussen, D. R. Crosslin, P. K. Crane, J. Pathak, S. J. Bielinski, S. A. Pendergrass, H. Xu, L. A. Hindorff, R. Li, T. A. Manolio, C. G. Chute, R. L. Chisholm, E. B. Larson, G. P. Jarvik, M. H. Brilliant, C. A. McCarty, I. J. Kullo, J. L. Haines, D. C. Crawford, D. R. Masys, D. M. Roden, Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
25. M. Garcia-Closas, N. Rothman, J. Lubin, Misclassification in case-control studies of gene-environment interactions: Assessment of bias and sample size. Cancer Epidemiol. Biomarkers Prev. 8, 1043–1050 (1999).
26. M. Jiang, J. C. Denny, B. Tang, H. Cao, H. Xu, Extracting semantic lexicons from discharge summaries using machine learning and the C-Value method. AMIA Annu. Symp. Proc. 2012, 409–416 (2012).
27. The impact of sequestration on NIH (2012). www.aamc.org/research/adhocgp/aamcimpactofsequestrationonnih.pdf
28. F. S. Collins, The case for a US prospective cohort study of genes and environment. Nature 429, 475–477 (2004).
29. I. S. Kohane, Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12, 417–428 (2011).
30. G. E. Henderson, R. J. Cadigan, T. P. Edwards, I. Conlon, A. G. Nelson, J. P. Evans, A. M. Davis, C. Zimmer, B. J. Weiner, Characterizing biobank organizations in the U.S.: Results from a national survey. Genome Med. 5, 3 (2013).
31. W. Ollier, T. Sprosen, T. Peakman, UK Biobank: From concept to reality. Pharmacogenomics 6, 639–646 (2005).
32. L. J. Palmer, UK Biobank: bank on it. Lancet 369, 1980–1982 (2007).
33. Z. Chen, J. Chen, R. Collins, Y. Guo, R. Peto, F. Wu, L. Li, China Kadoorie Biobank (CKB) collaborative group, China Kadoorie Biobank of 0.5 million people: Survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
34. H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waitman, J. C. Denny, MedEx: A medication information extraction system for clinical narratives. J. Am. Med. Inform. Assoc. 17, 19–24 (2010).
35. S. B. Trinidad, S. M. Fullerton, J. M. Bares, G. P. Jarvik, E. B. Larson, W. Burke, Genomic research and wide data sharing: Views of prospective participants. Genet. Med. 12, 486–495 (2010).
36. K. B. Brothers, E. W. Clayton, Parental perspectives on a pediatric human non-subjects biobank. AJOB Prim. Res. 3, 21–29 (2012).
37. J. M. Pulley, M. M. Brace, G. R. Bernard, D. R. Masys, Attitudes and perceptions of patients towards methods of establishing a DNA biobank. Cell Tissue Bank. 9, 55–65 (2008).
38. C. M. Simon, E. Newbury, J. L’heureux, Protecting participants, promoting progress: Public perspectives on community advisory boards (CABs) in biobanking. J. Empir. Res. Hum. Res. Ethics 6, 19–30 (2011).
39. J. Murphy, J. Scott, D. Kaufman, G. Geller, L. LeRoy, K. Hudson, Public perspectives on informed consent for biobanking. Am. J. Public Health 99, 2128–2134 (2009).
40. C. T. Scott, T. Caulfield, E. Borgelt, J. Illes, Personal medicine—The new banking crisis. Nat. Biotechnol. 30, 141–147 (2012).

Competing interests: The authors declare that they have no competing interests.
10.1126/scitranslmed.3008604
Citation: E. Bowton, J. R. Field, S. Wang, J. S. Schildcrout, S. L. Van Driest, J. T. Delaney, J. Cowan, P. Weeke, J. D. Mosley, Q. S. Wells, J. H. Karnes, C. Shaffer, J. F. Peterson, J. C. Denny, D. M. Roden, J. M. Pulley, Biobanks and Electronic Medical Records: Enabling Cost-Effective Research. Sci. Transl. Med. 6, 234cm3 (2014).