NIH Public Access
Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
NIH-PA Author Manuscript
Published in final edited form as:
J Biomed Inform. 2010 December ; 43(6): 914–923. doi:10.1016/j.jbi.2010.07.011.
An analytical approach to characterize morbidity profile
dissimilarity between distinct cohorts using electronic medical
records
Jonathan S. Schildcrouta,b, Melissa Basfordc, Jill Pulleyd, Daniel R. Masysd,e, Dan M.
Rodend,f, Deede Wangd, Christopher G. Chuteg, Iftikhar J. Kulloh, David Carrelli, Peggy
Peissigj, Abel Khok, and Joshua C. Dennyd,e
aDepartment of Biostatistics, Vanderbilt University School of Medicine
bDepartment
cVanderbilt
of Anesthesiology, Vanderbilt University School of Medicine
Institute for Clinical and Translational Research, Vanderbilt University School of
NIH-PA Author Manuscript
Medicine
dDepartment
of Medicine, Vanderbilt University School of Medicine
eDepartment
of Biomedical Informatics, Vanderbilt University School of Medicine
fDepartment
of Pharmacology, Vanderbilt University School of Medicine
gDivision
of Biostatistics and Informatics, Mayo Clinic
hDivision
of Cardiovascular Diseases, Mayo Clinic
iCenter
for Health Studies, Group Health Cooperative
jBiomedical
Informatics Research Center, Marshfield Clinic Research Foundation
kDepartment
of Internal Medicine, Northwestern University School of Medicine
Abstract
NIH-PA Author Manuscript
We describe a two-stage analytical approach for characterizing morbidity profile dissimilarity
among patient cohorts using electronic medical records. We capture morbidities using the
International Statistical Classification of Diseases and Related Health Problems (ICD-9) codes. In
the first stage of the approach separate logistic regression analyses for ICD-9 sections (e.g.,
“hypertensive disease” or “appendicitis”) are conducted, and the odds ratios that describe adjusted
differences in prevalence between two cohorts are displayed graphically. In the second stage, the
results from ICD-9 section analyses are combined into a general morbidity dissimilarity index
(MDI). For illustration, we examine nine cohorts of patients representing six phenotypes (or
controls) derived from five institutions, each a participant in the electronic MEdical REcords and
GEnomics (eMERGE) network. The phenotypes studied include type II diabetes and type II
diabetes controls, peripheral arterial disease and peripheral arterial disease controls, normal
cardiac conduction as measures by electrocardiography, and senile cataracts.
Corresponding author: Jonathan S. Schildcrout, 1161 21st Ave South, S-2323 Medical Center North, Vanderbilt University School of
Medicine, Nashville, TN 37232-2156, Phone: 615-343-5432, Fax: 615-343-4924, jonathan.schildcrout@vanderbilt.edu.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our
customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of
the resulting proof before it is published in its final citable form. Please note that during the production process errors may be
discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Schildcrout et al.
Page 2
Keywords
NIH-PA Author Manuscript
Electronic medical records; ICD-9; dissimilarity index; comorbidity index; population
comparison; morbidity dissimilarity index
A. Introduction
Electronic Medical Records (EMR) have been shown to offer the potential to improve the
quality of clinical care, reduce costs, and improve guideline adherence. While researchers
have also used EMRs for clinical research(1)(2), for medical outcomes research (3), to
categorize rare findings (4), and to identify patients with various conditions and assess
eligibility for clinical trials (5)(6), there has been little exploration of using DNA biobanks
linked to EMRs for genomic studies. Given the powerful potential for substantial cost and
time efficiency (7), there is increasing interest in EMRs as a potential way to identify
cohorts of patients and associated DNA samples to discover genetic associations for
common complex diseases and the genetic influence on response to therapy through
genome-wide association studies (GWAS) (8).
NIH-PA Author Manuscript
Pooling data from multiple EMRs or sites can improve power and generalizability,
especially when investigating a less prevalent disease phenotype. However, it introduces
analytical considerations related to cohort heterogeneity. If genotype-phenotype associations
are highly variable across the sites, caution should be applied when combining results since
a single summary measure of the overall association may mask important site-by-genotype
interactions. When a single association measure is of interest, meta-analytic approaches such
as the random effects model of DerSimonian and Laird (9) and its extensions can be applied.
In this model, the overall association (e.g., a log odds ratio), θ, is a weighted average of the
site-specific associations, θi where i = 1,2,…,I denotes site. The variance of θ, Var(θ), is
is the variance at side i and τ2 is a measure
given by
of variability among θi across the sites. The value τ2 can be thought of as a heterogeneity
penalty that increases Var(θ) and can lead to diminished power to detect associations. If
costs associated with ascertaining genotypes and/or phenotype are high, being able to
anticipate analytical challenges and/or loss of power due to cohort heterogeneity is crucial.
Towards that end, we propose a two-stage analysis protocol that uses readily available
patient information to proactively examine the extent to which selected cohorts are
dissimilar over a (broad or narrow) range of morbidities.
NIH-PA Author Manuscript
Due to their wide availability, standard format, and relatively consistent utilization, we
capture morbidities with the International Statistical Classification of Diseases and Related
Health Problems codes (ICD-9). However, the proposed approach is general and can be
applied to other morbidity definitions. At the first stage, the protocol estimates demographic
adjusted measures of cohort morbidity differences across individual ICD-9 sections using
logistic regression and displays odds ratios and associated 95% confidence intervals
graphically. At the second stage, the section-specific differences estimated at the first stage
are combined into a single, general measure of cohort dissimilarity. We call this the
“morbidity dissimilarity index” (MDI), and it can be thought of as a distance between the
morbidity profiles of two cohorts. Results from the two stages of analyses are
complementary. Stage 2 results permit broad summarization of dissimilarity over a range of
morbidities, and stage 1 results can be used to examine observed differences at a finer level.
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 3
B. Background
Comorbidity summarization
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Comorbidity information readily available in EMRs can be a valuable resource for assessing
cohort dissimilarity. Individual level indices that can be derived from EMR such as the
Charlson comorbidity index (10), Elixhauser index (11)(12), APACHE score (13), and
functional comorbidity index (14) capture health outcomes related risk for a given a set of
features. While these measures can be used to compare individuals’ risks, they do not
specifically measure similarity. For example two individuals with equal risk scores may
differ on the items that comprise the score. An information theoretic scoring approach has
been proposed (15) for measuring individual case similarities based on patient-specific
features. From this, one could calculate a measure of cohort similarity with, say, an intraclass correlation coefficient that captures the relative contributions of between- and withincohort variation in the scores. However, by first calculating patient-level scores and then
summarizing the distribution of these scores, we lose all information about the relatedness or
correlation among the components of the score. As we will show, proper acknowledgment
of morbidity correlations is crucial for capturing cohort morbidity similarity. Principal
Components Analysis (PCA) is commonly used to identify population (genetic) structure
(16)(17)(18)(19) and can therefore be used to capture cohort morbidity profile dissimilarity
like we do. That is, one could use PCA to reduce the dimensionality of the morbidity profile
into, say, a single principal component. A distance metric between the cohorts could then be
derived from the morbidity-specific coefficients. However, the morbidity-specific
coefficients have conditional interpretations, and therefore in the presence of correlated
morbidities marginal differences in prevalence between cohorts will be masked. In our twostage approach, the marginal differences are of interest and are captured and examined
explicitly. They are then combined into a single measure of dissimilarity while properly
accounting for morbidity correlations.
Electronic MEdical Records and GEnomics (eMERGE) Network
NIH-PA Author Manuscript
This work is motivated by ongoing GWAS studies performed as part of the electronic
MEdical Records and GEnomics (eMERGE) network, which seeks to use EMR-linked DNA
biobanks as their source of cases and controls. The eMERGE network is a consortium of
five medical centers, Group Health Cooperative (GHC, Seattle WA), Marshfield Clinic
(MAR, Marshfield, WI), Mayo Clinic (MAY, Rochester, MN), Northwestern University
(NU, Chicago, IL), and Vanderbilt University (VU, Nashville, TN). Each eMERGE member
has established a DNA biobank linked to an EMR for clinical data (20). The consortium is
funded by the National Human Genome Research Institute with additional funding by the
National Institute of General Medical Sciences to develop the necessary tools and
techniques to perform GWAS in participants with phenotypes and environmental exposures
derived from EMRs.
The eMERGE sites are investigating seven primary disease phenotypes by GWAS, and a
growing number of secondary phenotypes that seek to reuse GWAS data derived from the
primary phenotypes. Each site has created and refined electronic phenotype selection
algorithms to identify cases and controls using information derived from the EMR. The
algorithms use combinations of administrative billing codes, laboratory and medication data,
and string queries and natural language processing techniques applied to unstructured, freetext clinical narratives. Given the typically small effect size of individual SNP-phenotype
associations, thousands of cases and controls are typically required to ensure adequate
statistical power for successful GWAS (21). Thus, several eMERGE phenotypes require
pooling cases and controls across the network.
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 4
B. Methods
Populations examined
NIH-PA Author Manuscript
Across the eMERGE network, selection algorithms were developed for type 2 diabetes (VU,
NU), cardiac conduction (VU, NU), senile cataracts (MAR, GHC), senile dementia (MAR,
GHC), and peripheral arterial disease (MAY). Each phenotype selection algorithm was
iteratively developed and evaluated by clinician reviewers or chart abstractors at each site
until they performed well enough to obtain a positive predictive value greater than or equal
to 95%. The details of these algorithms are posted on http://gwas.net; their implementation
and rationale will be presented in subsequent publications. Because EMR systems and
structures differ across sites within the eMERGE network, the algorithms implemented at
multiple sites were adapted to accommodate each local environment.
As an example for our analysis protocol, we examined nine site-phenotype cohorts defined
by these algorithms: VU type II diabetes (VU-T2D), VU type II diabetes controls (VUCON), VU patients with normal cardiac conduction as measured by the QRS duration (VUQRS), NU type II diabetes (NU-T2D), NU type II diabetes controls (NU-CON), GHC senile
cataracts (GHC-CAT), MAR senile cataracts (MAR-CAT), MAY peripheral arterial disease
(MAY-PAD), and MAY peripheral arterial disease controls (MAY-CON).
NIH-PA Author Manuscript
Selection of ICD-9 billing codes for analysis
While billing codes are imperfect measures of disease status, they are useful for research
involving EMR because they cover the broad range of diseases and diagnoses, they are
commonly used in large scale research to define populations, they are utilized consistently
across sites, and they are easily extracted from most EMR systems. Current Procedural
Technology (CPT) or ICD-9 procedural codes were not considered because they are
dependent on the procedure being performed at the hospital of interest, and the receipt of a
procedure is influenced by external factors (e.g., insurance, patient preference, and life
expectancy), making them less useful in understanding disease status for many phenotypes.
NLP approaches were not applied because these capabilities were not available to all sites in
the eMERGE network.
All available inpatient and outpatient ICD-9 codes were selected for each subject and
compared against a list of available ICD-9 codes derived from the Unified Medical
Language System (UMLS), version 2009AA (22). Invalid ICD-9 codes, E codes (external
causes of injury) V codes (screening codes and other supplementary factors influencing
health), procedure codes (i.e., 2-digit ICD9 codes), and signs and symptoms (780–799) were
excluded from analyses.
NIH-PA Author Manuscript
Data Preparation
Adequate EMR data were available for differing lengths of time across eMERGE network
sites. For consistency of comparison, the study was limited to the years 2001 to 2007. Fivedigit ICD-9 codes were available on all patients, however, coding at this level is highly
idiosyncratic, thereby precluding meaningful comparative analyses of the cohorts. On the
other hand, regression analyses on codes aggregated to the level of ICD-9 chapters (e.g.,
“Diseases of the digestive system”, n=16) yield coarse and insensitive characterizations of
patient co-morbidity profiles. Therefore, to identify co-morbidities, we use ICD-9 categories
(3-digit codes, n=904) which we believe represent a level of coding that avoids the major
pitfalls of five-digit codes while maintaining sufficient detail to allow meaningful
comparisons. For a category code to be considered present in an individual, it must have
been observed on more than one occasion. Our rationale for this cut off was 1) it favors
chronic conditions over temporary acute conditions, and 2) it reduces potential for noise
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 5
NIH-PA Author Manuscript
induced by singular coding errors, as has been found for some chronic conditions in prior
ICD-9 analyses (23)(24). While some real co-morbidities might be missed, the approach
provides more confidence that the ones observed were indeed true positives. Section (e.g.,
“Noninfectious enteritis and colitis”, n=110) and chapter level co-morbidities were
considered to be present if at least one category code underneath them in the ICD-9
taxonomy was present. We only considered adult patients (age≥18 years) who were
observed for at least three years.
Analysis strategy
Analyses of ICD-9 categories were considered; however, we found that many important
ones did not provide sufficient counts to permit analyses. We base analyses on the 66 of 110
ICD-9 sections that were observed in five percent of patients in at least one cohort and in
one-tenth of a percent of patients in all cohorts. Had we not imposed the ‘observed category
codes twice’ rule, our analyses would have been based on 74 ICD-9 sections.
Our analysis protocol involves two stages. In the first stage, we use logistic regression to
capture the adjusted log odds ratio of observing each ICD-9 section between cohorts, and in
the second stage we summarize section-specific results within and across ICD-9 chapters to
ascertain chapter-specific measures and a single overall measure of cohort dissimilarity.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Stage 1: For each ICD-9 section s in 1, 2, … S, (S=66 in this analysis), we fit a logistic
regression model that included, as predictors, the cohort identification variable (i.e., MAYPAD, NU-T2D, etc.) and covariates: gender, race (white, black, other, and unknown), age,
and length of patient follow-up. The demographic covariate adjustments were crucial since
multi-site studies include these covariates in their statistical analysis models, and our
objective is to characterize cohort morbidity dissimilarity beyond what common adjustment
covariates could explain. To reduce re-identification risk, birthdays were truncated to the
birth years, and birth years were truncated at 1928. For the sake of modeling, age was
represented with two variables: an indicator variable for being born prior to 1928 and then a
continuous age variable for those born in or after 1928. The latter age variable and the length
of follow-up variable were fit with flexible restricted cubic spline functions with six degrees
of freedom (25). Linear combinations of estimated regression parameters and variances were
used to capture differences in the log odds of ICD-9 sections between cohort pairs (e.g.,
GHC-CAT and MAR-CAT), and the associated odds ratios and confidence intervals were
displayed graphically. Because ICD-9 sections were modeled individually, the covariance
matrix required for stage 2 was estimated using a stratified bootstrap approach (26).
Specifically, at each of 1500 replicates, a bootstrap sample was ascertained for each site
separately, section-specific models were fitted, and parameter estimates were saved. The
covariance matrix was estimated across bootstrap replications (26).
Stage 2: In stage 2, ICD-9 section-specific parameter and covariance estimates from Stage 1
were combined to obtain a measure of cohort dissimilarity. The measure can be described as
a modified Mahalanobis Distance. Let β̂ = (β̂1, β̂2,…, β̂S)t be the vector of estimated sectionspecific differences in the log odds (i.e., the log odds ratio) for two populations estimated at
stage 1, and V̂ ≡ V̂ (β̂) be the estimated variance-covariance matrix. For ease of exposition,
we remove ^ from our notation. We define the morbidity dissimilarity index (MDI) with,
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 6
NIH-PA Author Manuscript
where, W = kV−1, V−1 is the inverse of the variance-covariance matrix V and k = 1/tr(V−1) is
the inverse of the trace (sum of the diagonal elements) of V −1. The MDI differs from a
Mahalanobis distance by the coefficient k, which serves to rescale the measure so it is
independent of the magnitude of section-specific variances, and therefore of sample size.
Since estimated variances decrease with sample size, Mahalanobis distance necessarily
increases with sample size. So, if the goal is classification, the Mahalanobis distance is
appropriate; however, our interest is in a sample and interpretable measure of cohort
dissimilarity.
When all variances are equal and in the absence of correlation among parameter estimates,
the MDI is equal to the Euclidean distance between (β1, β2,…, βS) and the origin (0, 0, … 0)
divided by the square root of S. The MDI is on the same scale as the components of β and
therefore, its value has a meaningful interpretation. In contrast, to interpret the Euclidean
distance we must know the dimension of β. For example, consider the scenario where S=10
and β = (1,1,…,1). It is easy to show that the MDI is equal to 1 thereby providing an
insightful measure of how large components of β are; however, the Euclidean distance is
approximately 3.2, which we find to be less insightful.
In the presence of unequal variances and correlation, MDI interpretation is subtle; however
proper acknowledgement of these important data features is crucial for characterizing cohort
dissimilarity validly. For simplicity, assume we wish to calculate the MDI from analysis of
NIH-PA Author Manuscript
are variances for β1 and β2 respectively, and ρ is the
two ICD-9 sections, where
estimated correlation. It is straightforward to show that the MDI is equal to
Upon inspection, it can be seen that MDI does not depend on the magnitude of
(i.e., it does not depend on sample size), but it is affected by their relative size and by ρ.
Figure 1 displays the impact of these data features on the MDI, and just as important, it
shows how misleading dissimilarity measures can be if data features are ignored. Panels are
NIH-PA Author Manuscript
defined by
and ρ, and in each panel, the solid and dashed black lines display the set of
all (β1, β2) that result in MDI values equal to 0.5 and 1.0, respectively. Notice that unequal
variance stretches or contracts and correlation rotates the parameter space, in that the set of
all points corresponding to MDI=0.5 differs across panels in the figure. The point (1.5, 0.5)
is denoted on all panels as a reference point, and the MDI for (1.5, 0.5) in panels a), b),c),
and d), is 1.12, 0.83, 0.86, and 0.51, respectively. That is, if the data structure is given by
panel d), and we ignore the correlation and the differences in variances (e.g., by assuming
panel a) is true) then we will overestimate dissimilarity by more than two-fold on the log
odds ratio scale. With proper analyses, the MDI effectively addresses unequal variances and
correlation. Thus, simpler indices that ignore their impact are not recommended.
D. Results
Demographic characteristics and subject experiences of 17,070 patients observed from
January 1, 2001 to December 31, 2007 from eMERGE network sites are shown in table 1.
The NU-T2D cohort was the most racially diverse with minorities representing 36 percent of
its sample. The proportion of female subjects ranged from 36% in MAY-PAD to 70% in
VU-QRS samples. The GHC-CAT sample was the oldest, with 76 percent of patients being
born prior to 1928. This was due to the requirement that patients included in this sample
must also qualify for a study on dementia in the elderly. MAY-PAD and MAY-CON cohorts
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 7
NIH-PA Author Manuscript
were observed on the fewest number of days with median values equaling 44, while the
medians in the other populations ranged from 76 to 124 days. The two cohorts with the
fewest number of unique codes were the type 2 diabetes controls at NU and VU, where the
median number of unique ICD-9 categories, sections and chapters observed were 11, 9, and
6, and 7, 6, and 4, respectively.
NIH-PA Author Manuscript
Figure 2 displays the raw prevalence of co-morbidities in several phenotype-site cohort pairs
for ICD-9 categories, sections, and chapter using Bland-Altman plots (27), with codes used
to define cohorts (250.* codes for type II diabetics; 366.*, 374.*, 385.*, 743.3*, 744.3,
742.3, and 753.0 for senile cataracts; 440, 440.2, 433.*, 433.*, 434.*, 435.*, 436.*, 437.*,
438.*, 441.*, 442.*, 443.*, and 444.* for peripheral arterial disease) having been removed.
While these plots have limitations since they are not adjusted for demographic and other
characteristics, they demonstrate interesting patterns. The common site – case versus control
plots (NU-T2D versus NU-CON and MAY-PAD versus MAY-CON) in the first two rows
of panels show that the cases tend to exhibit a higher prevalence of co-morbidities than their
associated controls, though this is more pronounced in the NU plots than in the MAY plots.
Chapter level rates between MAY-PAD and MAY-CON appear reasonably similar to one
another while even at this highly aggregated level of summarization the NU-T2D cohort
tends to exhibit higher rates of morbidities than does its control cohort. The lower two rows
of plots display common phenotypes compared across different sites (NU-T2D versus VUT2D and MAR-CAT versus GHC-CAT). The co-morbidity profiles in these pairs of cohorts
are more similar to one another than in the upper two panels. NU-T2D patients tend to
experience slightly higher rates of morbidities than VU-T2D patients, though MAR-CAT
and GHC-CAT populations appear comparable to one another except in one morbidity
category (indicated by the outlying, uppermost point in each of the plots in the bottom row).
NIH-PA Author Manuscript
Figures 3 and 4 display the results from stage 1 of the analysis protocol. They show the
adjusted odds ratios based on multiple logistic regression models for the 66 ICD-9 sections,
ordered alphabetically by ICD-9 chapter and then by section. The size of the plotting points
is inversely related to the confidence interval length, although we limited the size of points
when confidence intervals were tight, and “X” denotes a very large odds ratio with the lower
confidence bound being greater than 20. In Figure 3, we show within-site, case versus
control comparisons at NU, VU, and MAY, and in Figure 4 we show two, same-phenotype,
different-site comparisons (GHC-CAT versus MAR-CAT and NU-T2D versus VU-T2D),
and a different-phenotype, different-site comparison (MAY-PAD versus VU-QRS).
Consistent with Figure 2, morbidity profiles in the cohorts with the same phenotype, but at
different sites (Figures 4a and 4b), are more similar to one another than cases versus controls
at the same site (Figure 3a, 3b, and 3c) and different phenotypes at different sites (Figure 4c)
as odds ratios tend to be closer to one.
Figures 3 and 4 highlight important patterns of differences between pairs of cohorts. For
example, while the GHC-CAT and MAR-CAT populations appear to have similar profiles
(Figure 4), we observe that ICD-9 sections “neoplasms of uncertain behavior” and
“dislocation” occur at higher rates at GHC than at MAR, and section “other metabolic and
immunity disorders” occurs at a much higher rate at MAR than at GHC. This was also
observed in Figure 2. Compared with their controls, adjusted co-morbidity risk was higher
for NU-T2D and VU-T2D cohorts over the range of ICD-9 sections, though this result
appear less pronounced for the MAY-PAD versus MAY-CON comparison. Figure 4c shows
that the MAY-PAD cohort tended to exhibit higher rates of nervous system and (as
expected) circulatory system disorders than the VU-QRS cohort though the opposite was
true for neoplasms.
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 8
NIH-PA Author Manuscript
Simple numerical summaries describing differences among populations are complementary
and sometimes preferred to the graphical depictions of individual differences, such as in
figures 3 and 4. The MDI from stage 2 of the analysis protocol for select cohort pairs are
shown in table 2 for ICD-9 sections within chapters and then over the range of all sections.
Among the pairwise comparisons, the two same-phenotype, different-site cohorts appeared
most similar to one another with an overall MDI of 0.47 for the CAT cohorts (column 5) and
0.44 for the T2D cohorts (column 6). While these values imply non-trivial differences
between the cohorts with the same phenotypes at different sites, it is worth noting that the
overall MDI for NOR-T2D versus VAN-T2D is just over half the size of the MDI for VANT2D versus VAN-CON (MDI=0.80) and for NOR-T2D versus NOR-CON (MDI=0.82).
That is, the impact of type II diabetes on the overall morbidity profile is approximately 80%
larger than the impact of site. Focusing further on ICD-9 sections in the “Endocrine,
metabolic and nutritional immunity” ICD-9 chapter, the impact of type II diabetes is at least
150% larger than the impact of site, where MDI is equal to 2.10, 2.18, and 0.83 for VANT2D versus VAN-CON, NOR-T2D versus NOR-CON, and NOR-T2D versus VAN-T2D,
respectively. ICD-9 sections within the “musculoskeletal system and connective tissue”
ICD-9 chapter appeared to be least associated with sites and phenotypes as MDI values
ranged from 0.16 for MAY-PAD versus MAY-CON, to 0.50 for the VU-T2D versus VUCON.
NIH-PA Author Manuscript
It should be noted that with finite samples, the MDI measure would be non-zero even when
cohorts are randomly sampled from the same populations. However, with large samples
such as those discussed here, under random sampling from a single population, it will be
very close to zero. We conducted all analyses having repeatedly and randomly reassigned
cohort identifiers (e.g., using a Monte-Carlo based randomization approach to simulate
random samples from a single population). After rounding to the nearest hundredth, none of
the values corresponding to those shown in table 2 exceeded 0.02.
E. Discussion
NIH-PA Author Manuscript
We have proposed a general two-stage analysis approach for systematic characterization of
co-morbidity profile differences between cohorts derived from EMRs. The strategy involves
regression modeling over a range of ICD-9 sections, graphical displays of results, and
summarization of the differences with the MDI for broader insights. Results from first and
second stage analyses are complementary, and the breadth of the co-morbidities one chooses
to examine depends on study objectives. If the objective is to characterize dissimilarity
broadly (e.g., comparing the differences between two hospitals or finding the “nearest
neighbor” between two cohorts) then a diverse range of morbidities should be considered.
However, if the objective is to anticipate analytical challenges to a multicenter study (e.g.,
variance inflation or power reduction due to among site heterogeneity) where the target
phenotype has been identified but has not yet been ascertained, then the range of morbidities
to consider should be narrower and should be related to the target phenotype. The MDI is on
the same scale as parameters in logistic regression analyses, and so it has an intuitively
appealing interpretation. It can also be exponentiated if one wishes characterize dissimilarity
with odds ratios.
In the eMERGE study analysis, we found that cohorts with the same phenotypes at different
institutions appeared to have more similar morbidity profiles than those representing
different phenotypes, providing some reassurance for the planned network projects. We
intend to perform this analysis on many eMERGE projects prior to their implementation, as
results and implications will depend upon the phenotype. As more of the phenotype defined
populations become available, these and other data will better inform the development of
general guidelines for how ‘similar’ populations should be for pooled genetic or clinical
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 9
NIH-PA Author Manuscript
analysis. While the MDI can be interpreted as a measure of dissimilarity on the scale of log
odds ratios, its aptness for proactively capturing potentially heterogeneous site-specific
genotype-phenotype associations depends on a number of data features and perhaps most
importantly on the strength of the relationship between the morbidities that comprise it and
the phenotype of interest. The stronger the relationship, the more likely it is to be useful.
That being said, it can only be used as a guide since morbidity profile dissimilarity does not
capture genotype-phenotype association heterogeneity. As an area for future research, we
will explore various data features that impact the utility of the MDI for this aim. We will
also explore the utility of formally incorporating domain structure (i.e., the ICD-9
taxonomy) into the calculation of the overall MDI. In our two-stage approach, we
acknowledged domain structure explicitly by organizing figures 3 and 4 by ICD-9 chapters
and by calculating chapter-specific MDIs; however the domain structure was not implicit in
the calculation of the overall MDI. The formal incorporation of this structure will effectively
involve a reweighting of ICD-9
NIH-PA Author Manuscript
This analytical protocol is not limited to the ICD-9 coding system and could be used for
other classification schemes, such as CPT codes, medications given, or NLP-derived disease
codes mapped to controlled terminologies such as SNOMED or the UMLS. Using NLP may
improve recall and precision of disease identification (28)(29). One challenge, if mapping to
a vocabulary such as the UMLS, would be to aggregate codes at an appropriate level. For
instance, as discussed earlier, we found that performing the tests of associations using ICD-9
category codes (904 unique codes) provided insufficient counts of patients with each code to
allow for statistical analysis. Thus, a large percentage of possibly important codes would
have been removed from the analysis.
NIH-PA Author Manuscript
There are several limitations of this study. There are a number of known problems with
ICD-9 codes for diagnosis, including false positives and false negatives (30). At eMERGE
network institutions, professional coders typically entered inpatient codes, while outpatient
codes resulted from direct physician entry. Invalid or incorrect codes are often entered,
either from memory or from pre-populated lists (e.g., a type 1 diabetes code when a type 2
code is intended). Codes that are difficult to find or that do not lead to significant
reimbursement may be excluded. Some institutions arbitrarily limit the number of codes
stored in their data warehouse from a particular visit, while others do not, and some data
warehouses include both incorrect and corrected codes. The ICD-9 hierarchy itself is not
optimal for phenotypic analysis, since it is designed and maintained to support
administrative and billing operations. In addition, coding practices can vary among
practitioners within institutions and between institutions. We considered only diagnosis
codes and demographics in our comparisons, and due to age truncation, there is likely to be
residual confounding. Other health information, such as medication information and
procedures received, are important markers of the veracity and severity of disease and if
available could also be included in analyses. Finally, we did not utilize disease onset times.
It would be very interesting to conduct analyses that consider morbidity timing and
morbidity coding in relation to disease onset times. For example, one could examine how
coding practices change from before disease onset to after disease onset, or one could
examine coding trends leading up to the time of disease onset.
Future clinical and genomic research will benefit from deriving samples from diverse data
repositories. The ability to investigate rare diseases for genomic and environmental
influences will require aggregation of samples from multiple repositories. We present an
initial attempt to highlight and quantify the non-random influences of geographic and
provider practices to inform analysis of such data. More research is needed to study the
certainty of ICD-9 codes and use of other resources to improve the accuracy of co-morbidity
assessment and severity
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 10
References
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
1. Herzig SJ, Howell MD, Ngo LH, Marcantonio ER. Acid-suppressive medication use and the risk for
hospital-acquired pneumonia. JAMA 2009 May 27;301(20):2120–2128. [PubMed: 19470989]
2. Klompas M, Haney G, Church D, Lazarus R, Hou X, Platt R. Automated identification of acute
hepatitis B using electronic medical record data to facilitate public health surveillance. PLoS ONE
2008;3(7):e2626. [PubMed: 18612462]
3. Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, Nordyke RJ. Review: use of electronic medical
records for health outcomes research: a literature review. Med Care Res Rev 2009 Dec;66(6):611–
638. [PubMed: 19279318]
4. Denny JC, Arndt FV, Dupont WD, Neilson EG. Increased hospital mortality in patients with bedside
hippus. Am. J. Med 2008 Mar;121(3):239–245. [PubMed: 18328309]
5. Pakhomov S, Weston SA, Jacobsen SJ, Chute CG, Meverden R, Roger VL. Electronic medical
records for clinical research: application to the identification of heart failure. Am J Manag Care
2007 Jun;13(6 Part 1):281–288. [PubMed: 17567225]
6. Seyfried L, Hanauer DA, Nease D, Albeiruti R, Kavanagh J, Kales HC. Enhanced identification of
eligibility for depression research using an electronic medical record search engine. Int J Med
Inform 2009 Dec;78(12):e13–e18. [PubMed: 19560962]
7. Burton PR, Hansell AL, Fortier I, Manolio TA, Khoury MJ, Little J, et al. Size matters: just how big
is BIG?: Quantifying realistic sample size requirements for human genome epidemiology. Int J
Epidemiol 2009 Feb;38(1):263–273. [PubMed: 18676414]
8. Manolio TA. Collaborative genome-wide association studies of diverse diseases: programs of the
NHGRI's office of population genomics. Pharmacogenomics 2009 Feb;10(2):235–241. [PubMed:
19207024]
9. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986 Sep;7(3):177–
188. [PubMed: 3802833]
10. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic
comorbidity in longitudinal studies: development and validation. J Chronic Dis 1987;40(5):373–
383. [PubMed: 3558716]
11. Elixhauser A, Steiner C, Harris DR, Coffey RM. Comorbidity measures for use with administrative
data. Med Care 1998 Jan;36(1):8–27. [PubMed: 9431328]
12. van Walraven C, Austin PC, Jennings A, Quan H, Forster AJ. A modification of the Elixhauser
comorbidity measures into a point system for hospital death using administrative data. Med Care
2009 Jun;47(6):626–633. [PubMed: 19433995]
13. Knaus WA, Zimmerman JE, Wagner DP, Draper EA, Lawrence DE. APACHE-acute physiology
and chronic health evaluation: a physiologically based classification system. Crit. Care Med 1981
Aug;9(8):591–597. [PubMed: 7261642]
14. Groll DL, To T, Bombardier C, Wright JG. The development of a comorbidity index with physical
function as the outcome. J Clin Epidemiol 2005 Jun;58(6):595–602. [PubMed: 15878473]
15. Cao H, Melton GB, Markatou M, Hripcsak G. Use abstracted patient-specific features to assist an
information-theoretic measurement to assess similarity between medical cases. J Biomed Inform
2008 Dec;41(6):882–888. [PubMed: 18487093]
16. Cavalli-Sforza LL, Edwards AW. Phylogenetic analysis. Models and estimation procedures. Am. J.
Hum. Genet 1967 May;19(3 Pt 1):233–257. [PubMed: 6026583]
17. Cavalli-Sforza LL, Feldman MW. The application of molecular genetic approaches to the study of
human evolution. Nat. Genet 2003 Mar;33 Suppl:266–275. [PubMed: 12610536]
18. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components
analysis corrects for stratification in genome-wide association studies. Nat. Genet 2006 Aug;38(8):
904–909. [PubMed: 16862161]
19. Lee C, Abdool A, Huang C. PCA-based population structure inference with generic clustering
algorithms. BMC Bioinformatics 2009;10 Suppl 1:S73. [PubMed: 19208178]
20. The eMERGE network. [cited 2009 9/13]. Available from: http://www.gwas.net
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 11
NIH-PA Author Manuscript
NIH-PA Author Manuscript
21. Ioannidis JPA, Trikalinos TA, Khoury MJ. Implications of small effect sizes of individual genetic
variants on the design and interpretation of genetic association studies of complex diseases. Am. J.
Epidemiol 2006 Oct 1;164(7):609–614. [PubMed: 16893921]
22. UMLS Knowledge Source Server. [cited 2007 July 3]. Available from
http://umlsks/nlm/nih.gov/kss/
23. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, et al. PheWAS:
demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations.
Bioinformatics 2010 May 1;26(9):1205–1210. [PubMed: 20335276]
24. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, et al. Robust
replication of genotype-phenotype associations across multiple diseases in an electronic medical
record. Am. J. Hum. Genet 2010 Apr 9;86(4):560–572. [PubMed: 20362271]
25. Harrell, F. Regression modeling strategies : with applications to linear models, logistic regression,
and survival analysis. New York: Springer; 2001.
26. Efron, B. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
27. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of
clinical measurement. Lancet 1986 Feb 8;1(8476):307–310. [PubMed: 2868172]
28. Elkin, PL.; Ruggieri, AP.; Brown, SH.; Buntrock, J.; Bauer, BA.; Wahner-Roedler, D., et al. A
randomized controlled trial of the accuracy of clinical record retrieval using SNOMED-RT as
compared with ICD9-CM; Proc AMIA Symp; 2001. p. 159-163.
29. Li, L.; Chase, HS.; Patel, CO.; Friedman, C.; Weng, C. Comparing ICD9-encoded diagnoses and
NLP-processed discharge summaries for clinical trials pre-screening: a case study; AMIA Annu
Symp Proc; 2008. p. 404-408.
30. Aronsky D, Haug PJ, Lagor C, Dean NC. Accuracy of administrative data for identifying patients
with pneumonia. Am J Med Qual 2005 Dec;20(6):319–328. [PubMed: 16280395]
NIH-PA Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 12
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Figure 1.
Example Morbidity Dissimilarity Indices (MDI) for four configurations. MDIs were drawn
for
equal to (1, 0), (3,0), (1, 0.75), and (3, 0.75) in panels a), b), c), and d),
and ρ is the correlation between β1 and β2.
respectively, where
Different values of correlations (ρ) effectively alter the angle between the axes, as shown in
panels c and d. The solid and dashed contours display the set of all (β1, β2), that yield MDI
equal to 0.5 and 1.0, respectively.
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 13
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Figure 2.
Bland-Altman plots comparing unadjusted rates of ICD-9 categories, sections, and chapters
for pairs of populations.
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 14
NIH-PA Author Manuscript
Figure 3.
Adjusted odds ratios based comparing VU-T2D to VU-CON, NU-T2D to NU-CON, and
MAY-PAD to MAY-CON. The symbol “X” denotes an extremely high odds ratio whose
lower confidence limit exceeds 20.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Schildcrout et al.
Page 15
NIH-PA Author Manuscript
Figure 4.
Adjusted odds ratios based comparing GHC-CAT to MAR-CAT, NU-T2D to VU-T2D, and
MAY-PAD to VU-QRS. The symbol “X” denotes an extremely high odds ratio whose lower
confidence limit exceeds 20.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Table 1
GHC-CAT
MAR-CAT
MAY-CON
MAY-PAD
NU-CON
NU-T2D
VU-CON
VU-QRS
VU-T2D
2217
2614
1181
972
850
672
2236
1055
5273
African American
0.04
0
0
0
0.08
0.23
0.09
0.13
0.18
Asian
0.03
0
0
0
0
0
0.01
0.01
0.01
Other*
0.01
0
0
0
0.07
0.13
0.02
0.01
0.02
Unknown†
0.02
0
0.03
0.02
0
0
0.12
0.01
0.02
White
0.90
0.99
0.96
0.98
0.85
0.64
0.76
0.84
0.77
N
Ethnicity
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Female
0.62
0.58
0.43
0.36
0.65
0.52
0.64
0.7
0.53
Born before 1928
0.76
0.38
0.04
0.22
0.01
0.04
0.05
0.03
0.08
Age if born in or after 1928
Years of observation
70 (65, 73)
65 (53, 72)
60 (52, 69)
64 (50, 72)
41 (27, 59)
55 (40, 68)
46 (25, 64)
48 (28, 64)
53 (33, 68)
6.7 (4.6, 6.9)
6.7 (5.8, 6.9)
6.3 (4.5, 6.9)
6.3 (4, 6.9)
5.6 (3.6, 6.8)
6.3 (3.8, 6.9)
5.5 (3.4, 6.7)
6 (3.7, 6.8)
6.3 (3.8, 6.9)
Unique visit days
97 (45, 198)
101 (44, 198)
44 (13, 152)
44 (14, 143)
76 (25, 186)
86 (32, 219)
95 (32, 226)
113 (35, 238)
124 (43, 244)
Total ICD9s
221 (93, 483)
215 (90, 452)
159 (50, 496)
160 (51, 503)
196 (47, 579)
243 (62, 640)
193 (57, 520)
221 (67, 574)
249 (84, 604)
Unique ICD-9s
62 (37, 116)
62 (34, 114)
46 (21, 99)
48 (21, 117)
45 (15, 101)
52 (16, 101)
53 (19, 109)
60 (22, 118)
63 (26, 117)
Unique categories
36 (19, 63)
34 (18, 57)
21 (10, 38)
27 (11, 51)
11 (5, 25)
26 (10, 52)
7 (2, 16)
15 (5, 34)
21 (8, 47)
Unique sections
23 (13, 35)
20 (11, 31)
15 (8, 23)
17 (8, 29)
9 (4, 17)
18 (7, 30)
6 (2, 12)
11 (4, 22)
15 (6, 28)
Unique chapters
11 (7, 13)
10 (7, 13)
9 (6, 11)
9 (5, 12)
6 (3, 10)
9 (5, 12)
4 (2, 8)
7 (3, 11)
8 (4, 12)
Schildcrout et al.
Demographic characteristics of the nine eMERGE populations under study between January 1, 2001 to December 31, 2007
*
“Other” ethnicities include Hispanics, Pacific Islander, American Indians, and individuals reporting multiple ethnicities.
†
“Unknown” ethnicity indicates that no value for this field was recorded in the EMR.
Categorical variables are summarized with proportions and continuous variables are summarized with, 50th (10th, 90th) percentiles.
Page 16
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Table 2
Chapter
Congenital anomalies
VU-T2D vs VUCON
NU-T2D vs NUCON
MAY-PAD vs
MAY-CON
GHC-CAT vs
MAR-CAT
NU-T2D vs VANT2D
VAN-QRS vs
MAY-PAD
1.10
0.73
0.30
0.77
0.55
0.94
J Biomed Inform. Author manuscript; available in PMC 2011 December 1.
Digestive system
0.88
0.70
0.24
0.39
0.39
0.74
Diseases blood and blood-forming organs
1.70
1.30
0.70
0.41
0.18
0.57
Diseases of the circulatory system
1.71
1.35
1.44
0.34
0.49
1.70
Diseases of the genitourinary system
1.02
0.94
0.88
0.48
0.43
0.51
Diseases of the respiratory system
0.95
0.92
0.61
0.27
0.38
0.64
Diseases of the skin and subcutaneous tissue
0.62
0.68
0.56
0.29
0.52
0.69
Endocrine nutritional metabolic immunity
2.10
2.18
0.91
1.20
0.83
1.47
Infectious and parasitic diseases
1.32
0.65
0.59
0.48
0.51
0.64
Injury and poisoning
0.95
1.20
1.07
0.56
0.53
1.15
Mental disorders
0.84
0.81
0.45
0.47
0.82
0.63
Musculoskeletal system and connective tissue
0.50
0.34
0.16
0.36
0.20
0.26
Neoplasms
0.58
0.51
0.51
0.66
0.56
0.81
Nervous system and sense organs
0.73
0.81
0.72
0.28
0.61
0.82
Across all ICD-9 sections
0.80
0.82
0.66
0.47
0.44
0.75
Schildcrout et al.
Morbidity Dissimilarity Index for cohort pairs
Page 17