
A method for inferring medical diagnoses from

patient similarities
Gottlieb et al.

Gottlieb et al. BMC Medicine 2013, 11:194


http://www.biomedcentral.com/1741-7015/11/194

RESEARCH ARTICLE Open Access

Assaf Gottlieb1*, Gideon Y Stein2,3, Eytan Ruppin2,4, Russ B Altman1 and Roded Sharan4*

Abstract
Background: Clinical decision support systems assist physicians in interpreting complex patient data. However,
they typically operate on a per-patient basis and do not exploit the extensive latent medical knowledge in
electronic health records (EHRs). The emergence of large EHR systems offers the opportunity to integrate
population information actively into these tools.
Methods: Here, we assess the ability of a large corpus of electronic records to predict individual discharge
diagnoses. We present a method that exploits similarities between patients along multiple dimensions to predict
the eventual discharge diagnoses.
Results: Using demographic, initial blood and electrocardiography measurements, as well as medical history of
hospitalized patients from two independent hospitals, we obtained high performance in cross-validation (area
under the curve >0.88) and correctly predicted at least one diagnosis among the top ten predictions for more than
84% of the patients tested. Importantly, our method provides accurate predictions (>0.86 precision in cross
validation) for major disease categories, including infectious and parasitic diseases, endocrine and metabolic
diseases and diseases of the circulatory system. Our performance applies to both chronic and acute diagnoses.
Conclusions: Our results suggest that one can harness the wealth of population-based information embedded in
electronic health records for patient-specific predictive tasks.
Keywords: Patient similarity, Electronic health records, Diagnosis prediction

* Correspondence: assafgo@stanford.edu; roded@post.tau.ac.il
1 Departments of Bioengineering & Genetics, Stanford University, 318 Campus Drive, Stanford 94305, USA
4 Blavatnik School of Computer Science, Tel-Aviv University, Klausner St., Tel Aviv 69978, Israel
Full list of author information is available at the end of the article

© 2013 Gottlieb et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

Background
Over several decades, the vision of automatic systems assisting and supporting clinical decisions produced a
plethora of clinical decision support systems [1-4], including diagnostic decision support systems for inferring
patient diagnosis. These methods typically focus on a single patient and apply manually or automatically
constructed decision rules to produce a diagnosis [2,5,6]. At the same time, health care is undergoing tremendous
changes as medical information is digitized and archived in a structured fashion. Electronic health records (EHRs)
promise to revolutionize the processes by which patients are administered, hospitalized and discharged [7],
improve safety [8] and allow the conduct of post-hospitalization outcome research [9]. This large corpus of
population-based records is increasingly used in the context of clinical decision making for the individual
patient [10]. Nevertheless, there still seems to be no consistent association between EHRs and clinical decision
support systems (CDSS) and better quality of care [11].
Recently, several methods have been released for predicting certain patient outcomes using large cohorts of
patients. Two such examples are the detection of heart failure more than six months before the actual date of
clinical diagnosis [12] and inference of patient prognosis based on patient similarities [13]. These methods,
however, use the patient diagnosis for the learning task.
In this paper, we address a different, fundamental challenge – can we leverage the corpus of EHR patient data,
even with well-documented quality issues [14], to infer the discharge diagnosis of patients using minimal medical
data upon hospitalization. We introduce an automated method that exploits patient records for inferring

an individual patient discharge diagnosis. For this task, we use basic patient-specific information gathered at
admission, including medical history, blood tests, electrocardiography (ECG) results and demographics to
identify similar patients, subsequently predicting patient outcomes. We test our method on two diverse sets of
patients admitted to internal medicine departments in large medical centers in the United States and Israel,
obtaining high precision and recall, suggesting that such systems may eventually be useful in the setting of
assisting physicians with medical decisions, hospital planning and short-term resource allocation.

Methods
Data description
We obtained two EHR datasets from two hospitals: (i) 9,974 patients with 15,498 admissions, admitted in several
wards belonging to internal medicine (for example, cardiology, oncology) or neurology over the course of two
years from the Stanford Medical Center, CA, USA (USA dataset); and (ii) 5,513 patients with 7,070 admissions in
internal medicine wards at the Rabin Medical Center, Israel between May 2010 and February 2012 (660 days; ISR
dataset). Each dataset includes patient demographics (gender and age), medical history (International
Classification of Diseases, Clinical Modification codes (ICD-9-CM) from past in- and out-patient encounters) and
hospitalization specific information including blood test results and discharge diagnoses, coded as ICD-9 codes.
A subset of the patients in the USA dataset includes ECG measurements, while the ISR dataset (7,261 patients)
also contains ICD codes assigned upon admission. The USA dataset includes 86 commonly administered blood tests
(after filtering, see below) and the ISR dataset includes 19 blood tests. Both patient cohorts include only
urgent (non-elective) admissions and a roughly equal number of females and males. Both datasets cover the entire
adult age spectrum (USA patients range between 15 and 90 years and ISR patients between 20 and 110), but the ISR
cohort is skewed towards older patients (USA median age is 63 and ISR is 73, where 82% of ISR patients are above
60 while only 55% of the USA patients are).
In addition, we obtained records of the Healthcare Cost and Utilization Project (HCUP) of the Nationwide
Inpatient Sample (NIS) of 2009 which contains more than 55 million associations between 5.8 million patients and
1,125 third level discharge ICD codes. The latter data were used to enhance the computation of ICD similarities,
as described below.
The ICD codes in the EHR data included 469 (USA) and 396 (ISR) third level ICD codes (diagnostic and procedural
codes). We excluded supplementary classification codes (codes starting with E or V) and several first level
categories including complications of pregnancy (630 to 679) and codes in the range 740 to 999 for being
uninformative (for example, general symptoms), a known condition (for example, congenital anomalies) or
incidental conditions (for example, injuries or poisoning). We retained supplementary classification codes V40
to V49 – 'persons with a condition influencing their health status' for being indicative of procedures a patient
underwent.
As a sanity check, we extracted the ICD codes that were enriched in patients with extreme blood test values
relative to other patients (hypergeometric test, false discovery rate (FDR) = 0.01) and verified that these
corresponded to common knowledge associations, for example, various ICDs coding for cancer are enriched within
patients with high lactic dehydrogenase values [15] or the troponin-t test is indicative of acute myocardial
infarction [16] [See Additional file 1: Table S1 for the full association list].
The patients were de-identified by using a randomly generated patient id. The study was approved by the
Institutional Review Board of Stanford and by the Helsinki Committee of the Rabin Medical Center.

Similarity measure construction
In order to infer patient diagnosis, we computed a set of ten patient similarities. We computed two ICD
similarity measures (1–2) and eight similarity measures between hospitalizations (3–10). All similarity measures
were normalized to the range [0, 1]. We used the following ICD code similarities:

(1) ICD code similarity: We used the levels of the ICD codes in the ICD coding hierarchy to measure the
similarity between ICD codes $c_i$ and $c_j$ as $S(c_i, c_j) = \frac{\mathrm{NCA}(c_i, c_j)}{\#\mathrm{levels}}$,
where NCA is the level of the nearest common ancestor and #levels are the number of levels in the ICD hierarchy
(five levels) (see [17] for similar measures). When using third level codes, the number of levels equals three
(the third, fourth and fifth levels).
(2) Empirical co-occurrence frequency: We used the HCUP data to compute empirical co-occurrences between ICD
codes. Computing the number of co-occurrences of an ICD pair across all patients, we first computed the Jaccard
score [18] between each pair. In order to transform the Jaccard score to a similarity measure, we randomly
shuffled the associations of ICD codes to patients, keeping the overall ICD distribution as well as the
per-patient ICD counts fixed. We then computed the similarity as the percentage of times the co-occurrence score
was higher than the random shuffles.
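The empirical co-occurrence similarity (measure 2) could be sketched as follows. The text does not say how the shuffling was implemented; this sketch assumes a simple edge-swap randomization, which keeps each patient's number of codes and each code's number of patients fixed. All function names, the number of swaps and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import defaultdict


def jaccard(set_a: set, set_b: set) -> float:
    """Jaccard score of two patient sets (0.0 when both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def patients_by_code(assoc):
    """Map each ICD code to the set of patients associated with it."""
    index = defaultdict(set)
    for patient, code in assoc:
        index[code].add(patient)
    return index


def degree_preserving_shuffle(assoc, n_swaps, rng):
    """Assumed null model: swap the codes of two random (patient, code) pairs.

    A swap is skipped when it would change nothing or create a duplicate pair,
    so per-patient code counts and per-code patient counts stay fixed.
    """
    edges = list(set(assoc))
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (p1, c1), (p2, c2) = edges[i], edges[j]
        new_i, new_j = (p1, c2), (p2, c1)
        if p1 == p2 or c1 == c2 or new_i in present or new_j in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new_i, new_j}
        edges[i], edges[j] = new_i, new_j
    return edges


def empirical_similarity(assoc, code_a, code_b, n_shuffles=200, seed=0):
    """Fraction of shuffles in which the observed Jaccard score is higher."""
    rng = random.Random(seed)
    index = patients_by_code(assoc)
    observed = jaccard(index[code_a], index[code_b])
    wins = 0
    for _ in range(n_shuffles):
        shuffled = patients_by_code(
            degree_preserving_shuffle(assoc, 10 * len(assoc), rng))
        if observed > jaccard(shuffled[code_a], shuffled[code_b]):
            wins += 1
    return wins / n_shuffles


# Toy data: codes "A" and "B" always co-occur, as do "C" and "D".
toy = ([(p, "A") for p in range(6)] + [(p, "B") for p in range(6)]
       + [(p, "C") for p in range(6, 12)] + [(p, "D") for p in range(6, 12)])
print(empirical_similarity(toy, "A", "B", n_shuffles=50))
```

The returned value is a fraction in [0, 1], matching the normalization used for all similarity measures. On real data the number of shuffles and swaps would need tuning.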

We used the following inter-patient similarity measures:

(3–4) Medical history: Each patient may possess medical history from three sources: (i) past encounters with
local health providers (digitally connected to the medical center); (ii) discharge codes of past
hospitalizations; and (iii) personal history ICD codes provided in the current hospitalization (ICD codes V01 to
V15, V40 to V49 and V87). The union of these three sources constitutes the patient medical history profile. To
compute the similarity of two such profiles, we form a bipartite graph over the member ICD codes, connecting two
codes in the two profiles by an edge whose weight is the similarity between the codes. Our similarity score is
the value of a maximal matching in this graph normalized by the smaller history set size. We performed the
maximal matching computation using either of the two ICD similarity measures, resulting in two similarity
measures.
(5–6) Blood test similarity: We used only the chronologically first blood test of each type, performed upon
admission for each hospitalization, retaining only blood test results obtained during the first three days of
hospitalization. We filtered blood tests that were performed in less than 5% of the hospitalizations and those
for which the difference in distribution between patients with the same diagnosis and patients without shared
diagnosis was not statistically significant (Wilcoxon rank-sum test, FDR <0.01). This left us with 86 blood
tests for the USA dataset and 19 blood tests for the ISR set. Each blood test was then normalized by converting
it to a z-score, with the mean and standard deviation measured across the initial blood tests of all patients.
Most of the patients had undergone only a partial set of the tests. We removed patients having fewer than three
available blood tests and computed the similarity between a pair of hospitalizations based on the values of the
blood tests common to the two hospitalizations, where patients sharing fewer than three blood tests between them
received the minimal similarity score of zero. We formed two types of similarities: (i) using the entire set of
blood tests common to any two hospitalizations, we computed the Euclidean distance between the z-score vectors,
normalized by their length; and (ii) the average of differences in absolute values between the blood tests with
the highest z-score for each patient. The distance $D_{ij}$ between patients i and j was converted to a
similarity value by linear transformation.
(7–8) ECG similarity: The ECG values included eight interval values as well as the heart rate. Similarly to the
blood tests, we used only the chronologically first measurement, performed upon admission for each
hospitalization, obtained during the first three days of hospitalization. Each ECG measurement had undergone the
same normalization and similarity construction as the blood tests.
(9) Age similarity: In order to give precedence to age differences in younger age, we computed the similarity
between two patients $p_i$ and $p_j$ as $S(p_i, p_j) = 1 - \frac{|p_i - p_j|}{\max(p_i, p_j)}$.
(10) Gender similarity: defined as 1 if the two patients have the same gender and 0 otherwise.

Combining similarity measures to classification features
The framework we used scores a hypothetical association according to its maximal similarity to a known,
gold-standard, set of associations. In our case, we scored associations between hospitalization records and ICD
codes based on the highest similarity to the known discharge codes in the background corpus of previously
hospitalized patients (disregarding similarities to previous hospitalizations of the same patient).
Specifically, the features used to classify hospitalization-primary discharge ICD code pairs were constructed
from scores computed for each combination of an ICD-similarity measure and a similarity measure between patient
hospitalizations (see previous section for details), resulting in 16 features overall (12 without the ECG
similarities). For each such pair of similarity measures, the score of a potential discharge code I for a given
hospitalization H is computed by considering the similarity to known discharge codes associated with other
hospitalizations (excluding other hospitalizations of the same patient) (I' and H'). The computation is done as
follows: First, for each known association (H',I') we compute the inter-hospitalization similarity S(H,H') and
the ICD codes similarity S(I,I'). Next, we follow the method of [19] to combine the two similarities to a single
score by computing their geometric mean. Thus:

$$\mathrm{Score}(H, I) = \max_{H', I' \neq H, I} \sqrt{S(H, H') \cdot S(I, I')} \qquad (1)$$

Performance evaluation
We used the MATLAB implementation of the logistic regression classifier (glmfit function with binomial
distribution and logit link) for the prediction task. We used a 10-fold cross validation scheme to evaluate the
precision of our prediction algorithm. The training set used for the cross validation included 41,036 USA
associations between hospitalizations and discharge codes and 14,506 ISR associations. We considered two types
of negative sets, the same size as the positive set in each training set:

(i) randomly sampling for each patient a diagnosis from the 469 (USA) or 396 (ISR) third level ICD codes
(excluding true diagnoses for that patient), termed 'pre-admission'; and (ii) randomly sampling a set of
potential release codes for each hospitalization, termed 'post-admission.' Specifically for the second negative
set scenario, we inspected the available admission diagnoses reported upon hospitalization (lacking from the USA
dataset) and included the set of discharge diagnoses of all the patients who shared the same admission diagnosis
(excluding the true discharge diagnosis for that hospitalization). As an example, the potential negative set for
a patient admitted with chest pain includes the discharge diagnoses of all other patients admitted with chest
pain, excluding the true final diagnoses of that patient. Additionally, we removed self-similarities of patients
(that is, similarities between hospitalizations of the same patient) to avoid bias for patients with recurrent
admissions. To obtain robust area under the curve (AUC) score estimates, we performed 10 independent cross
validation runs, selecting a different negative set and a different random partition of the training set to 10
parts in each; we then averaged the resulting AUC scores. As expected, taking a negative set of size five, ten or
twenty times the size of the positive set had a negligible effect on the resulting AUC score (AUC difference
less than 0.002).
In order to apply our method in a scenario that mimics the admission of new patients, we split the
hospitalizations into training and validation subsets. For the ISR data, we used the available admission date to
select hospitalizations that spanned the first year of our data (July 2010 to June 2011) as our training set and
validated on hospitalizations occurring in the subsequent 211 days, totaling 999 hospitalizations. For the USA
data, we split the data into train and test sets (two thirds and a third, respectively) using the available
sequential ordering of their admission dates. As with the cross-validation scheme, we masked similarities
between hospitalizations of the same patient. We computed the precision of our predictions by counting the
number of patients for which the top predicted discharge code was the same as one of its true diagnoses.
Similarly, we also computed the performance when testing whether the true discharge code of a patient appeared
in the top two predictions, top three and up to the top ten predictions per patient.
In order to identify ICD codes that are significantly correctly predicted, we compared the number of correct
predictions for each ICD code against a background of 105 randomly shuffled patient-diagnosis associations sets.

Results
The inference framework
Our objective was to test whether a minimal amount of patient information, available upon admission in EHRs,
can be integrated with a background corpus of previous patients to infer the patient's primary discharge ICD
codes (including both diagnoses and procedure codes). The patient information we used for this task includes
medical history, the results of the first administered blood and ECG tests and demographics (Methods). To this
end, we defined novel diagnosis- and patient-similarity measures, allowing us to exploit the similarity-based
inference framework of [19] for inferring associations between hospitalization records and primary discharge ICD
codes (see Methods and Figure 1 for an overview).
In order to gain insights about the global properties of the medical history, blood test and ECG similarities,
we first examined the networks formed by associating an individual patient with the closest matching patient in
the historical database. Interestingly, the networks formed by these similarities show marked differences
(consistent across the two EHR datasets). While medical history similarities tend to connect patients into big
clusters, blood test and ECG similarities display highly disconnected sub-networks [See Additional file 2:
Figures S1A-C and Additional file 3: Figure S2A-B]. The integration of similarity measures with markedly
different properties boosts classification performance (as displayed in Additional file 3: Figure S2).

Prediction of discharge ICD-9 codes
We focused on inferring the primary discharge codes for the hospitalization, as they encompass the most crucial
piece of information for the caring physician. Due to information content and ICD code usage differences between
the two datasets, we train and predict on each dataset independently (see also Discussion for expansion). Our
EHR datasets included a set of ranked discharge codes assigned by hospital specialists based on coded and
unstructured clinical data in the patient record. We selected a gold standard of 'primary' discharge codes
consisting of the two top-ranked discharge codes per patient. In the case of the ISR dataset we added a sparse
set of release codes assigned by the physician (accompanying the free-text release notes) totaling 2.2 ± 1.2
codes per patient on average. Overall, our set included 469 and 396 third level ICD diagnostic and procedural
codes for the USA and ISR datasets, respectively (Methods).
In order to validate our predictions, we first applied a 10-fold cross validation scheme. In selecting the
negative set, we considered two scenarios: (i) sampling of the entire set of false ICD codes (termed
'pre-admission', see Methods); and (ii) a more realistic case, available only in the ISR dataset, in which we
sample only from the potential discharge diagnoses that a physician might consider based on the patient
admission diagnoses (termed

'post-admission'). We summarize the cross-validation results for several scenarios in Table 1, showing that our
results are highly robust to differences in information content and across datasets (AUC >0.88). However, the
'post-admission' scenario proved to be a more demanding task due to the need to differentiate between more
similar diagnoses, obtaining a lower AUC score (AUC = 0.77). More importantly, the highest ranking prediction
for each hospitalization was correct in 93% (± 0.4%) and 92% (± 0.3%) for the USA and ISR datasets, respectively
(85% ± 0.4% in the post-admission scenario).

Figure 1 A schematic view of the method. Similarities between ICD codes and between hospitalizations are computed (A). A new patient is
scored according to the most similar patients with a certain diagnosis (B). A classifier is applied to select the top scoring diagnoses for this
patient (C). ICD, International Classification of Diseases.

Table 1 Performance in cross-validation experiments

Cross validation scenario              AUC             Best F1 measure   AUC, non-chronic patients
USA, 10K patients with ECG data        0.9 ± 9E-4      0.83 ± 0.001      0.89 ± 0.0009
USA, 15K patients without ECG data     0.89 ± 7E-4     0.82 ± 9E-4       0.88 ± 0.001
ISR, pre-admission scenario            0.88 ± 0.001    0.81 ± 0.001      0.86 ± 0.002
ISR, post-admission scenario           0.77 ± 0.002    0.73 ± 0.002      0.76 ± 0.003
Merged datasets                        0.87 ± 9E-4     0.81 ± 7E-4       0.86 ± 0.002

AUC, area under the curve; ECG, electrocardiography; ISR, Israel; USA, United States.

Analyzing the contribution of each feature, we observe that the features involving the hierarchy-based ICD
similarity outperformed features built with empirical co-occurrence ICD similarity. Analyzing the classification
power of each of the inter-patient similarity measures, we found that none was sufficient for obtaining the
overall AUC, with blood tests achieving slightly higher results than medical history or ECG as standalones
(AUC <0.85, Additional file 3: Figure S2). It is noteworthy that the medical history feature built using the
empirical ICD similarity performed much better in the USA dataset than the ISR dataset, possibly owing to the
fact that the empirical ICD similarities were built using an (independent) USA-based patient cohort. We further
computed the AUC scores per feature (blood tests, medical history or ECG measurements) across different first
level ICD categories (Figure 2). Blood tests perform significantly better than medical history and ECG as
classifiers in most of the categories (Wilcoxon rank-sum test, corrected for multiple hypotheses with FDR
<0.01), with a notable performance increase in diseases of the blood and of the digestive system. Interestingly,
we find that blood tests perform better in mental disorders than medical history. Indeed, the majority of the
patients discharged with mental disorders in our cohorts had no mention of mental disorder in their medical
history (69% and 82% in the USA and ISR datasets, respectively). Medical history performed

better for neoplasms in the USA dataset, while ECG had equivalent performance to blood tests for infectious and
parasitic diseases and diseases of the respiratory system.

Figure 2 AUC scores for ICD level 1 categories. AUC scores using only the blood test features (red circles), medical history (blue squares), ECG
measurements (black diamonds) and all features (dashed green line) are displayed for the USA (A) and ISR (B) datasets across ICD level 1
categories: Infectious And Parasitic Diseases (A), Neoplasms (B), Endocrine, Nutritional And Metabolic Diseases, And Immunity Disorders (C),
Diseases Of The Blood And Blood-Forming Organs (D), Mental Disorders (E), Diseases Of The Nervous System And Sense Organs (F), Diseases Of
The Circulatory System (G), Diseases Of The Respiratory System (H), Diseases Of The Digestive System (I), Diseases Of The Genitourinary System
(J), Diseases Of The Skin And Subcutaneous Tissue (K), Diseases Of The Musculoskeletal System And Connective Tissue (L), Supplementary
Classification Of Factors Influencing Health Status And Contact With Health Services (M) and Classification Of Procedures (N). AUC, area under the
curve; ECG, electrocardiography; ICD, International Classification of Diseases.

To ensure that our method is not limited to detecting only chronic patients, which we defined as ones for whom
the discharge diagnosis appears also in their medical history (including previous hospitalizations), we verified
that we achieve a similar performance when applying our method to a set of 9,990 USA or 5,838 ISR
hospitalizations which include only non-chronic cases (Table 1). As expected, blood tests perform significantly
better than medical history in this set for all first level ICD categories (FDR <0.01).

Prospective validation
Next, we applied our method in a scenario that mimics the admission of new patients. We split the
hospitalizations into training and validation subsets, based on admission date when available (Methods). In the
following, we report the USA dataset performance first, with the ISR performance given in parentheses. As we
focus on predicting at least one primary diagnosis per patient, we measure our performance by computing the
percentage of patients with at least one correct prediction (that is, precision). While the top predicted
discharge code was correct for 18% (17%) of the patients, the top ten predictions contained a correct discharge
code for 67% (64%) of the patients. We note that the task here is more challenging than the previous
'cross-validation' one since the latter evaluates a specific set of options for ICD codes (those in the test
set) while here we evaluate all possible codes as we have no prior information for a new patient. One reason for
the lower precision lies in the fact that discharge diagnosis codes include also 'secondary' discharge codes,
ranked lower than the top two discharge codes for a patient. Since the distinction between primary (top
discharge codes and physician release codes) and secondary (additional discharge codes) is done manually and is
subjective, we also checked the prediction precision relative to the complete set of discharge codes, including
both primary codes and secondary codes (the latter not appearing in the training set) to find that our top
prediction was correct for 32% of the patients (both datasets) with 84% (89%) of

patients with at least one correct hit within the top ten predictions (Figure 3A). For example, we predicted
diabetes mellitus for a patient who indeed had that condition; however it was not marked as the primary
diagnosis. For comparison, we tested the precision against 1,000 sets of randomly shuffled associations between
diagnoses and patients (maintaining the distribution of the ICDs and the number of diagnoses per patient),
verifying that none of the shuffled associations obtained comparable precision (P <0.001).

Figure 3 Prediction precision for recent hospitalizations. The prediction precision for primary discharge codes (black) and all discharge codes
(blue) for the USA data (circles) and the ISR data (crosses) as a function of the number of top ranked predictions per patient. Precision is
measured for ICD level 1 (A), level 2 (B) and level 3 (C). ICD, International Classification of Diseases; ISR, Israel; USA, United States.

As a physician can likely also benefit from a more coarse classification, we checked the precision in predicting
the second and first level of the ICD (Figures 3B and 3C, respectively). The top prediction was accurate for 47%
(41%) of the patients when considering second level ICD codes and 70% (66%) when considering the first level
codes (including also non-primary codes). Similarly, 93% (95%) of the patients had the correct second level ICD
code in their top ten predictions (and 99% (98%) the correct first level code). Manually examining the
hospitalizations for which we failed to predict the correct second level of the ICD (spanning 7% (5%) of the
patients), we found that several of our predictions, while not an exact match, had a known association to the
correct diagnoses. For example, a patient with acute bronchitis was predicted to have chronic bronchitis (noting
that this patient had no such chronic condition mentioned in his medical history). Other examples include
prediction of heart failure for two (ISR) patients (a man and a woman) diagnosed with acute myocardial
infarction, and the latter often supersedes the former [20] and prediction of episodic mood disorders for a
(USA) patient with depressive disorder.
Finally, we analyzed the prediction performance over the different diagnoses. As expected, we found a high
correlation (Pearson correlation, rho = 0.9, P <2e-164 (0.85, P <e-131)) between the number of patients in the
training set with a certain ICD code and the success rate in predicting it among the top ten predictions [See
Additional file 4: Figure S3]. We identified 33 (17) ICD codes that were significantly correctly predicted in
each EHR dataset (FDR <0.05, Methods and Additional file 5: Table S1). Six ICD diagnosis codes were common to
both datasets: diabetes mellitus, pneumonia, bronchitis, diseases of white blood cells, kidney failure and
disorders of urethra and urinary tract, while an additional nine (six USA and three ISR codes) were under the
same second level ICD (metabolic disorders, diseases of the blood, hypertensive disease and chronic bronchitis).
Additionally, some enriched ICDs belonged to similar categories, such as heart related conditions (for example,
cardiomyopathy, cardiac dysrhythmias and heart failure in the USA dataset versus chronic ischemic heart disease
in

the ISR dataset). In contrast, ICD codes that had no successful prediction even when allowing for a match at the first level of the ICD generally suffered from low representation in the training data [see Additional file 3: Figure S3] and were typically accompanied by diagnoses with higher success rates. One such example is gastrointestinal hemorrhage, appearing in nine patients in our validation set (ISR dataset). This diagnosis was accompanied by other diagnoses in all these cases and, indeed, for seven of these patients we managed to predict all their additional diagnoses.
Figure 2 and Additional file 6: Figure S4 display the AUC scores and prediction precision across different first level ICD categories for the cross and prospective validations, respectively.

Discussion
We used patient cohorts from two different hospitals. However, we trained and provided predictions for each dataset independently. This was done for three reasons: (i) combining the two datasets ignores information available in only one dataset (for example ECG data or blood tests that appear in only one set); (ii) the ICD codes, primarily used for billing purposes, are often biased due to the health system used in each country; and (iii) different sources of medical history (that is, outpatient versus inpatient facilities) display lower agreement between patients from different health systems. Indeed, we observed that merging the two datasets degraded the performance to that of the worse performing dataset (ISR, see Table 1).
In order to assess the potential benefits to a clinician, we looked at predictions that could be considered surprising with regard to the admission diagnoses (available in the ISR dataset). We found multiple examples in which the admission diagnosis contained only general symptoms and our method correctly predicted the true discharge diagnosis. We describe here two such examples: (i) a female patient who was admitted with an unspecified anemia (ICD code 285.9) was correctly predicted to have cardiac dysrhythmias (427). Irregular heartbeat is one of the many symptoms of anemia but not a predictive one [21]; and (ii) a female patient was admitted with fever (780.6) and was correctly predicted to have acute myocardial infarction (410). Notably, fever is not a common symptom of acute myocardial infarction [22].
Finally, analyzing our performance, we note that while our method provided high quality predictions in cross validation, it is likely to display lower performance in predicting conditions that evolve substantially over time and conditions that are rare in the population. We observe that high level ICD categories that achieve relatively high precision are typically abundant in our data (above 6% (USA) and 4% (ISR) of the patients), including diseases related to endocrine, circulatory, respiratory and genitourinary systems (Figure 3). In contrast, lower precision is obtained for high level ICD categories which generally have a low representation in our data and are typically complex (for example, neoplasms). Larger and richer EHR data could enhance our prediction precision in these cases also. Specifically, a very large corpus of patients might introduce more of the currently rare cases, and having a larger temporal range within the corpus would allow for richer representations of the medical history. This assumption is strengthened by the fact that the USA dataset is obtained from a tertiary care facility and, thus, harbors more ‘hard’ cases. Yet this dataset obtained better performance due to a larger corpus of patients and more information on each patient than the ISR dataset, which is from a primary and secondary care facility. One reason may be that only a small subset of the blood tests was available for each patient in the ISR dataset, limiting the computation of similarity between patients and the ability to account for rarer test types. A fuller set of tests allows the computation of more accurate patient similarities.

Conclusions
Our results demonstrate that a large corpus of patient data can be exploited to predict the likely discharge diagnoses for a new patient. We introduced a general method for performing such an inference using information from past hospitalizations. Our method computes patient similarity measures and requires a minimal set of such measures, including medical history, blood tests performed upon admission and demographics. It is readily extensible to use the results of other admission information, such as ECG tests, as shown for the USA dataset and potentially, in the future, medical images and patient genomic information (for example, gene expression measurements or single nucleotide polymorphism data).
Our method is a stepping stone for the full exploitation of large population-based data sets. We recognize that the introduction of new decision support modalities requires careful analysis of physician and health-care system workflows and introduction of the information at the most pertinent decision points. However, it is clear that the emerging infrastructure of electronic patient information will provide not only better information about quality of care and guidance for policy but will be able to improve the care of the individual, benefitting from the aggregated information of previous patients.

Additional files
Additional file 1: Table S2. ICD codes enriched in extreme valued blood tests.

Additional file 2: Figure S1. Networks of patient similarities. The similarity between patients based on medical history (A), blood test (B) and ECG (C) data.
Additional file 3: Figure S2. The performance of individual features in cross validation. Displayed are individual feature AUC scores for the USA data (red) and ISR data (blue). The abbreviated feature combinations include: ICD hierarchy-based similarity (I1), ICD empirical similarity (I2), Age (A), Gender (G), blood tests - average difference (BT1), blood tests - difference between extremes (BT2), ECG tests - average difference (ECG1), ECG tests - difference between extremes (ECG2), medical history (MH1) and medical history – empirical ICD similarity based (MH2).
Additional file 4: Figure S3. The precision in predicting ICD codes as a function of the number of patients in the training set for the USA (A) and ISR (B) datasets.
Additional file 5: Table S1. Easy to predict ICD codes. All p-values are FDR corrected.
Additional file 6: Figure S4. Prediction precision for ICD level 1 categories. Precision values (blue) and relative prevalence (red) are displayed for the USA (A) and ISR (B) datasets across ICD level 1 categories: Infectious And Parasitic Diseases (A), Neoplasms (B), Endocrine, Nutritional And Metabolic Diseases, And Immunity Disorders (C), Diseases Of The Blood And Blood-Forming Organs (D), Mental Disorders (E), Diseases Of The Nervous System And Sense Organs (F), Diseases Of The Circulatory System (G), Diseases Of The Respiratory System (H), Diseases Of The Digestive System (I), Diseases Of The Genitourinary System (J), Diseases Of The Skin And Subcutaneous Tissue (K), Diseases Of The Musculoskeletal System And Connective Tissue (L), Supplementary Classification Of Factors Influencing Health Status And Contact With Health Services (M) and Classification Of Procedures (N).

Abbreviations
AUC: Area under the curve; ECG: Electrocardiography; EHR: Electronic health records; FDR: False discovery rate; HCUP: Healthcare Cost and Utilization Project; ICD: International Classification of Diseases.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
AG and RS conceived the paper; AG performed the analysis and wrote the draft; GS obtained the data, and aided in pre-processing; GS, ER, RA and RS participated in the writing of the paper. All authors read and approved the final manuscript.

Acknowledgments
AG was funded by the NIH grants LM05652 and GM102365. RS was supported by a research grant from the Israel Science Foundation (grant no. 241/11).

Author details
1Departments of Bioengineering & Genetics, Stanford University, 318 Campus Drive, Stanford 94305, USA. 2Sackler School of Medicine, Tel Aviv University, Klausner St., Tel Aviv 69978, Israel. 3Department of Internal Medicine "B", Beilinson Hospital, Rabin Medical Center, 39 Jabotinski St., Petah-Tikva 49100, Israel. 4Blavatnik School of Computer Science, Tel-Aviv University, Klausner St., Tel Aviv 69978, Israel.

Received: 12 April 2013 Accepted: 24 July 2013
Published: 2 September 2013

References
1. Warner HR, Haug P, Bouhaddou O, Lincoln M, Warner H Jr, Sorenson D, Williamson JW, Fan C: ILIAD as an expert consultant to teach differential diagnosis. In Proceedings of the Annual Symposium on Computer Application in Medical Care. 4720 Montgomery Lane, Suite 500 Bethesda, Maryland 20814: American Medical Informatics Association; 1988:371–376.
2. Garg AX, Adhikari NKJ, McDonald H, Rosas-Arellano MP, Devereaux PJ, Beyene J, Sam J, Haynes RB: Effects of computerized clinical decision support systems on practitioner performance and patient outcomes. JAMA 2005, 293:1223–1238.
3. Kawamoto K, Houlihan CA, Balas EA, Lobach DF: Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005, 330:765.
4. Wright A, Sittig DF: A four-phase model of the evolution of clinical decision support architectures. Int J Med Inform 2008, 77:641–649.
5. Hunt DL, Haynes RB, Hanna SE, Smith K: Effects of computer-based clinical decision support systems on physician performance and patient outcomes. JAMA 1998, 280:1339–1346.
6. Spiegelhalter DJ, Knill-Jones RP: Statistical and knowledge-based approaches to clinical decision-support systems, with an application in gastroenterology. J R Stat Soc Ser A (General) 1984, 147:35–77.
7. Wang SJ, Middleton B, Prosser LA, Bardon CG, Spurr CD, Carchidi PJ, Kittler AF, Goldszer RC, Fairchild DG, Sussman AJ: A cost-benefit analysis of electronic medical records in primary care. Am J Med 2003, 114:397–403.
8. Kaushal R, Bates DW: Information technology and medication safety: what is the benefit? Qual Saf Health Care 2002, 11:261–265.
9. Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, Nordyke RJ: Review: use of electronic medical records for health outcomes research: a literature review. Med Care Res Rev 2009, 66:611–638.
10. Marcos M, Maldonado JA, Martinez-Salvador B, Bosca D, Robles M: Interoperability of clinical decision-support systems and electronic health records using archetypes: a case study in clinical trial eligibility. J Biomed Inform 2013, 46:676–689.
11. Romano MJ, Stafford RS: Electronic health records and clinical decision support systems: impact on national ambulatory care quality. Arch Intern Med 2011, 171:897–903.
12. Wu J, Roy J, Stewart WF: Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Med Care 2010, 48:S106–113.
13. Wang F, Hu J, Sun J: Medical prognosis based on patient similarity and expert feedback. In Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE; 2012:1799–1802.
14. Iezzoni LI: Assessing quality using administrative data. Ann Intern Med 1997, 127:666–674.
15. Schneider RJ, Seibert K, Passe S, Little C, Gee T, Lee III BJ, Mike V, Young CW: Prognostic significance of serum lactate dehydrogenase in malignant lymphoma. Cancer 1980, 46:139–143.
16. Katus HA, Remppis A, Neumann FJ, Scheffold T, Diederich KW, Vinar G, Noe A, Matern G, Kuebler W: Diagnostic efficiency of troponin T measurements in acute myocardial infarction. Circulation 1991, 83:902–912.
17. Popescu M, Khalilia M: Improving disease prediction using ICD-9 ontological features. IEEE; 2011:1805–1809.
18. Jaccard P: Nouvelles recherches sur la distribution florale. Bul Soc Vaudoise Sci Nat 1908, 44:223–270.
19. Gottlieb A, Stein GY, Ruppin E, Sharan R: PREDICT: a method for inferring novel drug indications with application to personalized medicine. Mol Syst Biol 2011, 7:496.
20. Dargie H: Heart failure post-myocardial infarction: a review of the issues. Heart 2005, 91(Suppl 2):ii3–ii6.
21. Smith DL: Anemia in the elderly. Iron Disorders Institute Guide to Anemia 2009, 9:96–103.
22. Kacprzak M, Kidawa M, Zielinska M: Fever in myocardial infarction: is it still common, is it still predictive? Cardiol J 2012, 19:369–373.

doi:10.1186/1741-7015-11-194
Cite this article as: Gottlieb et al.: A method for inferring medical diagnoses from patient similarities. BMC Medicine 2013 11:194.
