Deep Semi-Supervised Embedded Clustering (DSEC)
for Stratification of Heart Failure Patients
Oliver Carr¹, Stojan Jovanovic¹, Luca Albergante¹, Fernando Andreotti¹, Robert Dürichen¹, Nadia Lipunova¹,
Janie Baxter¹, Rabia Khan¹, Benjamin Irving¹
arXiv:2012.13233v3 [cs.LG] 17 Jan 2021
Abstract
Determining phenotypes of diseases can have considerable benefits for in-hospital patient care and
drug development. The structure of high-dimensional data sets such as electronic health records
is often represented through an embedding of the
data, with clustering methods used to group data
of similar structure. If subgroups are known to exist within data, supervised methods may be used
to influence the clusters discovered. We propose
to extend deep embedded clustering to a semi-supervised deep embedded clustering algorithm
to stratify subgroups through known labels in the
data. In this work we apply deep semi-supervised
embedded clustering to determine data-driven patient subgroups of heart failure from the electronic
health records of 4,487 heart failure and control
patients. We find clinically relevant clusters from
an embedded space derived from heterogeneous
data. The proposed algorithm can potentially find
new undiagnosed subgroups of patients that have
different outcomes, and, therefore, lead to improved treatments.
1. Introduction
Patient populations, such as those with heart failure, are often heterogeneous in presentation and response to therapy. Heart failure
is a complex disease typically classed into two categories
by clinicians: one with reduced ejection fraction (HFrEF) and
one with preserved ejection fraction (HFpEF) (Inamdar &
Inamdar, 2016). Subgroups of heart failure patients can also
be defined by additional measures that have been known to
reflect poor outcome (e.g. serum urea, serum creatinine) or
co-morbidities (e.g. diabetes) (Inamdar & Inamdar, 2016).
Better characterisation of subgroups may allow for adjusted
treatments in the different patient cohorts.
¹ Sensyne Health plc, Oxford, United Kingdom. Correspondence to: Benjamin Irving <ben.irving@sensynehealth.com>.
Proceedings of the 37th International Conference on Machine
Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by
the author(s).
Embeddings, or low-dimensional representations of data, are
used to extract the most relevant features which describe
the structure of the data. Low-dimensional representation
approaches are often based on natural language processing methods (Zhang et al., 2018; Zhu et al., 2016) or autoencoders to produce latent space representations (Wei &
Eickhoff, 2018). Both Denaxas et al. (2018) and Choi et
al. (2016) have investigated embeddings using heart failure
patient data.
Electronic health record (EHR) data is heterogeneous, varies
longitudinally, is sparse, and is not well suited to the application of traditional clustering analysis approaches without
careful pre-processing (Beaulieu-Jones et al., 2018; Donders
et al., 2006). Clustering approaches can be applied directly
to the raw features or to an embedded representation of the
input. Principal component analysis (PCA) (Wold et al.,
1987) is a common approach to finding a lower dimensional
embedding, but unsupervised or semi-supervised deep architectures such as autoencoders allow more complex representations of the data to be learnt (Shickel et al., 2018;
Beaulieu-Jones & Greene, 2016). Continuous
measures from EHRs have been used for patient subgrouping (Liu et al., 2019; Choi et al., 2019). These measures
have been used in combination with medical codes, medication information, procedural codes, or text (Li et al., 2019;
Miotto et al., 2016; Lasko et al., 2013; Beaulieu-Jones &
Greene, 2016).
In this study we explore an extension of deep embedded
clustering approaches to EHR patient cohorts in both an unsupervised and a semi-supervised fashion. This work builds
on previous approaches such as Deep Patient (Miotto et al.,
2016) and Beaulieu-Jones & Greene (2016), which use standard
autoencoders. Previous work has also focused on semi-supervised clustering, such as Enguehard et al. (2019), who
use convolutional neural networks to cluster partially labelled images, and Ren et al. (2019), who use pairwise constraints to cluster data, whereas in this work we focus on
transfer learning of a network. The key contributions of this
paper are i) the application of DEC to patient records, ii)
the extension of DEC as a novel semi-supervised approach
that allows the handling of heterogeneous input measures,
and iii) a demonstration that clinically relevant subgroups within
a heart failure cohort can be determined using data-driven
approaches.
2. Methods
2.1. Deep Embedded Clustering (DEC)
DEC (Xie et al., 2016) is a powerful approach that combines
an autoencoder with a clustering loss to learn a representation of the data that aims to also produce separable clusters
to improve the analysis of the embedded space.
The DEC method transforms the data space $X^n$ using a
non-linear mapping $f_\theta : X^n \to Z^m$, where $n = n_{\mathrm{feat}}$ is
the number of features used and $Z^m$ is the embedded space,
which has a lower dimension than $X^n$ ($m < n$). $\theta$ is a
set of learnable parameters and $f_\theta$ is parametrized as a deep
neural network.
Initially, a multi-layer deep autoencoder is implemented as a
series of stacked de-noising autoencoders, with a bottleneck
layer acting as the embedded space. Rectified linear unit
(ReLU) activation functions (Nair & Hinton, 2010) are used
at each layer except the first and last layers. The multi-layer
deep autoencoder is then trained to optimize a reconstruction
loss, by minimizing the least squares error between the input
and output of the autoencoder. Once the autoencoder has
been pre-trained, the decoder is cut off, leaving the encoder
to transform the input, $X^n$, to the embedded space, $Z^m$.
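As a sketch of the pre-training objective described above, the least-squares reconstruction loss can be written as follows. The decoder symbol $g_\phi$, the corrupted input $\tilde{x}_i$ and the sample count $N$ are notation assumed here for illustration and do not appear in the original text.

```latex
\mathcal{L}_{\mathrm{rec}}(\theta, \phi)
  = \frac{1}{N} \sum_{i=1}^{N}
    \left\lVert x_i - g_\phi\!\left(f_\theta(\tilde{x}_i)\right) \right\rVert_2^2
```

After pre-training, only $f_\theta$ is retained, which corresponds to cutting off the decoder.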
Further, a clustering step is then added to the network after
the embedded space, which uses k-means clustering in the
embedded space, $Z^m$. For this purpose, the network is fine-tuned
using the Kullback-Leibler (KL) divergence as the loss
function for the clustering step. An iterative approach is
used to assign points in the embedded space, $Z^m$, to the
$k$ cluster centroids, $\{\mu_j\}_{j=1}^{k}$, and to update the non-linear
mapping, $f_\theta$, of $X^n$ to $Z^m$. This step is repeated until a
convergence criterion is met; full details of the method are given
in Xie et al. (2016).
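The clustering step can be illustrated with the soft assignment, target distribution and KL loss defined in Xie et al. (2016). The sketch below is a pure-Python illustration on made-up three-dimensional points; the function names and toy data are assumptions for this example, and it is not the authors' implementation (which would compute gradients of this loss with a deep learning framework).

```python
import math

def soft_assignments(Z, centroids, alpha=1.0):
    """Student's t similarity q_ij between each embedded point and each centroid."""
    Q = []
    for z in Z:
        weights = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        Q.append([w / total for w in weights])
    return Q

def target_distribution(Q):
    """Sharpened targets p_ij: square q_ij, divide by the soft cluster size, renormalise."""
    k = len(Q[0])
    soft_sizes = [sum(row[j] for row in Q) for j in range(k)]
    P = []
    for row in Q:
        unnormalised = [row[j] ** 2 / soft_sizes[j] for j in range(k)]
        total = sum(unnormalised)
        P.append([u / total for u in unnormalised])
    return P

def kl_divergence(P, Q):
    """Clustering loss KL(P || Q), summed over points and clusters."""
    return sum(
        p * math.log(p / q)
        for p_row, q_row in zip(P, Q)
        for p, q in zip(p_row, q_row)
        if p > 0
    )

# Toy embedded points (hypothetical) forming two well-separated groups.
Z = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [2.0, 2.1, 1.9], [2.2, 1.9, 2.0]]
centroids = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]

Q = soft_assignments(Z, centroids)
P = target_distribution(Q)
loss = kl_divergence(P, Q)
```

In DEC the targets P are held fixed for a number of updates while the encoder and the centroids are adjusted to reduce this loss, then P is recomputed.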
2.2. Deep Semi-Supervised Embedded Clustering
(DSEC)
While autoencoders are an effective approach for learning
a lower dimensional embedding, a key challenge of unsupervised approaches to heterogeneous data is that, while the
embedding may be representative of the inputs, the representation might not accurately reflect the features of interest
for patient or disease stratification. This can be partially
resolved by applying a supervised model to the embedding
(Beaulieu-Jones & Greene, 2016) or with transfer learning
of a pre-trained network (Han et al., 2019). We propose
modifying the latent representation using known patient subgroups
by transfer learning of the encoder and fine-tuning
of layers. This adapts the embedding to the problem of
interest.
We propose to train the DSEC model in three sequential
steps: training an autoencoder, updating the weights of the
encoder with a classification task, and updating all the layers
of the encoder with a clustering loss.
In our approach, a de-noising autoencoder, which partially corrupts the
input, $X^n$, before reconstructing the original data (Vincent et al., 2010), is used to determine the embedded space. The input data $X^n$, $X \in \mathbb{R}^k$, is corrupted
to $\tilde{X}^n$ through a stochastic mapping $\tilde{X} \sim q_D(\tilde{X} \mid X)$ by the
addition of a Gaussian noise layer to the input of the de-noising autoencoder. The reconstruction error is measured
by the mean absolute error (MAE) loss.
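A minimal sketch of the corruption and loss described above is given below. The noise standard deviation is not stated in the text, so `sigma=0.1` is a placeholder assumption, as are the function names and the example vector.

```python
import random

def corrupt(x, sigma, rng):
    """Additive Gaussian corruption: one draw from q_D(x_tilde | x)."""
    return [value + rng.gauss(0.0, sigma) for value in x]

def mean_absolute_error(x, x_hat):
    """MAE between the clean input and a reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

rng = random.Random(0)
x = [0.5, -1.2, 0.3, 0.0]                  # hypothetical standardised features
x_tilde = corrupt(x, sigma=0.1, rng=rng)   # sigma is an assumed value

# The autoencoder receives x_tilde but is scored against the clean x.
# An identity "reconstruction" of the corrupted input shows the scale of the noise.
noise_floor = mean_absolute_error(x, x_tilde)
```

The key point is that the loss compares the reconstruction with the uncorrupted input, so the network cannot minimise it by simply copying its input.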
The non-linear mapping, or encoder, is represented by two
fully connected dense layers, the first with 1000 nodes and
the second with 500 nodes, followed by an embedding layer with
three nodes. The numbers of nodes were chosen by adapting
the DEC architecture (Xie et al., 2016). Each layer uses
ReLU activation functions. The Adam optimizer (learning
rate = 0.01, β1 = 0.9, β2 = 0.999) is used to optimise the
weights of the network (Kingma & Ba, 2015).
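To make the encoder size concrete, the sketch below counts the weights and biases of the stated 1000-500-3 dense stack. It assumes an input of 13 features, which is the number of vital signs and laboratory measures named in Section 3.1; the text says the features "include" these, so the true input width may differ.

```python
N_FEATURES = 13                   # assumption: the 13 measures named in Section 3.1
ENCODER_WIDTHS = [1000, 500, 3]   # two dense layers plus the 3-node embedding

def dense_parameter_counts(n_inputs, widths):
    """Weights plus biases for each fully connected layer in sequence."""
    counts = []
    fan_in = n_inputs
    for width in widths:
        counts.append(fan_in * width + width)
        fan_in = width
    return counts

counts = dense_parameter_counts(N_FEATURES, ENCODER_WIDTHS)
total_parameters = sum(counts)
```

Under this assumption almost all parameters sit in the second dense layer, which is also one of the layers updated during the transfer learning step.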
Once the de-noising autoencoder has been pre-trained, the
decoder is removed and a fully connected classification
layer with softmax activation is added to the encoder. The
de-noising corruption is removed from the architecture and
the weights of the first dense layer are fixed. Known labels
for each observation are used to update the weights of the
final dense layer based on the classification task. Transfer
learning updates the weights of the final dense layer and the
embedding layer to minimize the binary cross-entropy loss
function. This results in an updated mapping of the input X
to a new embedded space, $Z'$.
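The classification step can be sketched as a softmax over two outputs scored with cross-entropy. With two classes and one-hot labels this equals binary cross-entropy on the positive-class probability. The layer names in `TRAINABLE` are hypothetical labels for the layers described above, not identifiers from the authors' code.

```python
import math

# Which layers receive gradient updates during the transfer learning step.
TRAINABLE = {
    "dense_1": False,     # first dense layer: weights fixed
    "dense_2": True,      # final dense layer: updated
    "embedding": True,    # embedding layer: updated
    "classifier": True,   # newly added softmax layer
}

def softmax(logits):
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross-entropy of predicted probabilities against a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)

probs = softmax([2.0, 0.0])                  # made-up classifier logits
loss_if_class_0 = cross_entropy(probs, [1, 0])
loss_if_class_1 = cross_entropy(probs, [0, 1])
```

Because the embedding layer is trainable in this step, the gradient of the classification loss reshapes the embedded space around the known labels.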
A clustering loss (Kullback-Leibler divergence) is then used
to further update the entire encoder and latent space, updating
the non-linear mapping to $f_{\theta''} : X \to Z''$. The optimization
of the encoder and cluster centers is performed as in Xie
et al. (2016). The cluster centers, $\{\mu_j\}$, and the encoder parameters, $\theta$, are jointly optimized using the Adam optimizer
(learning rate = 0.01, β1 = 0.9, β2 = 0.999).
The number of epochs used for training was determined
from the loss on the validation set in order to avoid overfitting to the training set. For DEC, the autoencoder was
trained for 50 epochs and the clustering step for 200 epochs.
For DSEC, the autoencoder was trained for 50 epochs, the
semi-supervised transfer learning for 10 epochs, and the
clustering for 200 epochs. The models were then trained on
the entire training set before being applied to the test set.
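One way to pick an epoch count from validation loss is sketched below. Averaging the per-epoch validation loss over folds and taking the minimum is an assumption about the procedure, and the loss values are invented for illustration; only the epoch counts in `SCHEDULE` come from the text.

```python
# Epoch counts reported in the text for each training stage.
SCHEDULE = {
    "DEC": {"autoencoder": 50, "clustering": 200},
    "DSEC": {"autoencoder": 50, "transfer_learning": 10, "clustering": 200},
}

def select_epochs(fold_val_losses):
    """Return the 1-based epoch with the lowest mean validation loss across folds."""
    n_epochs = len(fold_val_losses[0])
    mean_loss = [
        sum(fold[epoch] for fold in fold_val_losses) / len(fold_val_losses)
        for epoch in range(n_epochs)
    ]
    best_index = min(range(n_epochs), key=mean_loss.__getitem__)
    return best_index + 1, mean_loss

# Invented validation losses for two folds over four epochs.
folds = [[0.90, 0.70, 0.60, 0.65], [0.80, 0.60, 0.62, 0.70]]
chosen_epoch, mean_loss = select_epochs(folds)
```

Once the epoch counts are fixed, the model is refitted on the full training set, as described above.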
2.3. Analysis of the Embedded Space
After a low-dimensional patient representation is created
using DSEC, we define the patient subgroups using standard
clustering approaches on the embedded space. We use agglomerative clustering, a type of hierarchical clustering method in
which all observations initially start as individual clusters
before pairs of clusters are successively merged into a new
cluster (Rokach & Maimon, 2005), with the Ward criterion
used as the linkage function (Ward & Hook, 1963).
Once the subgroups are found, we compare the dominant
ICD-10 diagnosis codes in each subgroup using enrichment
analysis. Enrichment analysis is performed using the Fisher
exact test (Fisher, 1922) for pairwise comparisons between
ICD-10 codes within clusters. For each ICD-10 code the
log odds ratio and corresponding p-value are found. The
p-values are corrected for multiple testing using Bonferroni
correction, with statistically significant odds ratios indicating the ICD-10 code is enriched in one of the clusters. For
hierarchical clustering, enrichment is performed pairwise
between the two clusters which are combined in each of the
agglomeration steps.
Figure 1. Hierarchical clustering of vital signs and laboratory measures shown as a PCA projection of the three-dimensional embedded
space from DSEC.
Figure 2. Receiver operating characteristic curves for classification
of heart failure and control patients. Curves are shown for PCA
and random forest, DEC and random forest, and DSEC.
3. Data
De-identified patient health record data was obtained from
a large UK general trust hospital. These patients underwent digital monitoring of bedside vital sign measurements
(Wong et al., 2017). From this dataset, patients with a primary diagnosis of heart failure (ICD-10 code I50*) were
selected. Each patient has a number of admissions that may
occur before, on or after a heart failure diagnosis. This
resulted in 2,791 patients with 27,143 admissions. We used
propensity matching (based on logistic regression and nearest neighbor matching (Ho et al., 2007)) on age and sex to
derive a control cohort of patients with any other admission
besides I50*. This resulted in a total of 5,498 patients with
39,908 admissions. Within each admission there may be
multiple measurements; in this analysis we take the mean
of each measurement within an admission. For heart failure
patients, the admission with the first heart failure diagnosis
was selected; for the control cohort, the admission with the
fewest missing values was selected.
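The per-admission averaging described above can be sketched as follows. The record layout `(admission_id, feature, value)`, the identifiers and the example values are hypothetical; the actual data schema is not described in the text.

```python
from collections import defaultdict

def admission_means(measurements):
    """Average repeated measurements of each feature within each admission."""
    accumulators = defaultdict(lambda: [0.0, 0])   # (admission, feature) -> [sum, count]
    for admission_id, feature, value in measurements:
        acc = accumulators[(admission_id, feature)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in accumulators.items()}

# Hypothetical records: two heart-rate readings in one admission, one in another.
records = [
    ("adm1", "heart_rate", 80.0),
    ("adm1", "heart_rate", 90.0),
    ("adm1", "urea", 7.0),
    ("adm2", "heart_rate", 70.0),
]
means = admission_means(records)
```

Features that were never measured in an admission simply have no entry, which is where imputation becomes necessary.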
3.1. Data Pre-processing
Missing values are imputed using Bidirectional Recurrent Imputation for Time Series (BRITS), which combines an imputation loss with the
loss of a prediction or classification task (Cao et al., 2018)
and which we found to be most effective on previous cohorts.
Each admission contains laboratory work and vital sign
measurements. Vital signs and laboratory measures were
selected if each measure was present in more than 60% of
cases. Features used in this work include systolic blood
pressure, diastolic blood pressure, heart rate, oxygen saturation (SpO2), temperature, alanine aminotransferase (ALT),
creatinine, c-reactive protein (CRP), platelets, potassium,
sodium, urea and white blood cells.
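The feature selection rule (keep a measure only if it is present in more than 60% of cases) can be sketched as below. The row layout, the use of `None` for a missing value and the toy data are assumptions for illustration.

```python
def select_features(rows, features, min_fraction=0.6):
    """Keep features observed in strictly more than min_fraction of rows."""
    kept = []
    for feature in features:
        present = sum(1 for row in rows if row.get(feature) is not None)
        if present / len(rows) > min_fraction:
            kept.append(feature)
    return kept

# Five hypothetical admissions; None marks a missing measurement.
rows = [
    {"heart_rate": 80, "crp": 12.0, "urea": 6.1},
    {"heart_rate": 75, "crp": None, "urea": 7.4},
    {"heart_rate": 92, "crp": 30.5, "urea": None},
    {"heart_rate": 68, "crp": None, "urea": 5.9},
    {"heart_rate": 88, "crp": 8.2, "urea": 9.0},
]
kept = select_features(rows, ["heart_rate", "crp", "urea"])
```

Here `crp` is present in exactly 60% of rows and is therefore dropped, since the stated rule requires more than 60%.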
Cases were then excluded if fewer than 60% of these features
were present for a particular admission. This resulted in a
reduction in patients to 4,497 (2,298 heart failure and 2,199
control). A test set of 25% was removed from the dataset in
a stratified way and was held out of all training procedures.
The remaining 75% of the data was used in 5-fold cross
validation in order to optimize the network parameters and
ensure the model was not overfitting.
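A stratified 25% hold-out on the stated cohort sizes can be sketched as follows. The seed, the rounding rule and the function name are assumptions; the authors' splitting code is not described.

```python
import random

def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Split indices into train and test while preserving the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Cohort sizes from the text: 2,298 heart failure (1) and 2,199 control (0).
labels = [1] * 2298 + [0] * 2199
train_idx, test_idx = stratified_holdout(labels)

test_hf_fraction = sum(labels[i] for i in test_idx) / len(test_idx)
overall_hf_fraction = sum(labels) / len(labels)
```

The 5-fold cross-validation is then run only on `train_idx`, so the held-out indices never influence the choice of network parameters.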
4. Results and Discussion
The performance of the three approaches in distinguishing heart failure
from non-heart failure cases was assessed
using the area under the ROC curve, as shown in Figure
2. ROC curves are obtained by training a random forest on
the PCA and DEC embedded spaces, whereas for DSEC
the ROC curve is obtained from the semi-supervised classification step. DSEC obtains an area under the ROC curve
of 0.84, considerably outperforming PCA (0.66) and DEC
(0.73).
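For readers less familiar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below computes it by direct pair counting on invented scores; it does not reproduce the reported values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented classifier scores for three heart failure (1) and three control (0) cases.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy case eight of the nine positive-negative pairs are ordered correctly, giving an AUC of 8/9.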
Figure 1 shows a hierarchical clustering of the learnt DSEC
embedding, where we iteratively combine the groups into
larger subgroups in order to investigate whether different comorbidities exist within different spaces of the embedding.
Table 1 shows cluster enrichment for the hierarchical splits.
The first split in the hierarchical clustering is between heart
failure (group 2) and controls (group 1). Subgroups of heart
failure can be identified, including dilated cardiomyopathy,
renal failure, and aortocoronary bypass grafts in a heart failure subgroup (group 2.1). Group 2.1 can be further split into
subgroups associated with left ventricular failure (2.1.1)
and with ascites and hyperkalaemia (2.1.2). Group 2.2 can
also be divided into patients associated with secondary
pulmonary hypertension, atrial fibrillation and flutter, and
a history of anticoagulant use (2.2.1). This demonstrates
that the method is capable of determining clinically relevant (Maisel & Stevenson, 2003; Dickhout et al., 2011)
subgroups from vital signs and laboratory measures. Further analysis of the subgroups and a comparison between
PCA, DEC, and DSEC, showing the superior performance
of DSEC, are provided in the supplementary materials.
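The enrichment comparison between a pair of clusters can be sketched as a two-sided Fisher exact test on a 2x2 table, followed by Bonferroni correction across codes. This is an illustrative standard-library implementation with made-up counts; the natural logarithm for the odds ratio and the helper names are assumptions, and zero cells are not handled.

```python
import math

def table_probability(a, row1, row2, col1):
    """Hypergeometric probability of a 2x2 table with top-left cell a and fixed margins."""
    return math.comb(row1, a) * math.comb(row2, col1 - a) / math.comb(row1 + row2, col1)

def fisher_exact_two_sided(a, b, c, d):
    """Table [[a, b], [c, d]]: rows are the two clusters, columns are code present / absent."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, row1, row2, col1)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1, row2, col1)
        if p <= p_observed * (1.0 + 1e-9):   # tables at least as extreme as observed
            p_value += p
    return min(1.0, p_value)

def log_odds_ratio(a, b, c, d):
    """Natural log of the odds ratio (assumes no zero cells)."""
    return math.log((a * d) / (b * c))

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]

# Made-up counts: the code appears in 3 of 4 patients in one cluster and 1 of 4 in the other.
p_single = fisher_exact_two_sided(3, 1, 1, 3)
lor = log_odds_ratio(3, 1, 1, 3)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

A positive log odds ratio with a significant adjusted p-value indicates the code is enriched in the first cluster of the pair.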
Table 1. Enriched ICD-10 codes between hierarchical splits in the
clustering (log odds ratio shown in brackets). Enrichment is performed between pairs of clusters (for example 1 vs 2, 2.1 vs 2.2,
and 1.1.1 vs 1.1.2). Sub-hierarchies of cluster 1 (control) are not
shown.
Group
1
2
2.1
2.2
2.1.1
2.1.2
2.1.1.1
2.1.1.2
2.2.1
Enriched ICD-10 codes
E78.0 Pure hypercholesterolaemia (0.96)
I50.0 Congestive heart failure (1.69)
E87.7 Fluid overload (1.39)
I50.9 Heart failure (1.32)
N18.9 Chronic renal failure (1.22)
I34.0 Mitral (valve) insufficiency (1.17)
I42.0 Dilated cardiomyopathy (1.56)
N17.9 Acute renal failure (1.37)
N39.0 Urinary tract infection (1.34)
Z95.1 Aortocoronary bypass graft (1.10)
N18.9 Chronic renal failure (1.04)
I50.1 Left ventricular failure (2.35)
R18 Ascites (3.00)
E87.5 Hyperkalaemia (2.13)
I42.0 Dilated cardiomyopathy (1.55)
I50.0 Congestive heart failure (1.16)
N17.9 Acute renal failure (1.15)
E87.5 Hyperkalaemia (3.26)
Z51.5 Palliative care (2.69)
I27.2 Second. pulmonary hypertension (2.91)
N18.9 Chronic renal failure (1.45)
I48.9 Atrial fibrillation and flutter (1.36)
E87.7 Fluid overload (1.09)
Z92.1 Use of anticoagulants (0.95)
2.1.2
While this is a powerful extension of the standard autoencoder embedding, which has been previously applied, current
limitations are that we only consider a single admission
per patient and that the method cannot itself deal with missing values,
which are a common problem in EHRs.
We have shown our method outperforms other clustering
algorithms in determining subgroups among heart failure patients. We aim to extend this approach to handle multiple
admissions and to develop imputation-free methods of embedding to further improve phenotyping of heart failure and
other diseases from EHRs. This has the potential to allow
for adjusted treatments of the different patient subgroups.
5. Conclusions
In this paper we demonstrate the application of DSEC to
features derived from EHRs. We show our approaches can
distinguish heart failure and non-heart failure cases based on
laboratory measurements and vital signs. We illustrate that
optimizing the embedding on known subgroups allows us
to learn a more powerful representation and that subgroups
within the heart failure cohort show enrichment of certain
co-morbidities (ICD-10 codes).
Acknowledgments
This work uses data provided by patients and collected by
the NHS as part of their care and support. We believe using
patient data is vital to improve health and care for everyone
and would, thus, like to thank all those involved for their
contribution. The data were extracted, anonymised, and
supplied by the Trust in accordance with internal information governance review, NHS Trust information governance
approval, and General Data Protection Regulation (GDPR)
procedures outlined under the Strategic Research Agreement (SRA) and relative Data Sharing Agreements (DSAs)
signed by the Trust and Sensyne Health plc.
This research has been conducted using the Oxford University Hospitals NHS Foundation Trust Clinical Data Warehouse, which is supported by the NIHR Oxford Biomedical Research Centre and Oxford University Hospitals NHS
Foundation Trust. Special thanks to Kerrie Woods, Kinga
Varnai, Oliver Freeman, Hizni Salih, Zuzana Moysova, Professor Jim Davies and Steve Harris.
References
Beaulieu-Jones, B. K. and Greene, C. S. Semi-Supervised
Learning of the Electronic Health Record for Phenotype
Stratification. J Biomed Inform, 64:168–178, 2016.
Beaulieu-Jones, B. K., Lavage, D. R., Snyder, J. W., Moore,
J. H., Pendergrass, S. A., and Bauer, C. R. Characterizing and Managing Missing Structured Data in Electronic
Health Records: Data Analysis. JMIR Medical Informatics, 6(1):e11, 2018.
Cao, W., Zhou, H., Wang, D., Li, Y., Li, J., and Li, L.
BRITS: Bidirectional recurrent imputation for time series.
Advances in Neural Information Processing Systems,
(NeurIPS):6775–6785, 2018.
Choi, E., Schuetz, A., Stewart, W. F., and Sun, J. Medical concept representation learning from electronic
health records and its application on heart failure prediction. 2016. URL http://arxiv.org/abs/1602.03686.
Choi, E., Xu, Z., Li, Y., Dusenberry, M. W., Flores, G., Xue,
Y., and Dai, A. M. Graph Convolutional Transformer:
Learning the Graphical Structure of Electronic Health
Records. pp. 1–17, 2019. URL http://arxiv.org/abs/1906.04716.
Denaxas, S., Stenetorp, P., Riedel, S., Pikoula, M., Dobson,
R., and Hemingway, H. Application of Clinical Concept
Embeddings for Heart Failure Prediction in UK EHR
data. 2018. URL http://arxiv.org/abs/1811.11005.
Dickhout, J. G., Carlisle, R. E., and Austin, R. C. Interrelationship between cardiac hypertrophy, heart failure, and
chronic kidney disease: Endoplasmic reticulum stress as
a mediator of pathogenesis. Circulation Research, 108
(5):629–642, 2011.
Donders, A. R. T., van der Heijden, G. J., Stijnen, T., and
Moons, K. G. Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology,
59(10):1087–1091, 2006.
Enguehard, J., O’Halloran, P., and Gholipour, A. Semi-Supervised Learning With Deep Embedded Clustering
for Image Classification and Segmentation. IEEE Access,
7(1):11093–11104, 2019.
Fisher, R. A. On the Interpretation of χ² from Contingency
Tables, and the Calculation of P. Journal of the Royal
Statistical Society, 85(1):87, 1922.
Han, K., Vedaldi, A., and Zisserman, A. Learning to Discover Novel Visual Categories via Deep Transfer Clustering. 2019. URL http://arxiv.org/abs/1908.09884.
Ho, D. E., Imai, K., King, G., and Stuart, E. A. Matching as
nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis,
15(3):199–236, 2007.
Inamdar, A. and Inamdar, A. Heart Failure: Diagnosis, Management and Utilization. Journal of Clinical Medicine, 5
(7):62, 2016.
Kingma, D. P. and Ba, J. L. Adam: A method for stochastic
optimization. 3rd International Conference on Learning
Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–15, 2015.
Lasko, T. A., Denny, J. C., and Levy, M. A. Computational
phenotype discovery using unsupervised feature learning
over noisy, sparse, and irregular clinical data. PLoS One,
8(6):e66341, 2013.
Li, Y., Rao, S., Solares, J. R. A., Hassaine, A., Canoy, D.,
Zhu, Y., Rahimi, K., and Salimi-Khorshidi, G. BEHRT:
Transformer for Electronic Health Records. 2019. URL
http://arxiv.org/abs/1907.09538.
Liu, L., Li, H., Hu, Z., Shi, H., Wang, Z., Tang, J., and
Zhang, M. Learning Hierarchical Representations of
Electronic Health Records for Clinical Outcome Prediction. 2019. URL http://arxiv.org/abs/1903.08652.
Maisel, W. H. and Stevenson, L. W. Atrial fibrillation in
heart failure: Epidemiology, pathophysiology, and rationale for therapy. American Journal of Cardiology, 91(6):
2–8, 2003.
Miotto, R., Li, L., Kidd, B. A., and Dudley, J. T. Deep
Patient: An Unsupervised Representation to Predict the
Future of Patients from the Electronic Health Records.
Scientific Reports, 6:26094, 2016.
Nair, V. and Hinton, G. Rectified Linear Units Improve Restricted Boltzmann Machines. International Conference
on Machine Learning, 2010.
Ren, Y., Hu, K., Dai, X., Pan, L., Hoi, S. C., and Xu, Z.
Semi-supervised deep embedded clustering. Neurocomputing, 325:121–130, 2019.
Rokach, L. and Maimon, O. Clustering Methods. In Maimon, O. and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer US,
Boston, MA, 2005.
Shickel, B., Tighe, P. J., Bihorac, A., and Rashidi, P. Deep
EHR: A Survey of Recent Advances in Deep Learning
Techniques for Electronic Health Record (EHR) Analysis.
IEEE J Biomed Health Inform, 22(5):1589–1604, 2018.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. A. Stacked denoising autoencoders: Learning
Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning
Research, 11:3371–3408, 2010.
Ward, J. H. and Hook, M. E. Application of an Hierarchical
Grouping Procedure to a Problem of Grouping Profiles.
Educational and Psychological Measurement, 23(1):69–
81, 1963.
Wei, X. and Eickhoff, C. Embedding Electronic Health
Records for Clinical Information Retrieval. 2018. URL
http://arxiv.org/abs/1811.05402.
Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory
systems, 2(1-3):37–52, 1987.
Wong, D., Wu, N., and Watkinson, P. Quantitative metrics
for evaluating the phased roll-out of clinical information
systems. International Journal of Medical Informatics,
105:130–135, 2017.
Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep
embedding for clustering analysis. 33rd International
Conference on Machine Learning, ICML 2016, 1:740–749,
2016. URL http://arxiv.org/abs/1511.06335.
Zhang, J., Kowsari, K., Harrison, J. H., Lobo, J. M.,
and Barnes, L. E. Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record. IEEE Access, 6:65333–65346,
2018.
Zhu, Z., Yin, C., Qian, B., Cheng, Y., Wei, J., and Wang,
F. Measuring patient similarities via a deep architecture
with medical concept embedding. In 2016 IEEE 16th
International Conference on Data Mining (ICDM), pp.
749–758. IEEE, 2016.