This section provides an overview of common FL algorithms that have been shown to work well outside of the healthcare domain. Their performance was then evaluated on two machine learning tasks in the ICU, in-hospital mortality prediction and AKI prediction, using a dataset containing EHRs from multiple ICUs.
3.1 Overview of Common FL Algorithms
In general, FL involves each individual participant training a local model on their own dataset and then exchanging model parameters, e.g., weights and/or gradients, at some frequency. There is no exchange of data among different participants. The local model parameters are then aggregated to generate a global model. Aggregation can be conducted with or without the coordination of a central party. Different FL algorithms vary in how the aggregation steps or the local update steps are performed. Among them, FedAvg [50] is the most well known. FedAvg aims to optimize the following objective:
\[ \min _w f(w) = \sum _{k=1}^N p_k F_k(w), \tag{1} \]
where \(N\) is the number of participants, \(p_k\) is the weight of participant \(k\) with \(\sum _{k=1}^N p_k = 1\), and \(F_k(\cdot)\) is the local objective function of participant \(k\). \(p_k\) is usually proportional to the size of each participant's dataset.
At each communication round \(t\), a global model with weights \(w_t\) is sent to all \(K\) participants. Each participant \(k\) performs local training for \(E\) epochs, producing a new local model with weights \(w^k_{t+1}\). Each participant then sends their newly learned local model weights to a central server, where they are aggregated to obtain a new global model with updated weights \(w_{t+1}\) equal to the weighted average of all local models:
\[ w_{t+1} = \sum _{k=1}^K p_k w^k_{t+1}. \tag{2} \]
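To make the round structure concrete, the following is a minimal sketch of one FedAvg communication round in Python with NumPy. It assumes model weights are flat arrays; local_train is a hypothetical routine that runs \(E\) epochs of local training and returns updated weights, and all names are illustrative rather than taken from any particular implementation.

import numpy as np

def fedavg_round(global_weights, clients, local_train, epochs):
    """One FedAvg communication round (Equation (2))."""
    sizes = np.array([c["n_samples"] for c in clients], dtype=float)
    p = sizes / sizes.sum()  # p_k proportional to local dataset sizes
    local_weights = []
    for client in clients:
        # Each client starts from the current global weights w_t and
        # trains for E local epochs on its own data only.
        local_weights.append(local_train(global_weights, client["data"], epochs))
    # Server aggregation: w_{t+1} = sum_k p_k * w^k_{t+1}
    return sum(pk * wk for pk, wk in zip(p, local_weights))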
FedAvg performs well in the case of homogeneity, where all local datasets are independent and identically distributed (IID). In the presence of statistical heterogeneity, where data are not independent and identically distributed (non-IID) across participants, the global model might perform poorly or even fail to converge. A number of approaches have been proposed to counter this problem and improve the convergence rate and performance of FL on non-IID datasets.
FedProx [45] and SCAFFOLD [38] aim to improve the convergence rate of FedAvg by correcting client drift, a phenomenon where client heterogeneity causes the local updates to drift in each round of local training, resulting in slow convergence. FedProx introduces a proximal term that restricts the local updates to stay close to the latest global model. Instead of optimizing \(F_k(\cdot)\), each participant now optimizes the local objective
\[ h_k(w; w_t) = F_k(w) + \frac{\mu }{2} \Vert w - w_t\Vert ^2, \tag{3} \]
where \(\mu \ge 0\) controls the strength of the proximal term.
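As an illustration, the proximal term can be added to a standard PyTorch training loss in a few lines. This is a minimal sketch under the assumption that the latest global weights are available as a list of tensors; the function name and the value of mu are illustrative.

import torch

def fedprox_loss(local_loss, model, global_params, mu=0.01):
    """Local FedProx objective: F_k(w) + (mu/2) * ||w - w_t||^2."""
    prox = 0.0
    for w, w_t in zip(model.parameters(), global_params):
        # w_t is the corresponding parameter of the latest global model;
        # it is treated as a constant (detached from the graph).
        prox = prox + torch.sum((w - w_t.detach()) ** 2)
    return local_loss + 0.5 * mu * prox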
SCAFFOLD works by estimating the amount of drift caused by each client in each round and then adjusting their local updates accordingly. How much a client drifts is measured by the difference between the direction of the global update and the direction of the client's local update.
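A simplified sketch of SCAFFOLD's drift-corrected local update is shown below, assuming weights and control variates are flat NumPy arrays and grad_fn is a hypothetical local gradient oracle. It follows the corrected local step and one variant of the control-variate update from the SCAFFOLD paper; all names are illustrative.

import numpy as np

def scaffold_local_update(x, c_global, c_local, grad_fn, lr, num_steps):
    """Drift-corrected local training for one SCAFFOLD client.

    x: global weights w_t; c_global, c_local: server and client
    control variates; grad_fn(w) returns the local gradient at w.
    """
    y = x.copy()
    for _ in range(num_steps):
        # Corrected step: the term (c_global - c_local) counteracts
        # the client's drift away from the global update direction.
        y -= lr * (grad_fn(y) - c_local + c_global)
    # Update the client control variate from the realized local progress.
    c_local_new = c_local - c_global + (x - y) / (num_steps * lr)
    return y, c_local_new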
Instead of controlling local training, several FL algorithms tackle the slow convergence problem by modifying the server-side optimization. The global model update step specified in Equation (2) can be rewritten as
\[ w_{t+1} = w_t - \Delta w, \tag{4} \]
where \(\Delta w = \sum _{k=1}^K p_k\Delta w_k\) and \(\Delta w_k = w_t - w^k_{t+1}\) is the weight update from client \(k\). Equation (4) has the same form as a gradient-based optimization step in which \(\Delta w\) acts as a pseudo-gradient. Reddi et al. [62] formalized this as a server optimization step that optimizes the model from a global perspective, in addition to the client optimization step that aims to optimize the model from a local perspective. Their proposed FL algorithms FedAdagrad, FedAdam, and FedYogi employ adaptive server optimization by applying the adaptive optimization methods Adagrad, Adam, and Yogi, respectively, in the server optimization step. FedAvgM [30, 31] is another algorithm that modifies the server optimization step, adding momentum: the server computes \(w = w - v\), where \(v = \beta v + \Delta w\) and \(\beta\) is the momentum parameter.
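The following sketch illustrates how a server-side optimizer consumes the pseudo-gradient \(\Delta w\): the first function implements the FedAvgM momentum update \(w \leftarrow w - v\) with \(v = \beta v + \Delta w\), and the second applies an Adam-style update in the spirit of FedAdam. The hyperparameter values and the use of a state dictionary are purely illustrative.

import numpy as np

def server_step_fedavgm(w, delta_w, state, beta=0.9):
    """FedAvgM: momentum applied to the server pseudo-gradient."""
    v = beta * state.get("v", np.zeros_like(w)) + delta_w
    state["v"] = v
    return w - v

def server_step_fedadam(w, delta_w, state, lr=0.01,
                        beta1=0.9, beta2=0.99, eps=1e-3):
    """FedAdam-style adaptive server update on the pseudo-gradient."""
    m = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * delta_w
    v = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * delta_w**2
    state["m"], state["v"] = m, v
    return w - lr * m / (np.sqrt(v) + eps)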
3.2 Experiments
We evaluated the performance of well-known FL algorithms, FedAvg, FedProx, FedAvgM, FedAdagrad, FedAdam, and FedYogi, on two common and clinically crucial machine learning tasks in the ICU: in-hospital mortality prediction and AKI prediction. Their results were compared against those obtained from local learning, centralized learning, and two non-FL methods that also enable collaborative model training without data sharing, namely, IIL and cyclic institutional incremental learning (CIIL) [11, 67, 68]. In IIL, each party trains the model on its local dataset and then passes the model to the next party until all parties have trained it. CIIL repeats the same process over multiple rounds but fixes the number of training epochs carried out by each party in each round; a minimal sketch of both protocols follows this paragraph. The data for both tasks come from the eICU dataset [60], which collected EHRs from more than 200 hospitals and over 139,000 patients across the United States admitted to the ICU in 2014 and 2015. The dataset contains a wide range of data, including demographics, medications, diagnoses, procedures, timestamped vital signs, and lab test results.
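A minimal sketch of the two incremental protocols, reusing a hypothetical local_train routine as in the FedAvg sketch of Section 3.1; all names are illustrative.

def iil(weights, parties, local_train, epochs):
    """IIL: the model is trained once by each party in sequence."""
    for party in parties:
        weights = local_train(weights, party["data"], epochs)
    return weights  # the model produced by the last party

def ciil(weights, parties, local_train, epochs_per_round, num_rounds):
    """CIIL: the IIL pass is repeated over multiple rounds, with a
    fixed number of local epochs per party per round."""
    for _ in range(num_rounds):
        for party in parties:
            weights = local_train(weights, party["data"], epochs_per_round)
    return weights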
For each task, several hospitals in the eICU database were selected as participants. Each hospital's extracted data were split into a train, validation, and test set comprising 80%, 10%, and 10% of its population, respectively. In the local training setting, a separate model was trained for each hospital using only its own local data. Training ran for a fixed number of epochs, and, for each hospital, the model that performed best on the validation set in terms of Area under the ROC Curve (AUC-ROC) became the final model for evaluation, as sketched below.
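A sketch of the per-hospital split and validation-based model selection, using scikit-learn for the splits; train_one_epoch and evaluate_auc are hypothetical helpers, and the use of PyTorch state dictionaries is an illustrative choice.

import copy
from sklearn.model_selection import train_test_split

def split_hospital(X, y, seed=0):
    """80/10/10 train/validation/test split for one hospital."""
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=seed, stratify=y_tmp)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

def train_local(model, train, val, num_epochs):
    """Keep the epoch checkpoint with the best validation AUC-ROC."""
    best_auc, best_state = -1.0, None
    for _ in range(num_epochs):
        train_one_epoch(model, train)   # hypothetical helper
        auc = evaluate_auc(model, val)  # hypothetical helper
        if auc > best_auc:
            best_auc, best_state = auc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model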
In the centralized setting, the train, validation, and test sets from all participating hospitals were concatenated to produce a single train, validation, and test set. A single model was then trained on the combined training set. As in the local setting, training was conducted for several epochs and the best model was selected based on the AUC-ROC score on the combined validation set.
In the IIL and CIIL settings, since the absence of data sharing among participants meant there was no aggregated global validation set, the model produced by the last party to conduct training was selected as the final model.
In the FL setting, training was done over several communication rounds. Similar to the IIL and CIIL settings, since the central server that coordinated the training and carried out the global model aggregation process did not have access to a global validation set, the final model was the one obtained after all the communication rounds had finished.
Performance among the methods was compared based on global test scores. The metrics used are AUC-ROC and Area under the Precision-Recall Curve (AUC-PR). DeLong's method [15] and the logit method [9] were employed to compute 95% confidence intervals for AUC-ROC and AUC-PR, respectively.
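For reference, below is a sketch of one common form of the logit-method interval: the AUC-PR estimate is mapped to the logit scale, a normal approximation is applied, and the bounds are mapped back. The standard-error form \(1/\sqrt{n\,\theta (1-\theta)}\), with \(n\) the number of positive examples, is our assumed reading of the method rather than the exact implementation used.

import numpy as np
from scipy.special import expit, logit
from scipy.stats import norm

def logit_ci_aucpr(aucpr, n_pos, alpha=0.05):
    """Confidence interval for AUC-PR via the logit transform."""
    z = norm.ppf(1 - alpha / 2)
    eta = logit(aucpr)
    # Assumed delta-method standard error on the logit scale.
    se = 1.0 / np.sqrt(n_pos * aucpr * (1.0 - aucpr))
    return float(expit(eta - z * se)), float(expit(eta + z * se))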
3.2.1 In-hospital Mortality Prediction.
In this experiment, we investigated the performance of FL algorithms on predicting a patient's in-hospital mortality from data collected during the first 24 h of their ICU stay. This is a crucial task in clinical settings. When a patient is admitted to the ICU, predicting their mortality, whether at the end of the ICU stay, the hospital stay, or within a fixed period, e.g., 28 days, one month, or three months, provides a proxy for the severity of their condition and helps healthcare providers plan treatment pathways and allocate resources more effectively. Several works have successfully applied machine learning to predict in-hospital mortality [6, 61, 82].
Data. The same data extraction process as in References [14, 37] was employed. For each hospital in the entire eICU dataset, we extracted a cohort of patients aged 16 and above in their first ICU stay whose in-hospital mortality status was recorded. Patients without an APACHE IVa score were excluded; this criterion serves as a proxy for identifying patients with insufficient data or those who were in the database for administrative purposes only. The twenty hospitals with the largest cohorts were then selected as participants in the study. The combined cohort contains 87,003 ICU stays.
For each patient, data within 24 h from ICU admission were extracted. The set of features includes
• demographic information: gender, age, and ethnicity,
• the first and last results of the following laboratory tests: PaO2, PaCO2, PaO2/FiO2 ratio, pH, base excess, Albumin, the significant band of arterial blood gas, HCO3, Bilirubin, Blood Urea Nitrogen (BUN), Calcium, Creatinine, Glucose, Hematocrit, Hemoglobin, international normalized ratio (INR), Lactate, Platelet, Potassium, Sodium, and white blood cell count,
• the first and last as well as the minimum and maximum measurements of the following vital signs: heart rate, systolic blood pressure, mean blood pressure, respiratory rate, temperature (Celsius), SpO2, and Glasgow Coma Scale (GCS),
• whether the hospital admission was for an elective surgery.
A total of 82 covariates were obtained.
Methods. A neural network consisting of two fully connected hidden layers with ReLU activation function and L2 normalization was used; a sketch of the architecture is given below. The first hidden layer contains 100 nodes and the second 50. In the local and centralized settings, the model was trained for 90 epochs. In the FL settings, training took place over 30 communication rounds, with each hospital training the model locally for ten epochs per round.
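A sketch of this network in PyTorch, with the L2 penalty realized as optimizer weight decay as one plausible reading of "L2 normalization"; the weight-decay value, learning rate, and optimizer choice are illustrative assumptions.

import torch.nn as nn
import torch.optim as optim

class MortalityNet(nn.Module):
    """Two fully connected hidden layers (100 and 50 nodes) with ReLU."""
    def __init__(self, num_features=82):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),  # logit of in-hospital mortality
        )

    def forward(self, x):
        return self.net(x)

model = MortalityNet()
# L2 penalty as weight decay (an assumption; value illustrative).
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.BCEWithLogitsLoss()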
3.2.2 AKI Prediction.
The purpose of this experiment was to evaluate the performance of FL algorithms on predicting the risk of a patient developing AKI within the next hour based on data collected during the previous 7 h. AKI is a sudden onset of renal damage or kidney failure that develops within a few hours or days and occurs in at least 5% of hospitalized patients [16]. AKI can affect other organs such as the lungs, heart, and brain, and it significantly increases hospitalization costs as well as mortality risk [13]. Timely detection of AKI could prevent patients from developing chronic kidney disease [39, 71]. Several studies have shown strong performance of machine learning models in predicting AKI [26, 53, 58].
Data. We followed the same data extraction process as in Reference [16]. The RIFLE criteria [7] were used to define AKI. Specifically, a patient at time \(t\) is labeled as suffering from AKI if their urine output has been less than 0.5 ml/kg/h for at least the preceding 6 h; a sketch of this rule is given below. The cohort exclusion criteria were (1) patients who were under 16 years old or stayed in the ICU for less than 12 h and (2) patients whose data for the selected variables were not recorded at least once during their ICU stay. A total of 10,967 patients in 168 hospitals remained after filtering. The top 75% of hospitals with the largest number of patients were selected to participate in the study. The final cohort contains 28 hospitals with a total of 6,641 patients.
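A sketch of this labeling rule, assuming hourly urine-output rates (ml/kg/h) are available as a NumPy array for each ICU stay; the windowing details are illustrative.

import numpy as np

def label_aki(urine_rate_per_hour, threshold=0.5, window_h=6):
    """Label hour t as AKI-positive if urine output has stayed below
    `threshold` ml/kg/h for the preceding `window_h` hours."""
    below = urine_rate_per_hour < threshold
    labels = np.zeros_like(below, dtype=bool)
    for t in range(window_h - 1, len(below)):
        labels[t] = below[t - window_h + 1 : t + 1].all()
    return labels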
For each patient, we extracted data in 7-h sliding windows. The full set of covariates includes
• demographic information: age and gender,
• the minimum and maximum values as well as the range (the difference between the maximum and minimum values) of the following vital signs: heart rate, respiratory rate, mean blood pressure,
• the minimum and maximum values as well as the range of the following lab measurements: SpO2/SaO2, pH, Potassium, Calcium, Glucose, Sodium, HCO3, Hemoglobin, white blood cell count, Platelet count, Urea Nitrogen, Creatinine, GCS,
• interventions: use of vasoactive medications, use of sedative medications, and use of mechanical ventilation.
A total of 22 covariates were obtained.
Methods. Similar to the previous task, a fully connected neural network consisting of two hidden layers with ReLU activation function and L2 normalization was used; however, here each of the two hidden layers contains 512 nodes instead of 100 and 50. In the local and centralized settings, the model was trained for 30 epochs. In the FL settings, training took place over four communication rounds: each hospital trained its local model for 10 local epochs during the first round and 5 local epochs during each subsequent round, as in the schedule sketched below.
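Reusing the fedavg_round sketch from Section 3.1, the uneven local-epoch schedule can be expressed as a simple per-round list; initial_weights, clients, and local_train are hypothetical names carried over from that sketch.

# 10 local epochs in the first round, then 5 in each remaining round.
epoch_schedule = [10, 5, 5, 5]
weights = initial_weights  # hypothetical starting weights
for epochs in epoch_schedule:
    weights = fedavg_round(weights, clients, local_train, epochs)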
3.3 Results and Discussion
Global test performance in terms of AUC-ROC and AUC-PR obtained with each method is shown in Table 1 for in-hospital mortality prediction and Table 2 for AKI prediction. Comparison of ROC curves obtained with FL methods versus centralized and local training is visualized in Figures 1 and 2. Similarly, Figures 3 and 4 in Appendix A compare ROC curves obtained with FL methods to those obtained with CIIL and IIL. In both tasks, all FL methods outperform local training on both metrics, with the exception of FedProx on AKI prediction. In particular, for mortality prediction, all FL methods perform significantly better than local training. Compared with IIL and CIIL, all FL methods achieve better results for mortality prediction, and the same is true for most FL methods on AKI prediction; the only exceptions are FedProx, which obtains worse AUC-ROC and AUC-PR than both IIL and CIIL, and FedAdam, whose AUC-PR is slightly lower than that of CIIL. Overall, for both tasks, FL methods improve over IIL and CIIL. This is unsurprising, given that IIL is known to suffer from catastrophic forgetting [23, 42, 68] while it is non-trivial to obtain optimal results with CIIL due to its instability [68]. Results obtained by FL are also comparable to centralized learning, with the best FL method in each task achieving a global AUC-ROC within 0.01 of that of centralized learning in terms of point estimates. FedAvg and FedAvgM perform consistently well and are among the top three FL methods by global AUC-ROC and AUC-PR in both tasks, behind only FedYogi in AKI prediction. In both tasks, FedAvgM obtained slightly better results than FedAvg, whereas FedProx achieved the lowest scores in both mortality and AKI prediction.
These results strongly favor FL as a viable strategy for facilitating collaboration among organizations in clinical research. Even though performance does not vary much among the different FL methods in our experiments, the simple FL algorithms, namely, FedAvg and FedAvgM, perform slightly better than FedProx, FedAdam, and FedAdagrad. FedProx has been shown to work well in the presence of heavy data heterogeneity [45]. In our dataset, all hospitals are located in the United States and are therefore expected to be relatively consistent in clinical practices and patient demographics. Furthermore, they all participated in the Philips eICU program, which guarantees a certain degree of data standardization. Thus, the differences in data distribution among them are not significant enough to benefit from FedProx. Moreover, the total number of participants is small relative to FL in an IoT setting with a large number of participating devices, where FedProx usually shines [45]. Data homogeneity might also explain the lack of performance gain of FedAdam and FedAdagrad over FedAvg. In addition, both mortality prediction and AKI prediction, like most machine learning tasks on tabular EHR data, require only feed-forward fully connected neural networks with a small number of layers, which might not see considerable performance gains from the use of Adam and Adagrad.