
Received November 15, 2021, accepted December 6, 2021, date of publication December 7, 2021,
date of current version December 22, 2021.


Digital Object Identifier 10.1109/ACCESS.2021.3133700

A Machine Learning Analysis of Health Records
of Patients With Chronic Kidney Disease
at Risk of Cardiovascular Disease
DAVIDE CHICCO 1, CHRISTOPHER A. LOVEJOY 2,3 , AND LUCA ONETO 4,5
1 Institute of Health Policy Management and Evaluation, University of Toronto, Toronto ON M5T 3M7, Canada
2 Computer Science Department, University College London, London WC1E 6BT, U.K.
3 Department of Medicine, University College London Hospital, London NW1 2BU, U.K.
4 DIBRIS, Università di Genova, 16146 Genoa, Italy
5 ZenaByte S.r.l., 16121 Genoa, Italy

Corresponding author: Davide Chicco (davidechicco@davidechicco.it)


This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was
obtained by the original dataset curators [28] and granted to them by Tawam Hospital and the United Arab Emirates University Research
and Ethics Board under Application No. IRR536/17.

ABSTRACT Chronic kidney disease (CKD) describes a long-term decline in kidney function and has many
causes. It affects hundreds of millions of people worldwide every year. It can have a strong negative impact on
patients, especially when combined with cardiovascular disease (CVD): patients with both conditions have
lower survival chances. In this context, computational intelligence applied to electronic health records can
provide insights to physicians that can help them make better decisions about prognoses or therapies. In this
study we applied machine learning to medical records of patients with CKD and CVD. First, we predicted
whether patients develop severe CKD, both excluding and including information about the year it occurred
or the date of the last visit. Our methods achieved a top mean Matthews correlation coefficient (MCC) of
+0.499 in the former case and a mean MCC of +0.469 in the latter case. Then, we performed a feature ranking analysis
to understand which clinical factors are most important: age, eGFR, and creatinine when the temporal
component is absent; hypertension, smoking, and diabetes when the year is present. We then compared our
results with the current scientific literature, and discussed the different results obtained when the time feature
is excluded or included. Our results show that our computational intelligence approach can provide insights
about diagnosis and the relative importance of different clinical variables that otherwise would be impossible to
observe.

INDEX TERMS Machine learning, computational intelligence, feature ranking, electronic health records,
chronic kidney disease, CKD, cardiovascular diseases, CVD.

The associate editor coordinating the review of this manuscript and approving it for publication
was Xianzhi Wang.

I. INTRODUCTION
Chronic kidney disease (CKD) kills around 1.2 million people and affects more than 700 million
people worldwide every year [1]. CKD is commonly caused by diabetes and high blood pressure,
and is more likely to develop in subjects with a family history of CKD.
Individuals with chronic kidney disease are at higher risk of cardiovascular disease (such as
myocardial infarction, stroke, heart failure) [2], and patients with both diseases are more
likely to have worse prognoses [3].
In this context, computational intelligence methods applied to electronic medical records of
patients can provide interesting and useful information to doctors and physicians, helping them
to more precisely predict the trend of the condition and consequently to make decisions on the
therapies. Several studies involving analyses done with machine learning applied to clinical
records of patients with CKD have appeared in the biomedical literature in the recent
past [4]–[26].
Among the studies found, a large number involves applications of machine learning methods to
the Chronic Kidney Disease dataset of the University of California Irvine Machine Learning
Repository [27].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
165132 VOLUME 9, 2021
D. Chicco et al.: Machine Learning Analysis of Health Records of Patients With CKD at Risk of CVD

On this dataset, Shawan et al. [16] and Abrar et al. [18] employed several data mining methods
for patient classification in their PhD theses. Wibawa et al. [8] applied a correlation-based
feature selection method and AdaBoost to this dataset, while Al Imran et al. [13] employed deep
learning techniques to the same end.
Rashed-al-Mahfuz et al. [24] also employed a number of machine learning methods for patient
classification and described the dataset precisely. Ali et al. [21] applied several machine
learning methods to the same dataset to determine a global threshold to discriminate between
useful clinical factors and irrelevant ones.
Salekin and Stankovic [6] used Lasso for feature selection, while Belina et al. [15] applied a
hybrid wrapper and filter based feature selection for the same purpose.
Tazin et al. [5] employed several data mining methods for patient classification. Ogunleye and
Wang [11] used an enhanced XGBoost method for patient classification. Satukumati and Satla [17]
used several techniques for feature extraction. Elhoseny et al. [19] developed a method called
Density based Feature Selection (DFS) with Ant Colony based Optimization (D-ACO) algorithm for
the classification of patients with CKD. Polat et al. [7] showed an application of a Support
Vector Machine variant for patient classification to the same dataset. Chittora et al. [22]
applied numerous machine learning classifiers and their variants for patient classification.
Zeynu and Patil [12] published a survey on computational intelligence methods for binary
classification and feature selection applied on the same dataset. Charleonnan et al. [4] applied
numerous machine learning classifiers and their variants for patient classification. Subasi et
al. [9] focused on Random Forests for patient classification and feature ranking. Zeynu and
Patil [10] applied numerous machine learning classifiers for patient classification and clinical
feature selection. All these studies were focused more on the improvement and enhancement of
computational intelligence methods, rather than on clinical implications of the results.
Few studies published recently employed datasets different from the UC Irvine ML Repository one.
Ventrella et al. [23] applied several machine learning methods to an original dataset of EHRs
collected at the hospital of Vimercate (Italy) for assessing Chronic Kidney Disease progression.
This study indicated creatinine level, urea, red blood cells count, and eGFR trend among the most
relevant clinical factors for CKD advancement, highlighting that eGFR did not turn out to be
the most important one.
Ravizza et al. [20] employed machine learning methods on a dataset of patients with diabetes
from the IBM Explorys database to predict if they will develop CKD. This study states that the
usage of diabetes-related data can generate better predictions on data of patients with CKD.
To the best of our knowledge, no study published before involves the usage of machine learning
methods to investigate a dataset of patients with both CKD and CVD.

FIGURE 1. Flowchart of the computational pipeline of this study. Cylinder shape: dataset.
Rectangular shape: process. Parallelogram shape: input/output.

In this manuscript, we analyzed a dataset of 491 patients from United Arab Emirates, released by
Al-Shamsi et al. [28] in 2018 (section II). In their original study, the authors employed
multivariable Cox's proportional hazards to identify the independent risk factors causing
CKD at stages 3-5. Although this analysis was interesting, it did not involve a data mining
step, which instead could retrieve additional information or unseen patterns in these data.
To fill this gap, we perform here two analyses: first, we apply machine learning methods to
binary classify the serious CKD development, and then to rank the clinical features by
importance. In addition to what Al-Shamsi et al. [28] did, we also performed the same
analysis excluding the year when the disease happened to each patient (Figure 1).
As major results, we show that computational intelligence is capable of predicting a serious CKD
development with or without the time information, and that the most important clinical features
change if the temporal component is considered or not.
We organize the rest of the paper as follows. After this Introduction, we describe the dataset
we analyzed (section II) and the methods we employed (section III). We then report the binary
classification and feature ranking results (section IV) and discuss them afterwards (section V).
Finally, we recap the main points of this study and mention limitations and future
developments (section VI).


TABLE 1. Meaning, measurement unit, and possible values of each feature of the dataset. ACEI: Angiotensin-converting enzyme inhibitors. ARB:
Angiotensin II receptor blockers. mmHg: millimetre of mercury. kg: kilogram. mmol: millimoles.

TABLE 2. Binary features quantitative characteristics. All the binary features have meaning true
for the value 1 and false for the value 0, except sex (0 = female and 1 = male). The dataset
contains medical records of 491 patients.

II. DATASET
In this study, we examine a dataset of electronic medical
records of 491 patients collected at the Tawam Hospital in
Al-Ain city (Abu Dhabi, United Arab Emirates), between
1st January and 31st December 2008 [28]. The patients
included 241 women and 250 men, with an average age of
53.2 years (Table 2 and Table 3).
Each patient has a chart of 13 clinical variables, expressing
her/his values of laboratory tests and exams or data about
her/his medical history (Table 1). Each patient included
in this study had cardiovascular disease or was at risk of
cardiovascular disease, according to the standards of Tawam
Hospital [28].
Several features regard the personal history of the
patient: diabetes history, dyslipidemia history, hypertension
history, obesity history, smoking history, and vascular
disease history (Table 2) state whether the patient's history
included those specific diseases or conditions. Dyslipidemia
indicates an excessive presence of lipids in the blood. Two
variables refer to the blood pressure (diastolic blood pressure
and systolic blood pressure), and other variables refer
to blood levels obtained through laboratory tests (cholesterol,
creatinine). A few features state whether the patients have
taken disease-specific medicines (dyslipidemia medications,
diabetes medications, and hypertension medications)
or inhibitors (angiotensin-converting-enzyme inhibitors,
or angiotensin II receptor blockers), which are known to
be effective against cardiovascular diseases [29] and
hypertension [30]. The remaining factors describe the physical
conditions of each patient: age, body–mass index, biological sex (Table 2).
Among the clinical features available for this dataset, the EventCKD35 binary variable states
if the patient had chronic kidney disease at high stage (3rd, 4th, or 5th stage). According to
the Kidney Disease Improving Global Outcomes (KDIGO) organization [31], CKD can be
grouped into 5 stages:
• Stage 1: normal kidney function, no CKD;
• Stage 2: mildly decreased function of kidney, mild CKD;
• Stage 3: moderate decrease of kidney function, moderate CKD;


• Stage 4: severe decrease of kidney function, severe CKD;
• Stage 5: extreme CKD and kidney failure.

TABLE 3. Numeric feature quantitative characteristics. σ: standard deviation.

When the EventCKD35 variable has value 0, the patient's kidney condition is at stage 1 or 2.
Instead, when EventCKD35 equals 1, the patient's kidney is at stage 3, 4, or 5 (Table 1).
Even if the value of eGFR plays a role in the definition of the CKD stages in the KDIGO
guidelines [31], we found weak correlation between the eGFRBaseline variable and the target
variable EventCKD35 in this dataset. The two variables have Pearson correlation coefficient
equal to −0.36 and Kendall rank correlation coefficient of −0.3, both in the [−1, +1] interval
where −1 indicates perfectly opposite correlation, 0 indicates no correlation, and +1 indicates
perfect correlation.
The time year derived factor indicates in which year the patient had a serious chronic kidney
disease, or the year when he/she had his/her last outpatient visit, whichever occurred
first (Supplementary information), in the follow-up period.
All the dataset features refer to the first visits had by the patients in January 2008, except
the EventCKD35 and the time year variables that refer to the end of the follow-up
period, in June 2017.
More information about this dataset can be found in the original article [28].

III. METHODS
The problem described earlier (section I) can be addressed as a conventional binary
classification problem, where the goal is to predict EventCKD35, using the data described
earlier (section II). This target feature indicates if the patient has chronic kidney disease
in stage 3 to 5, which represents an advanced stage.
In binary classification, the problem is to identify the unknown relation $R$ between the input
space $\mathcal{X}$ (in our case: the features described in Section II) and an output space
$\mathcal{Y} \subseteq \{0, 1\}$ (in our case: the EventCKD35 target) [32]. Once a relation is
established, one can find a way to discover what the most influencing factors are in the input
space for predicting the associated element in the output space, namely to determine the feature
importance [33].
Note that $\mathcal{X}$ can be composed of categorical features (the values of the features
belong to a finite unsorted set) and numerical-valued features (the values of the features
belong to a possibly infinite sorted set). In case of categorical features, one-hot
encoding [34] can map them into a series of numerical features. The resulting feature space is
$\mathcal{X} \subseteq \mathbb{R}^d$.
A set of data $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, with $x_i \in \mathcal{X}$
and $y_i \in \mathcal{Y}$, is available in a binary classification framework.
Moreover, some values of $x_i$ might be missing [35]. In this case, if the missing value is
categorical, we introduce an additional category for missing values for the specific feature.
Instead, if the missing value is associated with a numerical feature, we replace the missing
value with the mean value of the specific feature, and we introduce an additional logical
feature to indicate if the value of the feature is missing for a particular sample [35].
Our goal is to identify a model $M : \mathcal{X} \to \mathcal{Y}$, which best approximates $R$,
through an algorithm $A_{\mathcal{H}}$ characterized by its set of hyper-parameters
$\mathcal{H}$. The accuracy of the model $M$ in representing the unknown relation $R$ is measured
using different indices of performance (Supplementary information).
Since the hyper-parameters $\mathcal{H}$ influence the ability of $A_{\mathcal{H}}$ to estimate
$R$, we need to adopt a proper Model Selection (MS) procedure [36]. In this work, we exploited
the Complete Cross Validation (CCV) procedure [36]. CCV relies on a simple idea: we resample the
original dataset $\mathcal{D}_n$ many ($n_r = 500$) times without replacement to build a
training set $\mathcal{L}^r_l$ of size $l$, while the remaining samples are kept in the
validation set $\mathcal{V}^r_v$, with $r \in \{1, \cdots, n_r\}$. In order to perform the MS
phase, that is, to select the best combination of the hyper-parameters $\mathcal{H}$ in the set
of possible ones $\mathfrak{H} = \{\mathcal{H}_1, \mathcal{H}_2, \cdots\}$ for the algorithm
$A_{\mathcal{H}}$, the hyper-parameters which minimize the average error of the model, trained
on the training set and evaluated on the validation set, should be selected. Since the data in
$\mathcal{L}^r_l$ are independent from the ones in $\mathcal{V}^r_v$, the idea is that
$\mathcal{H}^*$ should be the set of hyper-parameters which allows to achieve a small error on a
data set that is independent from the training set.
Finally, we need to estimate the error (EE) of the optimal model with a separate set of data
$\mathcal{T}_m = \{(x^t_1, y^t_1), \cdots, (x^t_m, y^t_m)\}$, since the error that our model
commits over $\mathcal{D}_n$ would be optimistically biased, because $\mathcal{D}_n$ has been
used to find $M$.
Additionally, another aspect to consider in this analysis is that data available in health
informatics are often unbalanced [37]–[39], and most learning algorithms do not work well with
imbalanced datasets and tend to perform poorly on the minority class. For these reasons, several
techniques have been developed in order to address this issue [40]. Currently the most practical
and effective method involves the resampling of the data in order to synthesize a balanced
dataset [40]. For this purpose, we can under-sample or over-sample the dataset. Under-sampling
balances the dataset by reducing the size of the abundant class. By keeping all samples in the
rare class and randomly selecting an equal number of samples in the abundant class, a new
balanced dataset can be retrieved for further modeling. Note that this method wastes a lot of
information (many


samples might be discarded). For this reason, scientists take advantage of the over-sampling
strategy more often. Over-sampling tries to balance the dataset by increasing the number of
rare samples. Rather than removing abundant samples, new rare samples are generated (for example
by repetition, by bootstrapping, or by synthetic minority). The latter method is the one that we
employed in this study: synthetic minority oversampling [41], [42].
Another important property of $M$ is its interpretability, namely the possibility to understand
how it behaves. There are two options to investigate this property. The first one is to learn an
$M$ such that its functional form is, by construction, interpretable [43] (for example, Decision
Trees and Rule based models); this solution, however, usually results in poor generalization
performances. The second one, used when the functional form of $M$ is not interpretable by
construction [43] (for example, Kernel Methods or Neural Network), is to derive its
interpretability a posteriori. A classical method for reaching this goal is to perform a feature
ranking procedure [33], [44] which gives a hint to the users of $M$ about the most important
features which influence its results.

A. BINARY CLASSIFICATION ALGORITHMS
In this paper, for the algorithm $A_{\mathcal{H}}$, we will exploit different state-of-the-art
models. In particular we will exploit Random Forests [45], Support Vector Machines (linear and
kernelized with the Gaussian Kernel) [46], [47], Neural Network [48], Decision Tree [49],
XGBoost [50], and One Rule [51].
We tried a number of different hyper-parameter configurations for the machine learning methods
employed in this study.
For Random Forests, we set the number of trees to 1000 and we searched the number of variables
randomly sampled as candidates at each split in {1, 2, 4, 8, 16}, the minimum size of samples in
the terminal nodes of the trees in {1, 2, 4, 8}, and the percentage of samples (sampled with
bootstrap) during the creation of each tree in {60, 80, 100, 120} [52]–[55]. For the linear and
kernelized Support Vector Machines [46], we searched the regularization hyper-parameters in
$\{10^{-6.0}, 10^{-5.8}, \cdots, 10^{4}\}$ and, for the kernelized Support Vector Machines, we
used the Gaussian Kernel [47] and we searched the kernel hyper-parameters in
$\{10^{-6.0}, 10^{-5.8}, \cdots, 10^{4}\}$. For the Neural Network we used a single hidden layer
network (hyperbolic tangent as activation function in the hidden layer) with
dropout (mlpKerasDropout in the caret [56] R package), we trained it with adaptive subgradient
methods (batch size equal to 32), and we tuned the following hyper-parameters: the number of
neurons in the hidden layer in {10, 20, 40, 80, 160, 320, 640, 1280}, the dropout rate of the
hidden layer in {0.001, 0.002, 0.004, 0.008}, the learning rate in
{0.001, 0.002, 0.005, 0.01, 0.02, 0.05}, the fraction of gradient to keep at each step in
{0.01, 0.05, 0.1, 0.5}, and the learning rate decay in {0.01, 0.05, 0.1, 0.5}. For Decision Tree
we searched the max depth of the trees in {4, 8, 16, 24, 32} (rpart2 in the caret [56] R
package). For XGBoost we set tree gradient boosting and we searched the Booster Parameters in
{0.001, 0.002, 0.004, 0.008, 0.01, 0.02, 0.04, 0.08}, the number of trees in {100, 500, 1000},
the minimum loss reduction to make a split in {0, 0.001, 0.005, 0.01}, the fraction of samples in
{1, 0.9, 0.7} and features {1, 0.5, 0.2, 0.1} used to train the trees, the maximum number of
leaves in {1, 2, 4, 8, 16}, and the regularization hyper-parameters in
$\{10^{-6.0}, 10^{-5.8}, \cdots, 10^{4}\}$ [50]. For One Rule we did not have to tune
hyper-parameters (OneR in the caret [56] R package).
Note that these methods have been shown to be a set of the simplest yet best performing methods
available in scientific literature [57], [58]. The difference between the methods is just the
functional form of the model which tries to better approximate a learning principle.
For example, Random Forests and XGBoost try to implement the wisdom of the crowd principles,
Support Vector Machines are robust maximum margin classifiers, and Decision Tree and One Rule
represent very easy to interpret models. In this paper we tested multiple algorithms since the
no-free-lunch theorem [59] assures us that, for a specific application, it is not possible to
know, a priori, what algorithm will perform better on a specific task. Then we tested the ones
which, in the past, have been shown to perform well on many tasks and identified the best one for
our application.

B. FEATURE RANKING
Feature ranking methods based on Random Forests are among the most effective
techniques [60], [61], particularly in the context of bioinformatics [62], [63] and health
informatics [64]. Since Random Forests obtained the top prediction scores for binary
classification, we focus on this method for feature ranking.
Several measures are available for feature importance in Random Forests. A powerful approach is
the one based on the Permutation Importance or Mean Decrease in Accuracy (MDA), where the
importance is assessed for each feature by removing the association between that feature and the
target. This effect is achieved by randomly permuting [65] the values of the feature and
measuring the resulting increase in error. The influence of the correlated features is also
removed.
In detail, for every tree, the method computes two quantities: the first one is the error on the
out-of-bag samples as they are used during prediction, while the second one is the error on the
out-of-bag samples after a random permutation of the values of a variable. These two values are
then subtracted and the average of the result over all the trees in the ensemble is the raw
importance score for the variable under exam.
Despite the effectiveness of MDA, when the number of samples is small these methods might turn
out to be unstable [66]–[68]. For this reason, in this work, instead of running the Feature
Ranking (FR) procedure just once, analogously to what we have done for MS and EE, we sub-sample
the original dataset and we repeat the procedure many


TABLE 4. CKD development binary classification results. Linear SVM: Support Vector Machine with linear kernel. Gaussian SVM: Support Vector Machine
with Gaussian kernel. MCC: Matthews correlation coefficient (worst value = −1 and best value = +1). TP rate: true positive rate, sensitivity, recall. TN rate:
true negative rate, specificity. PR: precision-recall curve. PPV: positive predictive value, precision. NPV: negative predictive value. ROC: receiver operating
characteristic curve. AUC: area under the curve. F1 score, accuracy, TP rate, TN rate, PPV, NPV, PR AUC, ROC AUC: worst value = 0 and best value = +1.
Confusion matrix threshold for TP rate, TN rate, PPV, and NPV: 0.5. We highlighted in blue and with an asterisk * the top results for each score. We report
the formulas of these rates in the Supplementary Information.
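
The rates named in this caption follow the usual confusion-matrix definitions. The Supplementary Information is not reproduced here, so the block below is only a minimal sketch based on the standard textbook formulas; the function name `confusion_rates` and the example counts are illustrative assumptions, not values taken from Table 4.

```python
import math


def confusion_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard confusion-matrix rates for a binary classifier.

    tp, fn, tn, fp are hypothetical counts of true positives, false
    negatives, true negatives, and false positives at a fixed threshold.
    """
    def ratio(num: float, den: float) -> float:
        # Return 0.0 when a rate is undefined instead of dividing by zero.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": ratio(tp * tn - fp * fn, mcc_den),       # in [-1, +1]
        "F1": ratio(2 * tp, 2 * tp + fp + fn),          # in [0, 1]
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "TP rate": ratio(tp, tp + fn),                  # sensitivity, recall
        "TN rate": ratio(tn, tn + fp),                  # specificity
        "PPV": ratio(tp, tp + fp),                      # precision
        "NPV": ratio(tn, tn + fn),
    }


# Made-up imbalanced example: 10 positives and 90 negatives.
rates = confusion_rates(tp=8, fn=2, tn=80, fp=10)
```

In this made-up example the accuracy is 0.88 while the PPV is only about 0.44, which illustrates why a single rate can look good on an imbalanced dataset while another reveals a weakness.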

times. The final rank of a feature will be the aggregation of the different rankings using
Borda's method [69].

C. BIOSTATISTICS UNIVARIATE TESTS
Before employing machine learning algorithms, we applied traditional univariate biostatistics
techniques to evaluate the relationship between the EventCKD35 target and each feature.
We made use of the Mann–Whitney U test (also known as Wilcoxon rank–sum test) [70] for the
numerical features and of the chi–square test [71] for the binary features. The p-values of both
these tests range between 0 and 1: a low p-value means that the analyzed variable strongly
relates to the target feature, while a high p-value means no evident relation. These tests are
also useful to detect the importance of each feature with respect to the target: the lower the
p-value of a feature, the stronger its association with the target. Following the recent advice
of Benjamin et al. [72], we use 0.005 as threshold of significance for the p-values, that is
$5 \times 10^{-3}$. If the p-value of a test applied to a variable and the target is lower than
0.005, we consider the association between the variable and the target significant.

D. PREDICTION AND FEATURE RANKING INCLUDING TEMPORAL FEATURE
In the second analysis we performed for chronic kidney disease prediction, we decided to include
the temporal component expressing in which year the disease occurred for the CKD patients or in
which year they had their last outpatient visit (Supplementary information).
We applied a Stratified Logistic Regression [73], [74] to this complete dataset, including all
the original clinical features and the derived year feature, both for supervised binary
classification and feature ranking. We measured the prediction with the typical confusion matrix
rates (MCC, F1 score, and others), and the importance for each variable as the logistic
regression model coefficient. This method has no significant hyper-parameters so we did not
perform any optimization (glm method of the stats R package).

IV. RESULTS
In this section, we report the results for the prediction of the chronic kidney
disease (subsection IV-A) and its feature ranking (subsection IV-B).

A. CHRONIC KIDNEY DISEASE PREDICTION RESULTS
1) CKD PREDICTION
We report the results obtained for the static prediction of the CKD measured with traditional
confusion matrix indicators in Table 4. We rank our results by the Matthews correlation
coefficient (MCC) because it is the only confusion matrix rate that generates a high score only
if the classifier was able to correctly predict most of the data instances and correctly make
most of the predictions, both on the positive class and the negative class [75]–[78].
Random Forests outperformed all the other methods for MCC, F1 score, accuracy, sensitivity,
negative predictive value, precision recall AUC, and receiver operating characteristic
AUC (Table 4), while the support vector machine with Gaussian kernel achieved the top
specificity and precision.
Because of the imbalance of the dataset (section II), all the classifiers attained better
results among the negative data instances (specificity and NPV) than among the positive
elements (sensitivity and precision). This consequence


FIGURE 2. Calibration curve and plots for the results obtained by Random Forests predictions
applied on the dataset excluding the temporal component (Table 4).

FIGURE 3. Calibration plot for the Stratified Logistic Regression predictions applied on the
dataset including the temporal component (Table 5).
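
A calibration curve such as the ones in Figures 2 and 3 compares predicted probabilities with observed event frequencies; a perfectly calibrated model lies on the x = y line. The block below is a minimal sketch that assumes equal-width probability bins (the binning actually used for the figures is not stated in this section); the function name `calibration_curve` and the toy labels and probabilities are illustrative assumptions.

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return one (mean predicted probability, observed positive fraction)
    pair per non-empty equal-width bin over [0, 1].

    y_true holds 0/1 labels and y_prob holds predicted probabilities of
    the positive class, assumed to lie in [0, 1].
    """
    # Per bin: sum of predicted probabilities, number of positives, count.
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index][0] += prob
        bins[index][1] += label
        bins[index][2] += 1
    return [(total / count, positives / count)
            for total, positives, count in bins if count > 0]


# Toy data: four low-probability and four high-probability predictions.
curve = calibration_curve(
    y_true=[0, 0, 0, 1, 0, 1, 1, 1],
    y_prob=[0.1, 0.1, 0.1, 0.1, 0.6, 0.6, 0.6, 0.6],
    n_bins=5,
)
```

Each returned pair gives one point of the curve: the x coordinate is the mean predicted probability in the bin and the y coordinate is the fraction of positive samples observed there.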

happens because each classifier can observe and learn to recognize more individuals without CKD
during training, and therefore is more capable of recognizing them than recognizing patients
with CKD during testing.
XGBoost and One Rule obtained Matthews correlation coefficients close to 0, meaning that their
performance was similar to random guessing. Random Forests, linear SVM, and Decision Tree were
the only methods able to correctly classify most of the positive data instances (TP rate = 0.792,
0.6, and 0.588, respectively). No technique was capable of correctly making most of the positive
predictions: all PPVs are below 0.5 (Table 4).
Regarding negatives, SVM with Gaussian kernel obtained an almost perfect specificity (0.940),
while Random Forests achieved an almost perfect NPV of 0.968 (Table 4).
These results show that the machine learning classifiers Random Forests and SVM with Gaussian
kernel can efficiently predict patients with CKD and patients without CKD from their electronic
health records, with high prediction scores, in a few minutes.
Since Random Forests turned out to be the best performing classifier, we also included the
calibration curve plot [79] of its predictions (Figure 2), for the sake of completeness. The
curve follows the trend of the x = y perfect line translated on the x axis between
approximately 5% and approximately 65%, indicating well calibrated predictions in this interval.

2) CKD PREDICTION INCLUDING TEMPORAL COMPONENT
To show a scenario where no previous disease history of a patient is available, we did not
include any temporal component providing information about the progress of the disease in the
previous analysis. We then decided to perform a stratified prediction including a time feature
indicating the year when the patient developed the chronic kidney disease, or the last visit for
non-CKD patients (Supplementary information). After having included the year information in the
dataset, we applied a Stratified Logistic Regression [74], [80], as described
earlier (section III).
The presence of the temporal feature actually improved the prediction, allowing the regression
to obtain an MCC of +0.469, better than all the MCCs achieved by the classifiers applied to the
static dataset version except Random Forests (Table 5). Also in this case, specificity and NPV
are much higher than sensitivity and precision, because of the imbalance of the dataset.
This result comes with no surprise: it makes complete sense that the inclusion of a temporal
feature describing the trend of a disease could improve the prediction quality.
To better understand the prediction obtained by the Stratified Logistic Regression, we plotted a
calibration curve [79] of its predictions (Figure 3). As one can notice, the Stratified Logistic
Regression returns well calibrated predictions, as its trend follows the x = y line, which
represents the perfect calibration, from approximately 5% to approximately 75% of the
probabilities. This calibration curve confirms that the Stratified Logistic Regression made
a good prediction.

B. FEATURE RANKING RESULTS
1) CKD PREDICTIVE FEATURE RANKING
After verifying that computational intelligence is able to predict CKD developments among
patients, we applied


TABLE 5. CKD prediction results including the temporal feature. The dataset analyzed for these tests contains the time year feature indicating in which
year after the baseline visits the patient developed the CKD. All the abbreviations have the same meaning described in the caption of Table 4.

TABLE 6. Feature ranking through biostatistics univariate tests. We employed the Mann–Whitney U test [70] for the numerical features and the chi–square test [71] for the binary features. We reported in blue and with an asterisk * the only features having a p-value lower than the 0.005 threshold, that is 5 × 10^−3.

TABLE 7. Feature ranking generated by Random Forests. MDA average position: average position obtained by each feature through the accuracy decrease feature ranking of Random Forests.
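The univariate screening summarised in the Table 6 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses only the standard library, approximates the Mann–Whitney U p-value with a tie-corrected normal approximation, and computes the chi-square p-value for a 2 × 2 table without continuity correction. The function names and the toy values below are hypothetical; only the feature names and the 0.005 threshold come from the text.

```python
import math

ALPHA = 0.005  # significance threshold quoted in the Table 6 caption


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    n = n1 + n2
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of this group of ties
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))


def chi_square_2x2_p(feature, target):
    """Pearson chi-square p-value (1 degree of freedom) for two 0/1 lists."""
    table = [[0, 0], [0, 0]]
    for f, t in zip(feature, target):
        table[f][t] += 1
    n = len(feature)
    rows = [sum(r) for r in table]
    cols = [table[0][c] + table[1][c] for c in range(2)]
    if 0 in rows or 0 in cols:
        return 1.0
    chi2 = sum(
        (table[r][c] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
        for r in range(2) for c in range(2)
    )
    return math.erfc(math.sqrt(chi2 / 2.0))   # survival function for 1 dof


def rank_features(numeric, binary, target):
    """Return (name, p_value, significant) tuples sorted by ascending p-value."""
    results = []
    for name, values in numeric.items():
        pos = [v for v, t in zip(values, target) if t == 1]
        neg = [v for v, t in zip(values, target) if t == 0]
        results.append((name, mann_whitney_p(pos, neg)))
    for name, values in binary.items():
        results.append((name, chi_square_2x2_p(values, target)))
    results.sort(key=lambda item: item[1])
    return [(name, p, p < ALPHA) for name, p in results]


# Toy data (invented for illustration): 10 patients with the CKD event, 10 without.
target = [1] * 10 + [0] * 10
numeric = {"AgeBaseline": [70, 68, 72, 65, 74, 69, 71, 66, 73, 67,
                           45, 50, 48, 52, 47, 49, 51, 46, 53, 44]}
binary = {"HistorySmoking": [1, 0] * 10}

ranking = rank_features(numeric, binary, target)
for name, p_value, significant in ranking:
    print(f"{name}: p = {p_value:.4g}{' *' if significant else ''}")
```

With real data, a statistics library would normally replace the two hand-written tests; the sketch only makes explicit how one p-value per feature leads to a ranking and to the asterisk marking.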

a feature ranking approach to detect the most predictive features in the clinical records. We employed two techniques: one based on traditional univariate biostatistics tests, and one based on machine learning.

Regarding the biostatistics phase, we applied the Mann–Whitney test and the chi-squared test to each variable in relationship with the CKD target (subsection III-C), and ranked the features by p-value (Table 6).

The application of these biostatistics univariate tests, although useful, shows a huge number of relevant variables: 13 variables out of 19 turn out to be significant, having a p-value smaller than 0.005 (Table 6). Since the biostatistics tests affirm that 68.42% of clinical factors are important, this information does not help us to detect the relevance of the features with enough precision. For this reason, we decided to calculate the feature ranking with machine learning, by employing Random Forests, which is the method that achieved the top performance results in the binary classification earlier (subsection IV-A).

We therefore applied the Random Forests feature ranking, and ranked the results by mean accuracy decrease position (Table 7 and Figure 4).

The two rankings show some common aspects, both listing AgeBaseline and eGFRBaseline in top positions, but also show some significant differences. The biostatistics standing, for example, lists dBPBaseline as an irrelevant predictive feature (Table 6), while Random Forests puts it in the 4th position out of 19 (Table 7). Also, the biostatistics tests stated that HistoryDiabetes is one of the most significant factors, with a p-value of 0.0005 (Table 6), while the machine learning approach put the same feature in the last position of its ranking.

The two rankings contain other minor differences that we consider unimportant.

2) CKD PREDICTIVE FEATURE RANKING CONSIDERING THE TEMPORAL COMPONENT

As we did earlier for the CKD prediction, we decided to re-run the feature ranking procedure by including the temporal component regarding the year when the patient developed chronic kidney disease or the year of the last visit. Again, we employed Stratified Logistic Regression.

The ranking generated considering the time component (Table 8) showed several differences with respect to the previously described ranking generated without it (Table 7). The most relevant differences in ranking positions are the following:

• HTNmeds is at the 1st position in this ranking, while it is 14th without considering time;


TABLE 8. Clinical feature ranking generated by the Stratified Logistic Regression, depending on the temporal component (the year when the CKD happened or of patient’s last visit). Importance: average coefficient of the trained logistic regression model out of 100 executions.

FIGURE 4. Barplot of the Random Forests feature ranking. MDA average position: average position obtained by each feature through the accuracy decrease feature ranking of Random Forests.
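The "MDA average position" named in this caption can be illustrated with a short Python sketch. It assumes, as a simplification, that the accuracy-decrease score of every feature is already available for each Random Forests execution; the number of executions, the scores, and the function names below are hypothetical, and ties within a run are broken by input order.

```python
from collections import defaultdict


def positions_from_scores(scores):
    """Map feature -> position for one run (1 = largest accuracy decrease)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def average_positions(runs):
    """Average each feature's position over all runs, best (lowest) first."""
    totals = defaultdict(float)
    for scores in runs:
        for name, pos in positions_from_scores(scores).items():
            totals[name] += pos
    averaged = {name: total / len(runs) for name, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1])


# Invented accuracy-decrease scores for three executions (illustration only).
runs = [
    {"AgeBaseline": 0.040, "eGFRBaseline": 0.035, "dBPBaseline": 0.010},
    {"AgeBaseline": 0.030, "eGFRBaseline": 0.038, "dBPBaseline": 0.012},
    {"AgeBaseline": 0.045, "eGFRBaseline": 0.031, "dBPBaseline": 0.009},
]

mda_ranking = average_positions(runs)
for name, avg_pos in mda_ranking:
    print(f"{name}: average position {avg_pos:.2f}")
```

Averaging positions rather than raw scores keeps a single execution with unusually large accuracy decreases from dominating the final standing.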
• HistoryHTN is at the 3rd position in this ranking, while it is 10th without considering time;
• ACEIARB is at the 4th position in this ranking, while it is 17th without considering time;
• AgeBaseline is at the last position in this ranking, while it is 1st without considering time;
• CreatinineBaseline is at the 18th position in this ranking, while it is 9th without considering time.

We also decided to measure the difference between these two rankings through two traditional metrics such as Spearman’s rank correlation coefficient and Kendall distance [81]–[83]. Both these metrics range between –1.0 and +1.0, with –1.0 meaning opposite rank orders, 0.0 meaning no correlation between lists, and +1.0 meaning identical ranking.

The comparison between ranking without time (Table 7) and ranking considering time (Table 8) generated Spearman’s ρ = −0.209 and Kendall τ = −0.146.

V. DISCUSSION

A. CKD PREDICTION

Our results show that machine learning methods are capable of predicting chronic kidney disease from medical records of patients at risk of cardiovascular disease, both including the temporal information about the year when the patient has developed the CKD and without it. These findings can have an immediate impact in the clinical settings: physicians, in fact, can take advantage of our methods to forecast the likelihood of a patient having chronic kidney disease, in a few minutes, and then use this information to establish the urgency of the case. Our techniques, of course, do not replace laboratory exams and tests, which will still be needed to further verify and understand the prognosis of the disease. However, if used efficiently, our methods will provide quick and reliable information to physicians to help them with medical decision making.

B. FEATURE RANKING

As mentioned earlier (subsection IV-B), some significant differences emerge between the feature ranking obtained without the time component and generated through Random Forests (Table 7) and the feature ranking obtained considering the year when the patient had the serious CKD development and generated through Stratified Logistic Regression (Table 8).

The features HTNmeds, ACEIARB, and HistoryDiabetes had an increase of 13 positions in the year standing (Table 8), compared to their original position in the static ranking (Table 7). Also, the feature BMIBaseline had an increase, of 10 positions. The AgeBaseline variable, instead, had the biggest position drop possible: it moved from the most important feature in the static standing (Table 7) to the least relevant position in the year standing (Table 8). The other variables in the year standing did not show such large position changes.

These results show that taking medication for hypertension, taking ACE inhibitors, having a personal history of diabetes, and body–mass index have an important role in predicting if a patient will have serious CKD, when the information about the disease event is included. The age of the patient is very important when the CKD year is unknown, but becomes irrelevant here.

C. DIFFERENCE BETWEEN TEMPORAL FEATURE RANKING AND NON-TEMPORAL FEATURE RANKING

The significant differences that emerge suggest strong overlap between the information contained within the time variable and certain variables in the previous model. It is


plausible that some predictors encode a ‘baseline’ level of risk of developing CKD, which is negated if the model knows in which year the CKD developed.

The variables which drop most significantly between the models are age, eGFR and creatinine, which are all clinical indicators of an individual’s baseline risk of CKD. Inspection of variables which maintain or increase their position when the year feature is added identifies hypertension, smoking and diabetes as key predictive factors in the model (subsection IV-B). These are all known to play a central role in the pathogenesis of micro- and macrovascular disease, including of the kidney. While the former variables may encode baseline risk, the latter are stronger indicators for rate of progression.

It is also worth noting that without the temporal information, the model is tasked with predicting whether the individual will develop CKD within the next 10 years. Here, the baseline is highly relevant as it indicates how much further the renal function needs to deteriorate. However, when the configuration is altered to include the year in which the CKD developed, the relative importance of risk factors may be expected to increase, and indeed we observed this in our models.

D. COMPARISON WITH RESULTS OF THE ORIGINAL STUDY

The original study of Al-Shamsi et al. [28] included a feature ranking phase generated through a multivariable Cox’s proportional hazards analysis, which included the temporal component [84]. Their ranking listed older age (AgeBaseline), personal history of coronary heart disease (HistoryCHD), personal history of diabetes mellitus (HistoryDLD), and personal history of smoking (HistorySmoking) as the most important factors for the risk of a serious CKD event.

In contrast to their findings, AgeBaseline was ranked in the last position in our Stratified Logistic Regression standing, while HistoryCHD and HistoryDLD were at unimportant positions: 10th and 16th ranks out of 19 variables, respectively. Smoking history, instead, occupied a high rank both in our standing and in the original study standing: our approach, in fact, listed it as 5th out of 19.

E. COMPARISON WITH RESULTS OF OTHER STUDIES

Several published studies include a feature ranking phase to detect the most relevant variables to predict chronic kidney disease from electronic medical records. Most of them, however, use feature ranking to reduce the number of variables for the binary classification, without reporting a final standing of clinical factors ranked by importance [10], [12], [21].

Only the article of Salekin and Stankovic [6] reports the most relevant variables found in their study: specific gravity, albumin, diabetes, hypertension, hemoglobin, serum creatinine, red blood cells count, packed cell volume, appetite, and sodium were at the top positions. Even if the clinical features present in our datasets mainly differ from theirs, we can notice the difference in the ranking positions between the two studies.

Hypertension was the 4th most important factor in Salekin’s study [6], confirming the importance of the HistoryHTN variable, which is ranked at the 3rd position in our Stratified Logistic Regression ranking (Table 8). Diabetes history also ranks high in both standings: 3rd position in the ranking of Salekin’s study [6], and 6th in our Stratified Logistic Regression ranking, as HistoryDiabetes (Table 8).

VI. CONCLUSION

Chronic kidney disease affects more than 700 million people in the world annually, and kills approximately 1.2 million of them. Computational intelligence can be an effective means to quickly analyze electronic health records of patients affected by this disease, providing information about how likely they are to develop severe stages of this disease, or stating which clinical variables are the most important for diagnosis.

In this article, we analyzed a medical record dataset of 491 patients from the UAE with CKD and at risk of cardiovascular disease, and developed machine learning methods able to predict the likelihood they will develop CKD at stages 3-5, with high accuracy. Afterwards, we employed machine learning to detect the most important variables contained in the dataset, first excluding the temporal component indicating the year when the CKD happened or the patient’s last visit, and then including it. Our results confirmed the effectiveness of our approach.

Regarding limitations, we have to report that we performed our analysis only on a single dataset. We looked for alternative public datasets to use as validation cohorts, but unfortunately we could not find any that have the same clinical features.

In the future, we plan to further investigate the probability of diagnosis prediction in this dataset through classifier calibration and calibration plots [85], and to perform the feature ranking with a different feature ranking method such as SHapley Additive exPlanations (SHAP) [86]. Moreover, we also plan to study chronic kidney disease by applying our methods to CKD datasets of other types, such as microarray gene expression [87], [88] and ultrasonography images [89].

LIST OF ABBREVIATIONS

AUC: area under the curve. BP: blood pressure. CHD: coronary heart disease. CKD: chronic kidney disease. CVD: cardiovascular disease. DLD: dyslipidemia. EE: error estimation. FR: feature ranking. KDIGO: Kidney Disease Improving Global Outcomes. HTN: hypertension. MCC: Matthews correlation coefficient. MDA: mean decrease in accuracy. MS: model selection. NPV: negative predictive value. p-value: probability value. PPV: positive predictive value. PR: precision–recall. ROC: receiver operating characteristic. SHAP: SHapley Additive exPlanations. SVM: Support Vector Machine. TN rate: true negative rate. TP rate: true positive rate. UAE: United Arab Emirates.
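As a supplementary illustration of the rank comparison reported in subsection IV-B (Spearman's ρ and Kendall τ between the rankings of Table 7 and Table 8), the following minimal Python sketch computes both coefficients for two tie-free rankings. It is not the authors' code. The toy input keeps only the relative order of the five features whose positions are quoted in the text (1st, 9th, 10th, 14th, 17th without time; 19th, 18th, 3rd, 1st, 4th with time), so it does not reproduce the reported values ρ = −0.209 and τ = −0.146, which refer to all 19 features.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings given as feature -> position."""
    n = len(rank_a)
    d2 = sum((rank_a[f] - rank_b[f]) ** 2 for f in rank_a)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for tie-free rankings: (concordant - discordant) / pairs."""
    concordant = discordant = 0
    for f, g in combinations(rank_a, 2):
        agreement = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if agreement > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Relative order of five features, re-numbered 1..5 within this subset.
static_rank = {"AgeBaseline": 1, "CreatinineBaseline": 2, "HistoryHTN": 3,
               "HTNmeds": 4, "ACEIARB": 5}
temporal_rank = {"HTNmeds": 1, "HistoryHTN": 2, "ACEIARB": 3,
                 "CreatinineBaseline": 4, "AgeBaseline": 5}

rho = spearman_rho(static_rank, temporal_rank)
tau = kendall_tau(static_rank, temporal_rank)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Both functions assume that the two rankings contain exactly the same features and no ties; rankings with ties would need the tie-aware variants offered by statistical libraries.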


COMPETING INTERESTS

The authors declare they have no competing interest.

ACKNOWLEDGMENT

The authors thank Saif Al-Shamsi (United Arab Emirates University) for having provided additional information about the dataset.

DATA AND SOFTWARE AVAILABILITY

The dataset used in this study is publicly available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license at: https://figshare.com/articles/dataset/Chronic_kidney_disease_in_patients_at_high_risk_of_cardiovascular_disease_in_the_United_Arab_Emirates_A_population-based_study/6711155?file=12242270

Our software code is publicly available under GNU General Public License v3.0 at: https://github.com/davidechicco/chronic_kidney_disease_and_cardiovascular_disease

REFERENCES

[1] V. A. Luyckx, M. Tonelli, and J. W. Stanifer, “The global burden of kidney disease and the sustainable development goals,” Bull. World Health Org., vol. 96, no. 6, p. 414, 2018.
[2] S. Said and G. T. Hernandez, “The link between chronic kidney disease and cardiovascular disease,” J. Nephropathol., vol. 3, no. 3, p. 99, 2014.
[3] K. Damman, M. A. E. Valente, A. A. Voors, C. M. O’Connor, D. J. van Veldhuisen, and H. L. Hillege, “Renal impairment, worsening renal function, and outcome in patients with heart failure: An updated meta-analysis,” Eur. Heart J., vol. 35, no. 7, pp. 455–469, Feb. 2014.
[4] A. Charleonnan, T. Fufaung, T. Niyomwong, W. Chokchueypattanakit, S. Suwannawach, and N. Ninchawee, “Predictive analytics for chronic kidney disease using machine learning techniques,” in Proc. Manage. Innov. Technol. Int. Conf. (MITicon), Bang-Saen, Thailand, Oct. 2016, pp. 80–83.
[5] N. Tazin, S. A. Sabab, and M. T. Chowdhury, “Diagnosis of chronic kidney disease using effective classification and feature selection technique,” in Proc. Int. Conf. Med. Eng., Health Informat. Technol. (MediTec), Dhaka, Bangladesh, Dec. 2016, pp. 1–6.
[6] A. Salekin and J. Stankovic, “Detection of chronic kidney disease and selecting important predictive attributes,” in Proc. IEEE Int. Conf. Healthcare Informat. (ICHI), Chicago, IL, USA, Oct. 2016, pp. 262–270.
[7] H. Polat, H. D. Mehr, and A. Cetin, “Diagnosis of chronic kidney disease based on support vector machine by feature selection methods,” J. Med. Syst., vol. 41, no. 4, p. 55, 2017.
[8] M. S. Wibawa, I. M. D. Maysanjaya, and I. M. A. W. Putra, “Boosted classifier and features selection for enhancing chronic kidney disease diagnose,” in Proc. 5th Int. Conf. Cyber IT Service Manage. (CITSM), Denpasar, Indonesia, Aug. 2017, pp. 1–6.
[9] A. Subasi, E. Alickovic, and J. Kevric, “Diagnosis of chronic kidney disease by using random forest,” in Proc. Int. Conf. Med. Biol. Eng. (CMBEBIH). Singapore: Springer, 2017, pp. 589–594.
[10] S. Zeynu and S. Patil, “Prediction of chronic kidney disease using data mining feature selection and ensemble method,” Int. J. Data Mining Genomics Proteomics, vol. 9, no. 1, pp. 1–9, 2018.
[11] A. Ogunleye and Q.-G. Wang, “Enhanced XGBoost-based automatic diagnosis system for chronic kidney disease,” in Proc. IEEE 14th Int. Conf. Control Autom. (ICCA), Anchorage, AK, USA, Jun. 2018, pp. 805–810.
[12] S. Zeynu and S. Patil, “Survey on prediction of chronic kidney disease using data mining classification techniques and feature selection,” Int. J. Pure Appl. Math., vol. 118, no. 8, pp. 149–156, 2018.
[13] A. A. Imran, M. N. Amin, and F. T. Johora, “Classification of chronic kidney disease using logistic regression, feedforward neural network and wide & deep learning,” in Proc. Int. Conf. Innov. Eng. Technol. (ICIET), Osaka, Japan, Dec. 2018, pp. 1–6.
[14] A. Shrivas, S. K. Sahu, and H. Hota, “Classification of chronic kidney disease with proposed union based feature selection technique,” in Proc. 3rd Int. Conf. Internet Things Connected Technol., Jaipur, India, 2018, pp. 26–27.
[15] S. Belina V. J. Sara and K. Kalaiselvi, “Ensemble swarm behaviour based feature selection and support vector machine classifier for chronic kidney disease prediction,” Int. J. Eng. Technol., vol. 7, no. 2, p. 190, May 2018.
[16] N. R. Shawan, S. S. A. Mehrab, F. Ahmed, and A. S. Hasmi, “Chronic kidney disease detection using ensemble classifiers and feature set reduction,” Ph.D. dissertation, Dept. Comput. Sci. Eng., BRAC Univ., Dhaka, Bangladesh, 2019.
[17] S. B. Satukumati and R. K. S. Satla, “Feature extraction techniques for chronic kidney disease identification,” Kidney, vol. 24, no. 1, p. 29, 2019.
[18] T. Abrar, S. Tasnim, and M. Hossain, “Early detection of chronic kidney disease using machine learning,” Ph.D. dissertation, Dept. Comput. Sci. Eng., BRAC Univ., Dhaka, Bangladesh, 2019.
[19] M. Elhoseny, K. Shankar, and J. Uthayakumar, “Intelligent diagnostic prediction and classification system for chronic kidney disease,” Sci. Rep., vol. 9, no. 1, pp. 1–14, Dec. 2019.
[20] S. Ravizza, T. Huschto, A. Adamov, L. Böhm, A. Büsser, F. F. Flöther, R. Hinzmann, H. König, S. M. McAhren, D. H. Robertson, T. Schleyer, B. Schneidinger, and W. Petrich, “Predicting the early risk of chronic kidney disease in patients with diabetes using real-world data,” Nature Med., vol. 25, no. 1, pp. 57–59, Jan. 2019.
[21] S. I. Ali, B. Ali, J. Hussain, M. Hussain, F. A. Satti, G. H. Park, and S. Lee, “Cost-sensitive ensemble feature ranking and automatic threshold selection for chronic kidney disease diagnosis,” Appl. Sci., vol. 10, no. 16, p. 5663, Aug. 2020.
[22] P. Chittora, S. Chaurasia, P. Chakrabarti, G. Kumawat, T. Chakrabarti, Z. Leonowicz, M. Jasiński, Ł. Jasiński, R. Gono, E. Jasińska, and V. Bolshev, “Prediction of chronic kidney disease - A machine learning perspective,” IEEE Access, vol. 9, pp. 17312–17334, 2021.
[23] P. Ventrella, G. Delgrossi, G. Ferrario, M. Righetti, and M. Masseroli, “Supervised machine learning for the assessment of chronic kidney disease advancement,” Comput. Methods Programs Biomed., vol. 209, Sep. 2021, Art. no. 106329.
[24] M. Rashed-Al-Mahfuz, A. Haque, A. Azad, S. A. Alyami, J. M. W. Quinn, and M. A. Moni, “Clinically applicable machine learning approaches to identify attributes of chronic kidney disease (CKD) for use in low-cost diagnostic screening,” IEEE J. Transl. Eng. Health Med., vol. 9, pp. 1–11, 2021.
[25] S. Krishnamurthy, K. Ks, E. Dovgan, M. Luštrek, B. G. Piletič, K. Srinivasan, Y.-C.-J. Li, A. Gradišek, and S. Syed-Abdul, “Machine learning prediction models for chronic kidney disease using national health insurance claim data in Taiwan,” Healthcare, vol. 9, no. 5, p. 546, May 2021.
[26] M. Gupta and P. Gupta, “Predicting chronic kidney disease using machine learning,” in Emerging Technologies for Healthcare: Internet of Things and Deep Learning Models. Hoboken, NJ, USA: Wiley, 2021, pp. 251–277.
[27] University of California Irvine Machine Learning Repository. (Oct. 4, 2021). Chronic Kidney Disease Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease
[28] S. Al-Shamsi, D. Regmi, and R. D. Govender, “Chronic kidney disease in patients at high risk of cardiovascular disease in the United Arab Emirates: A population-based study,” PLoS ONE, vol. 13, no. 6, Jun. 2018, Art. no. e0199920.
[29] G. S. Francis, “ACE inhibition in cardiovascular disease,” New England J. Med., vol. 342, no. 3, pp. 201–202, Jan. 2000.
[30] J. Agata, D. Nagahara, S. Kinoshita, Y. Takagawa, N. Moniwa, D. Yoshida, N. Ura, and K. Shimamoto, “Angiotensin II receptor blocker prevents increased arterial stiffness in patients with essential hypertension,” Circulat. J., vol. 68, no. 12, pp. 1194–1198, 2004.
[31] Kidney Disease: Improving Global Outcomes (KDIGO) Transplant Work Group, “KDIGO clinical practice guideline for the care of kidney transplant recipients,” Amer. J. Transplantation, vol. 9, p. S1, Nov. 2009.
[32] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[33] A. Altmann, L. Toloşi, O. Sander, and T. Lengauer, “Permutation importance: A corrected feature importance measure,” Bioinformatics, vol. 26, no. 10, pp. 1340–1347, 2010.
[34] M. A. Hardy, Regression With Dummy Variables. Newbury Park, CA, USA: Sage, 1993.
[35] A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons, “Review: A gentle introduction to imputation of missing values,” J. Clin. Epidemiol., vol. 59, no. 10, pp. 1087–1091, Oct. 2006.
[36] L. Oneto, Model Selection and Error Estimation in a Nutshell. Berlin, Germany: Springer, 2020.


[37] K. F. Kerr, “Comments on the analysis of unbalanced microarray data,” Bioinformatics, vol. 25, no. 16, pp. 2035–2041, Aug. 2009.
[38] R. Laza, R. Pavón, M. Reboiro-Jato, and F. Fdez-Riverola, “Evaluating the effect of unbalanced data in biomedical document classification,” J. Integrative Bioinf., vol. 8, no. 3, pp. 105–117, Dec. 2011.
[39] K. Han, K. Z. Kim, J. M. Oh, I. W. Kim, K. Kim, and T. Park, “Unbalanced sample size effect on the genome-wide population differentiation studies,” Int. J. Data Mining Bioinf., vol. 6, no. 5, pp. 490–504, 2012.
[40] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Syst. Appl., vol. 73, pp. 220–239, May 2017.
[41] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, 2002.
[42] T. Zhu, Y. Lin, and Y. Liu, “Synthetic minority oversampling technique for multiclass imbalance problems,” Pattern Recognit., vol. 72, pp. 327–340, Dec. 2017.
[43] C. Molnar. (2018). Interpretable Machine Learning. [Online]. Available: https://christophm.github.io/book/
[44] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[45] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[46] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[47] S. S. Keerthi and C.-J. Lin, “Asymptotic behaviors of support vector machines with Gaussian kernel,” Neural Comput., vol. 15, no. 7, pp. 1667–1689, Mar. 2003.
[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[49] M. J. Zaki and W. Meira, Jr., Data Mining and Machine Learning: Fundamental Concepts and Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2019.
[50] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), San Francisco, CA, USA, 2016, pp. 785–794.
[51] R. C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Mach. Learn., vol. 11, no. 1, pp. 63–90, Apr. 1993.
[52] I. Orlandi, L. Oneto, and D. Anguita, “Random forests model selection,” in Proc. Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., Bruges, Belgium, 2016, pp. 441–446.
[53] F. Hutter, H. Hoos, and K. Leyton-Brown, “An efficient approach for assessing hyperparameter importance,” in Proc. 31st Int. Conf. Mach. Learn. (ICML), Beijing, China, 2014, pp. 754–762.
[54] S. Bernard, L. Heutte, and S. Adam, “Influence of hyperparameters on random forest accuracy,” in Proc. Int. Workshop Multiple Classifier Syst., Reykjavik, Iceland, 2009, pp. 171–180.
[55] P. Probst, M. Wright, and A.-L. Boulesteix, “Hyperparameters and tuning strategies for random forest,” Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 9, no. 3, p. e1301, 2019.
[56] M. Kuhn, “Building predictive models in R using the caret package,” J. Statist. Softw., vol. 28, no. 5, pp. 1–26, 2008.
[57] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[58] M. Wainberg, B. Alipanahi, and B. J. Frey, “Are random forests truly the best classifiers?” J. Mach. Learn. Res., vol. 17, no. 1, pp. 3837–3841, 2016.
[59] D. H. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, no. 7, pp. 1341–1390, Oct. 1996.
[60] Y. Saeys, T. Abeel, and Y. V. D. Peer, “Robust feature selection using ensemble feature selection techniques,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases (ECML PKDD), Antwerp, Belgium, 2008, pp. 313–325.
[61] R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, “Variable selection using random forests,” Pattern Recognit. Lett., vol. 31, no. 14, pp. 2225–2236, Oct. 2010.
[62] Y. Qi, “Random forest for bioinformatics,” in Ensemble Machine Learning. Boston, MA, USA: Springer, 2012.
[63] R. Díaz-Uriarte and S. A. De Andrés, “Gene selection and classification of microarray data using random forest,” BMC Bioinf., vol. 7, no. 1, p. 3, Dec. 2006.
[64] D. Chicco and C. Rovelli, “Computational prediction of diagnosis and feature selection on mesothelioma patient health records,” PLoS ONE, vol. 14, no. 1, Jan. 2019, Art. no. e0208737.
[65] P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York, NY, USA: Springer, 2013.
[66] M. L. Calle and V. Urrea, “Letter to the editor: Stability of random forest importance measures,” Briefings Bioinf., vol. 12, no. 1, pp. 86–89, Jan. 2011.
[67] M. B. Kursa, “Robustness of random forest-based gene selection methods,” BMC Bioinf., vol. 15, no. 1, pp. 1–8, Dec. 2014.
[68] H. Wang, F. Yang, and Z. Luo, “An experimental study of the intrinsic stability of random forest variable importance measures,” BMC Bioinf., vol. 17, no. 1, p. 60, 2016.
[69] D. Sculley, “Rank aggregation for similar items,” in Proc. SIAM Int. Conf. Data Mining, Minneapolis, MN, USA, Apr. 2007, pp. 587–592.
[70] T. W. MacFarland and J. M. Yates, “Mann–Whitney U test,” in Introduction to Nonparametric Statistics for the Biological Sciences Using R. Berlin, Germany: Springer, 2016, pp. 103–132.
[71] P. E. Greenwood and M. S. Nikulin, A Guide to Chi–Squared Testing, vol. 280. Hoboken, NJ, USA: Wiley, 1996.
[72] D. J. Benjamin et al., “Redefine statistical significance,” Nature Hum. Behav., vol. 2, no. 1, pp. 6–10, 2018.
[73] C. R. Mehta and N. R. Patel, “Exact logistic regression: Theory and examples,” Statist. Med., vol. 14, no. 19, pp. 2143–2160, Oct. 1995.
[74] D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,” BMC Med. Informat. Decis. Making, vol. 20, no. 1, p. 16, Dec. 2020.
[75] D. Chicco, “Ten quick tips for machine learning in computational biology,” BioData Mining, vol. 10, no. 35, pp. 1–17, 2017.
[76] D. Chicco, M. J. Warrens, and G. Jurman, “The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and brier score in binary classification assessment,” IEEE Access, vol. 9, pp. 78368–78381, 2021.
[77] D. Chicco, N. Tötsch, and G. Jurman, “The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation,” BioData Mining, vol. 14, Feb. 2021, Art. no. 13.
[78] D. Chicco, V. Starovoitov, and G. Jurman, “The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment,” IEEE Access, vol. 9, pp. 47112–47124, 2021.
[79] P. C. Austin and E. W. Steyerberg, “Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers,” Statist. Med., vol. 33, no. 3, pp. 517–535, Feb. 2014.
[80] N. E. Breslow, L. P. Zhao, T. R. Fears, and C. C. Brown, “Logistic regression for stratified case–control studies,” Biometrics, vol. 44, no. 3, pp. 891–899, 1988.
[81] J. H. Zar, “Spearman rank correlation,” in Encyclopedia Biostatistics, vol. 7. Hoboken, NJ, USA: Wiley, 2005.
[82] F. J. Brandenburg, A. Gleißner, and A. Hofmeier, “Comparing and aggregating partial orders with Kendall tau distances,” in Proc. 6th Int. Workshop Algorithms Comput. (WALCOM). Dhaka, Bangladesh: Springer, 2012, pp. 88–99.
[83] D. Chicco, E. Ciceri, and M. Masseroli, “Extended Spearman and Kendall coefficients for gene annotation list correlation,” in Proc. 11th Int. Meeting Comput. Intell. Methods Bioinf. Biostatistics (CIBB), in Lecture Notes in Computer Science, vol. 8623. Cambridge, U.K.: Springer, 2015, pp. 19–32.
[84] D. Clayton and J. Cuzick, “Multivariate generalizations of the proportional hazards model,” J. Roy. Stat. Soc., A, General, vol. 148, no. 2, pp. 82–108, 1985.
[85] P. A. Flach, “Classifier calibration,” in Encyclopedia of Machine Learning and Data Mining. Berlin, Germany: Springer, 2016.
[86] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), 2017, pp. 4768–4777.
[87] L.-T. Zhou, S. Qiu, L.-L. Lv, Z.-L. Li, H. Liu, R.-N. Tang, K.-L. Ma, and B.-C. Liu, “Integrative bioinformatics analysis provides insight into the molecular mechanisms of chronic kidney disease,” Kidney Blood Pressure Res., vol. 43, no. 2, pp. 568–581, 2018.


[88] Z. Zuo, J.-X. Shen, Y. Pan, J. Pu, Y.-G. Li, X.-H. Shao, and W.-P. Wang, “Weighted gene correlation network analysis (WGCNA) detected loss of MAGI2 promotes chronic kidney disease (CKD) by podocyte damage,” Cellular Physiol. Biochem., vol. 51, no. 1, pp. 244–261, 2018.
[89] C.-Y. Ho, T.-W. Pai, Y.-C. Peng, C.-H. Lee, Y.-C. Chen, Y.-T. Chen, and K.-S. Chen, “Ultrasonography image analysis for detection and classification of chronic kidney disease,” in Proc. 6th Int. Conf. Complex, Intell., Softw. Intensive Syst. (CISIS), Palermo, Italy, Jul. 2012, pp. 624–629.

DAVIDE CHICCO received the Bachelor of Science and Master of Science degrees in computer science from the Università di Genova, Genoa, Italy, in 2007 and 2010, respectively, and the Ph.D. degree in computer engineering from the Politecnico di Milano University, Milan, Italy, in Spring 2014. He also spent a semester as a Visiting Doctoral Scholar with the University of California Irvine, USA. From September 2014 to September 2018, he was a Postdoctoral Researcher with the Princess Margaret Cancer Centre and a Guest with the University of Toronto. From September 2018 to December 2019, he was a Scientific Associate Researcher with the Peter Munk Cardiac Centre, Toronto, ON, Canada. From January 2020 to January 2021, he was a Scientific Associate Researcher with the Krembil Research Institute, Toronto. In January 2021, he started to work as a Scientific Research Associate with the Institute of Health Policy Management and Evaluation, University of Toronto.

CHRISTOPHER A. LOVEJOY received the bachelor’s degree in medicine from the University of Cambridge, U.K., and the master’s degree in data science and machine learning from University College London, U.K. He is currently a Medical Doctor with interests in applied machine learning and bioinformatics.

LUCA ONETO received the Bachelor of Science and Master of Science degrees in electronic engineering from the Università di Genova, Italy, in 2008 and 2010, respectively, and the Ph.D. degree from the School of Sciences and Technologies for Knowledge and Information Retrieval, Università di Genova, in 2014, with the thesis entitled Learning Based on Empirical Data. In 2017, he obtained the Italian National Scientific Qualification for the role of an Associate Professor in computer engineering, and in 2018, he obtained the one in computer science. He worked as an Assistant Professor in computer engineering with the Università di Genova, from 2016 to 2019, where he is currently an Associate Professor in computer engineering. In 2018, he was a Co-Founder of ZenaByte s.r.l., spin-off company. In 2019, he obtained the Italian National Scientific Qualification for the role of a Full Professor in computer science and computer engineering. In 2019, he became an Associate Professor in computer science with the Università di Pisa. His first main topic of research is the statistical learning theory with particular focus on the theoretical aspects of the problems of (semi) supervised model selection and error estimation. His second main topic of research is data science with particular reference to the problem of trustworthy AI and the solution of real world problems by exploiting and improving the most recent learning algorithms and theoretical results in the fields of machine learning and data mining. He has been involved in several Horizon 2020 projects (S2RJU, ICT, and DS) and awarded with the Amazon AWS Machine Learning and Somalvico (Best Italian Young AI Researcher) Awards.
