
Dutta 2020


Expert Systems With Applications 159 (2020) 113408

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

An efficient convolutional neural network for coronary heart disease prediction
Aniruddha Dutta a,b, Tamal Batabyal c,d,∗, Meheli Basu e, Scott T. Acton c,f
a Department of Pathology & Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
b Haas School of Business, University of California, Berkeley, CA 94720, USA
c Department of Electrical & Computer Engineering, University of Virginia, VA 22904, USA
d Department of Neurology, School of Medicine, University of Virginia, VA 22904, USA
e Katz Graduate School of Business, University of Pittsburgh, PA 15260, USA
f Department of Biomedical Engineering, University of Virginia, VA 22904, USA

Article info

Article history:
Received 3 September 2019
Revised 17 March 2020
Accepted 24 March 2020
Available online 21 May 2020

Keywords:
Coronary heart disease
Machine learning
LASSO regression
Convolutional neural network
Artificial Intelligence
NHANES

Abstract

This study proposes an efficient neural network with convolutional layers to classify significantly class-imbalanced clinical data. The data is curated from the National Health and Nutritional Examination Survey (NHANES) with the goal of predicting the occurrence of Coronary Heart Disease (CHD). While the majority of the existing machine learning models that have been used on this class of data are vulnerable to class imbalance even after the adjustment of class-specific weights, our simple two-layer CNN exhibits resilience to the imbalance with fair harmony in class-specific performance. Given a highly imbalanced dataset, it is often challenging to simultaneously achieve a high class 1 (true CHD prediction rate) accuracy along with a high class 0 accuracy, as the test data size increases. We adopt a two-step approach: first, we employ least absolute shrinkage and selection operator (LASSO) based feature weight assessment followed by majority-voting based identification of important features. Next, the important features are homogenized by using a fully connected layer, a crucial step before passing the output of the layer to successive convolutional stages. We also propose a training routine per epoch, akin to a simulated annealing process, to boost the classification accuracy.

Despite a high class imbalance in the NHANES dataset, the investigation confirms that our proposed CNN architecture has the classification power of 77% to correctly classify the presence of CHD and 81.8% to accurately classify the absence of CHD cases on a testing data, which is 85.70% of the total dataset. This result signifies that the proposed architecture can be generalized to other studies in healthcare with a similar order of features and imbalances. While the recall values obtained from other machine learning methods, such as SVM and random forest, are comparable to that of our proposed CNN model, our model predicts the negative (Non-CHD) cases with higher accuracy. Our model architecture exhibits a way forward to develop better investigative tools, improved medical treatment and lower diagnostic costs by incorporating a smart diagnostic system in the healthcare system. The balanced accuracy of our model (79.5%) is also better than individual accuracies of SVM or random forest classifiers. The CNN classifier results in high specificity and test accuracy along with high values of recall and area under the curve (AUC).

© 2020 Elsevier Ltd. All rights reserved.

∗ Corresponding author at: Department of Neurology, School of Medicine, University of Virginia, VA 22904, USA.
E-mail address: tb2ea@virginia.edu (T. Batabyal).

https://doi.org/10.1016/j.eswa.2020.113408
0957-4174/© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Heart disease is a leading cause of death today, with coronary heart disease (CHD) being the most common form of cardiovascular disease that accounts for approximately 13% of deaths in the US (Benjamin, 2019). Timely diagnosis of heart disease is crucial in reducing health risk and preventing cardiac arrests. An American Heart Association study projects an almost 100% increase in CHD cases by 2030 (Benjamin, 2019; Roger, 2010). Major risk factors such as smoking, hypertension, hyper cholesterol and diabetes have been studied in connection to CHD (Ahmed et al., 2017; Burke et al., 1997; Celermajer et al., 1993; Chobanian et al., 2003; Haskell et al., 1994; Kannel, 1996, 1971; Stamler, Vaccaro, Neaton, and Wentworth, 1993; Vasan et al., 2001; Zeiher, Drexler, Saurbier, and Just, 1993). Ahmed et al. (2017) show that Body Mass Index (BMI) and systolic blood pressure are the two most critical factors affecting hypertension. Fava et al. (2013) conclude significant association between age, sex, BMI and heart rate with hypertension. Studies in general population indicate that high level of creatinine in blood can increase the risk of CHD (Irie et al., 2006; Wannamethee, Shaper, and Perry, 1997). Additionally, blood cholesterol and glycohaemoglobin levels are found to be persistently and significantly high in patients with CHD (Burchfiel, Tracy, Chyou, and Strong, 1997; Meigs et al., 1997). Several researchers have used statistical and machine learning models on echocardiography images (Madani, Arnaout, Mofrad, and Arnaout, 2018; Nakanishi et al., 2018) and electrocardiography signals (Jin, Sun, and Cheng, 2009; Shen et al., 2016) to predict clinically significant parameters related to CHD in patients, such as heart rate and axis deviation. Boosted algorithms such as gradient boost and logit boost have been used in literature to predict FFR and cardiovascular events (Goldstein, Navar, and Carter, 2017; Weng, Reps, Kai, Garibaldi, and Qureshi, 2017). Frizzell et al. and Mortazavi et al. built prediction models to determine the presence of cardiovascular disease using the 30-day readmission electronic data for patients with heart failure. The reported C-statistic of the models varied from 0.533 to 0.628, showing an improvement in prediction with the machine learning approach over traditional statistical methods.

Numerous risk factor variables often make the prediction of CHD difficult, which in turn, increases the cost of diagnosis and treatment. In order to resolve the complexities and cost of diagnosis, advanced machine learning models are being widely used by researchers to predict CHD from clinical data of patients. Kurt, Ture, and Kurum (2008) compared prediction performances of a number of machine learning models including the multilayer perceptron (MLP) and radial basis function (RBF) to predict the presence of CHD in 1245 subjects (Kurt et al., 2008). The MLP was found to be the most efficient method, yielding an area under the receiver operating characteristic (ROC) curve of 0.78. Kahramanli and Allahverdi (2008), Shilaskar and Ghatol (2013), Haq, Li, Memon, Nazir, and Sun (2018) proposed a hybrid forward selection technique wherein they were able to select smaller subsets and increase the accuracy of the presence of cardiovascular disease with reduced number of attributes. Several other groups have reported techniques, such as artificial neural network (ANN), fuzzy logic (FL) and deep learning (DL) methods to improve heart disease diagnosis (Das, Turkoglu, & Sengur, 2009; Olaniyi, Oyedotun, & Khashman, 2015; Uyar, 2017; Venkatesh, 2017). Nonetheless, in most of the previous studies, the patient cohort was limited to a few thousand with limited risk factors.

We propose an efficient neural network with convolutional layers using the NHANES dataset to predict the occurrence of CHD. A complete set of clinical, laboratory and examination data are used in the analysis along with a feature selection technique by LASSO regression. Data preprocessing is performed using LASSO followed by a feature voting and elimination technique. The performance of the network is compared to several existing traditional ML models in conjunction with the identification of a set of important features for CHD. Our architecture is simple in design, elegant in concept, sophisticated in training schedule, effective in outcome with far-reaching applicability in problems with unbalanced datasets. Our research contributes to the existing studies in three primary ways: 1) our model uses a variable elimination technique using LASSO and feature voting as preprocessing steps; 2) we leverage a shallow neural network with convolutional layers, which improves CHD prediction rates compared to existing models with comparable subjects (the ‘shallowness’ is dictated by the scarcity of class-specific data to prevent overfitting of the network during training); 3) in conjunction with the architecture, we propose a simulated annealing-like training schedule that is shown to minimize the generalization error between train and test losses.

It is important to note that our work is not intended to provide a sophisticated architecture using a neural network. We also do not focus on providing theoretical explanation on how our network offers resistance to data imbalance. Instead, our goal is to establish that under certain constraints one can apply convolutional stages despite the scarcity of data and the absence of well-defined data augmentation techniques and to show that the shallow layers of convolution indeed offer resilience to the data imbalance problem by dint of a training schedule. The proposed pipeline contributes to improving CHD prediction rates in imbalanced clinical data, based on a robust feature selection technique using LASSO and shallow convolutional layers. This serves to improve prediction algorithms included in smart healthcare devices where sophisticated neural algorithms can learn from past user data to predict the probability of heart failure and strokes. Prediction rates could be integrated in healthcare analytics to provide real time monitoring which not only benefits the patients but also medical practitioners for efficient operations. The present research also focuses on a systematic training schedule which can be incorporated in smart devices to improve tracking of different predictor variable levels for heart failure. The rest of the paper is organized as follows: Section 2 explains data preparation and the preprocessing techniques. In Section 3, we illustrate the convolutional neural network architecture with details on the training and testing methodology. In Section 4 we demonstrate the results obtained from our model with performance evaluation metrics and compare it with existing models. Section 5 is the conclusion and discussion section. Here, several extensions to the research are proposed.

2. Data preprocessing

Our study uses the NHANES data from 1999–2000 to 2015–2016. The dataset is compiled by combining the demographic, examination, laboratory and questionnaire data of 37,079 (CHD – 1300, Non-CHD – 35,779) individuals as shown in Fig. 1. Demographic variables include age and gender of the survey participants at the time of screening. Participant weight, height, blood pressure and body mass index (BMI) from the examination data are also considered as a set of risk factor variables to study their effect on cardiovascular diseases. NHANES collects laboratory and survey data from participants once in every two years depending on their age and gender. In addition, based on the already existing validated experimental research, a comprehensive list of risk factor variables is selected from the laboratory tests conducted. Questionnaire data comprises of questions asked at home by interviewers using a Computer-Assisted Personal Interview (CAPI) system as mentioned in the NHANES website (NHANES, 2015). A total of 5 dichotomous predictor categorical variables are selected from the questionnaire data which have been shown to affect CHD (references, required). In all, 30 continuous and 6 categorical independent variables are used to predict the likelihood of coronary heart disease. For this study, coronary heart disease (CHD) is used as the dichotomous dependent variable. Awareness of CHD is defined as “yes” response to the question “Have you been ever told you had coronary heart disease?” Table 1 shows the categorical independent and dependent variables in the dataset considered for model development.

The exhaustive list of variables is: gender, age, annual-family-income, ratio-family-income-poverty, 60 s pulse rate, systolic, diastolic, weight, height, body mass index, white blood cells, lymphocyte, monocyte, eosinophils, basophils, red blood cells, hemoglobin,

Fig. 1. Data compilation from National Health and Nutritional Survey (NHANES). The data is acquired from 1999 to 2016 in three categories – Demography, Examination and
Laboratory. Based on the nature of the factors that are considered, the dataset contains both the quantitative and the qualitative variables.

Table 1.
Description of the risk factor independent variables and the dependent variable.

Variable Name (age range) | Description | Code | Meaning
Gender | Gender of the participant | 1 | Male
 | | 2 | Female
Vigorous Activity (12 years and above) | Vigorous activity in last one week or 30 days | 1 | Yes
 | | 2 | No
 | | 3 | Unable to do activity
Moderate Activity (12 years and above) | Moderate activity in last one week or 30 days | 1 | Yes
 | | 2 | No
 | | 3 | Unable to do activity
Diabetes (1 yr and above) | Doctor told that the participant has diabetes | 1 | Yes
 | | 2 | No
 | | 3 | Borderline
Blood Relative Diabetes (20 yrs and above) | Biological blood relatives ever told that they have diabetes | 1 | Yes
 | | 2 | No
Blood Relative Stroke (20 yrs and above) | Biological blood relatives ever told that they have hypertension or stroke before the age of 50 | 1 | Yes
 | | 2 | No
Coronary Heart Disease (20 yrs and above) | Ever told that the participant had coronary heart disease | 1 | Yes
 | | 2 | No

mean cell volume, mean concentration of hemoglobin, platelet count, mean volume of platelets, neutrophils, hematocrit, red blood cell width, albumin, alkaline phosphatase (ALP), aspartate aminotransferase (AST), alanine aminotransferase (ALT), cholesterol, creatinine, glucose, gamma-glutamyl transferase (GGT), cholesterol, creatinine, glucose, iron, iron, lactate dehydrogenase (LDH), phosphorus, bilirubin, protein, uric acid, triglycerides, total cholesterol, high-density lipoprotein (HDL), glycohemoglobin, vigorous-work, moderate-work, health-Insurance, diabetes, blood related diabetes, and blood related stroke. However, in this list of variables, there are a couple of linearly dependent variables in terms of their nature of acquisition or quantification and some uncorrelated variables (annual family income, height, ratio of family income-poverty, 60 s pulse rate, health insurance, lymphocyte, monocyte, eosinophils, total cholesterol, mean cell volume, mean concentration of hemoglobin, hematocrit, segmented neutrophils). We do not consider these variables for subsequent processing and analysis.

3. Proposed architecture

3.1. LASSO shrinkage and majority voting

LASSO or least absolute shrinkage and selection operator is a regression technique for variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model it produces. In LASSO, data values are shrunk toward a central point, and this algorithm helps in variable selection and parameter elimination. This type of regression is well-suited for models with high multicollinearity. LASSO regression adds a penalty equal to the absolute value of the magnitude of coefficients, and some coefficients can become zero and are eventually eliminated from the model. This results in variable elimination and hence models with fewer coefficients. LASSO solutions are quadratic problems and the goal of the algorithm is to minimize:

\[
\sum_{i=1}^{n} \Big( y_i - \sum_{j} x_{ij}\,\gamma_j \Big)^{2} + \lambda \sum_{j=1}^{p} |\gamma_j| \qquad (1)
\]

which is the same as minimizing the sum of squares with constraint $\sum_j |\gamma_j| \le s$. Some of the γ values are shrunk to exactly zero, resulting in a regression model that’s easier to interpret. A tuning parameter, λ, which is the amount of shrinkage, controls the strength of the regularization penalty. When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear regression. As λ increases, more coefficients are set to zero and eliminated. As λ increases, bias increases and as λ decreases, variance increases. The model intercept is usually left unchanged. The γ value for a variable (factor) can be interpreted as the importance of the variable in terms of how it contributes to the underlying variation in the data. The variable with a zero γ is considered unimportant. It is to note that LASSO shows misleading results in case of data imbalance, which may prompt incorrect selection of important variables if we perform LASSO on the entire dataset.

Note that the variables in our dataset are mixed data type – a subset of them are categorical. In this work, as a standard practice, we use group LASSO and we refer to it as LASSO for simplicity. In order to mitigate the effect of imbalance, we adopt a strategy to randomly subsample the dataset and iterate LASSO multiple times. Majority voting is performed on the set of γ values to identify the variables that are nonzero in major number of iterations. Let us assume that LASSO is performed N times on N randomly subsampled datasets, where each instance has equal number of examples in case of CHD and no-CHD. With 45 variables at hand, we obtain $\gamma_i = [\gamma_{i,1}\ \gamma_{i,2}\ \ldots\ \gamma_{i,45}]$ at the ith instance of LASSO. For any variable c, we count the number of instances in which the variable is non-zero, and with a manually set threshold, we decide the selection of that variable for further analysis. Mathematically,

\[
\chi(\gamma) = \begin{cases} 1 & \text{if } \gamma \neq 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
\big\| [\chi(\gamma_{1,c})\ \chi(\gamma_{2,c})\ \ldots\ \chi(\gamma_{N,c})] \big\|_{1} \ge N\alpha \;\Rightarrow\; c \text{ is selected} \qquad (2)
\]

3.2. Convolutional neural network (CNN)

The challenge of predicting the existence of CHD in patients refers to the task of binary classification. Under certain constraints, neural network (Goodfellow, Bengio, Courville, and Bengio, 2016; LeCun, Bengio, and Hinton, 2015) has been proven to be an effective parametric classifier under supervised settings. Recently, with the explosion of structured data, deep neural networks incorporating a large number of application-specific hidden layers, have demonstrated significant improvement in several areas including speech processing, applications involving image processing, and time series prediction (LeCun, 1995). There is a vast body of deep learning architectures that are fine-tuned and rigorously trained using big datasets. An artificial neural network (He, Zhang, Ren, and Sun, 2016; Iandola et al., 2016; Krizhevsky, Sutskever, and Hinton, 2012; Szegedy, Ioffe, Vanhoucke, and Alemi, 2017) successively transforms the input data over sequential hidden layers and estimates the error at the output layer. The error is back propagated to iteratively update the layer weights using gradient descent algorithm. Rigorous experimentations and analyses have proposed several improvements in the gradient descent algorithm, the nonlinearity of layers, overfitting reduction, training schedule, hidden layer visualization and other modifications. Despite resounding success in applications, the working principle of deep neural networks is still poorly understood. It is also found in practice that a deep neural network is extremely susceptible to be attacked by adversarial examples. In addition, owing to millions of parameters in a typical deep architecture, the trained network may be overfitted, especially in cases where there is scarcity of examples.

Among various algorithms that attempted to overcome this problem, data augmentation (Krizhevsky et al., 2012; Radford, Metz, and Chintala, 2015) is a widely used technique that artificially generates examples to populate small datasets. However, such a procedure is biologically implausible in most clinical datasets. For example, augmented measurements of a CHD phenotype, such as platelet count, might not correspond to possible readings of a subject. It is because the underlying principles of the statistical generation and the biological sources of platelet count readings may be fundamentally different. Poor training due to small or imbalanced datasets and susceptibility to adversarial examples lead to poor and unreasonable classification. Unlike many computer vision tasks, such as semantic labeling, chat-bot configuration, and hallucinogenic image synthesis (Mordvintsev, Olah, and Tyka, 2015), erroneous prediction in medical research is accompanied by a significant penalty.

For example, faulty prediction of a subject having chronic CHD may leave the subject untreated or misdirect the possible therapeutic medication. Therefore, one of the prime objectives of this paper is to improve classification accuracy, i.e. the prediction accuracy of the subjects with and without the presence of CHD. There are several other relevant concerns related to misclassification in medical research (Marcus, 2018). To overcome these limitations and driven by the success of deep networks, we propose a shallow convolutional neural network, where the convolution layers are ‘sandwiched’ between two fully connected layers as shown in Fig. 2.

3.3. CNN architecture

The architecture is a sequential one-input-one-output feedforward network. For simplicity, we assume the class of subjects with presence of CHD as class ‘1’ and the subjects with absence of CHD as class ‘0’. As mentioned in the previous section, the number of active phenotypes of CHD obtained from majority voting is 50. Let the number of training examples be N, which indicates that the input layer in Fig. 2 has dimension of $\mathbb{R}^{N \times 50}$. The dense or fully connected layers, consisting of 64 neurons collectively act as a linear combiner of the 50 variables and bias, which effectively homogenizes different variable types before nonlinear transformation. The nonlinear transformation is carried out by rectified linear unit (ReLU). Dropout with 20% probability is performed to reduce overfitting. Following the fully connected layer, there is a cascade of convolution layers. In the first convolution layer, there are two filters of kernel width 3 and stride 1. The layer is not provided with external zero-padding. In the pooling layer, we rigorously experiment with different pooling strategies and find average pooling working marginally better than max pooling under all constraints. The first convolution layer converts the output of fully connected block $\in \mathbb{R}^{N \times 64}$ to a tensor of dimension $\mathbb{R}^{N \times 64 \times 1}$. The tensor is then subjected to batch normalization, nonlinear transformation and average pooling with an output tensor of dimension $\mathbb{R}^{N \times 31 \times 2}$.

The filters in the last no-zero-padded convolution layers are taken with kernel 5 and stride 1, delivering an output tensor of $\mathbb{R}^{N \times 13 \times 4}$ to the next dense layer after the average pooling layers. The categorical output is observed at the end of the softmax layer, where we set the loss function as the categorical cross-entropy loss. The bias in each layer is initialized with random numbers drawn from a truncated normal distribution with variance $1/\sqrt{n}$, where n is the number of ‘fan-in’ connections to the layer. We use

Fig. 2. Proposed convolutional neural network architecture. The ‘Input’ is a 1D numerical array corresponding to all the factors/variables from LASSO-Majority Voting pre-
processing stage. The ‘Dense’ layer, immediately after the ‘Input’, combines all the factors and each neuron (computing node) at the output of ‘Dense’ layer is a weighted
combination of all the variables, indicating a homogeneous mix of different variable types. The next two convolution layers seek representation of the input variables via the
‘Dense’ layer. The next two ‘Dense’ layers are followed by the ‘Softmax’ layer. The last two ‘Dense’ layers (before the ‘Softmax’ layer) can be retrained for transfer learning
in case new data is obtained. The associated training parameters, such as dropout probability, number of neurons, activation function (we used ReLU), pooling types, and
convolution filter parameters are shown in the above figure. Owing to the large number of parameters that can lead to overfitting of training data points, we propose a
training schedule in Section 3.3.1.
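The per-example tensor sizes quoted for this architecture (31 × 2 after the first convolution and pooling stage, 13 × 4 after the second) can be checked with simple arithmetic. The following Python sketch is only an illustration: it assumes a pooling window of 2 with stride 2, which the text does not state, and the helper names `conv1d_valid_len`, `pool1d_len` and `shape_walk` are hypothetical.

```python
def conv1d_valid_len(length, kernel, stride=1):
    """Output length of a 1-D convolution without zero-padding."""
    return (length - kernel) // stride + 1


def pool1d_len(length, pool=2):
    """Output length of non-overlapping 1-D pooling.

    The pool size of 2 is an assumption; the paper does not give it.
    """
    return length // pool


def shape_walk(dense_units=64):
    """Trace (length, channels) per example through the two conv stages."""
    shapes = [("dense", (dense_units, 1))]

    # First convolution: 2 filters, kernel width 3, stride 1, no padding.
    length = conv1d_valid_len(dense_units, kernel=3)
    shapes.append(("conv1", (length, 2)))
    length = pool1d_len(length)
    shapes.append(("avg-pool 1", (length, 2)))

    # Second convolution: 4 filters, kernel width 5, stride 1, no padding.
    length = conv1d_valid_len(length, kernel=5)
    shapes.append(("conv2", (length, 4)))
    length = pool1d_len(length)
    shapes.append(("avg-pool 2", (length, 4)))
    return shapes


for name, shape in shape_walk():
    print(f"{name:12s} {shape}")
```

Under this assumption the lengths run 64 → 62 → 31 → 27 → 13, which agrees with the dimensions reported in Section 3.3.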

Adam optimizer with learning rate 0.005, β_1 = 0.9, β_2 = 0.999 and zero decay. Our proposed architecture consists of 32,642 trainable and 1164 non-trainable parameters. We experiment with several hyperparameters that are associated with our model to obtain consistent class-wise accuracy. We provide results by varying subsampling of input data, epochs, class-weights, the number of neurons in each dense layer except the last one, and the number of filters in each convolution layer during training.

3.3.1. Training schedule

During training, the class weight ratio, which is adjusted as a penalty factor due to class imbalance, is defined as the ratio of CHD and Non-CHD dataset. For example, a class weight ratio of 10:1 indicates that any misclassification of a CHD training sample will be penalized 10 times more than a misclassified Non-CHD sample during the error calculation at the output prior to backpropagation after each epoch. Although, we use dropout layers in our CNN model, we also use this training schedule in order to further reduce possible overfitting. The intuition is to initially train the model with 1:N weight ratio for sufficiently large number of epochs and then, gradually increase the weight ratio with a steady decline in the number of epochs. Let the actual class weight ratio is ρ_0 : 1, which we take as a factor ρ_0.

Fitting our CNN model, M, by varying the number of epochs (ω) and weight ratio (ρ)
1. Initialize ρ = N, ω (large number, we set as N), M, end_iter (5–10 depending on the instance), i = 1
2. While ρ ≤ ρ_0
     M.fit(Data, ω, ρ)
     ρ ← floor(ρ/2)
     ω ← floor(ω/2)
3. While (i ≤ end_iter) and Trainloss(i) ≤ Trainloss(i − 1)
     M.fit(Data, i, ρ_0)

3.4. Competitive approaches

Machine learning classification methods have shown to potentially improve prediction outcomes in coronary heart disease. Such classification methods include logistic regression, support vector machines, random forests, boosting methods and multilayer perceptron (Goldstein et al., 2017). Logistic regression models the prediction of a binomial outcome with one or more explanatory variables, using a standard logistic function which measures the relationship between the categorical dependent variable and one or more independent variables by estimating the probabilities. The logistic function is given by, $f(x) = \frac{1}{1 + e^{-x}}$, which in common practice is known as the sigmoid curve. Support vector machine (SVM) is a binary classification algorithm which generates a (N-1) dimensional hyperplane to separate points into two distinct classes in an N dimensional plane. The classification hyperplane is constructed in a high dimensional space that represents the largest separation, or margin, between the two classes.

Random forests are an ensemble learning algorithm where decision trees that grow deep are averaged and trained on different parts of the training set to reduce variance and avoid overfitting. Random forests algorithm employs bagging or bootstrap aggregating and at each split a random subset of features are selected. Bagging is a parallel ensemble because each model is built independently. Boosting on the other hand is a sequential ensemble where each model is built based on correcting the misclassifications of the previous model. In boosting methods, the weights are initialized on training samples and for n iterations, a classifier is trained using a single feature and training error evaluated. Then the classifier with the lowest error is chosen and the weights are updated accordingly; the final classifier is formed as a linear combination of n classifiers. A boost classifier is in the form, $F_T(x) = \sum_{t=1}^{T} f_t(x)$, where each $f_t$ is a weak learner with x as input. Each weak learner produces an output hypothesis, $h(x_i)$, for each sample in the training set. At each iteration t, a weak learner is selected and assigned
ing set. At each iteration t, a weak learner is selected and assigned

Fig. 3. Correlation table for the independent predictor variables. In this table, moderately strong correlations among few pairs are observed (Glucose and Glycohemoglobin,
Red blood cells and Hemoglobin, ALT and AST, Weight and Body-Mass-Index). Rest of the pairs show fairly low correlation values, implying the variables after the LASSO-
Majority voting stage are sufficiently decorrelated.
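The variables in this correlation table are those that survive the LASSO-majority-voting stage. The selection rule of Eq. (2) in Section 3.1 can be illustrated with a short Python sketch. It assumes the N LASSO coefficient vectors have already been computed on the balanced subsamples; the function name `select_by_majority_vote`, the tolerance `tol`, and the toy coefficient values are hypothetical and not taken from the paper.

```python
def select_by_majority_vote(gammas, alpha, tol=0.0):
    """Return indices of variables selected by the rule in Eq. (2).

    gammas: list of N coefficient vectors, one per LASSO run.
    alpha:  manually set threshold in (0, 1]; variable c is kept when it is
            nonzero in at least N * alpha of the runs.
    tol:    magnitude at or below which a coefficient counts as zero
            (an assumption for floating-point solvers).
    """
    n_runs = len(gammas)
    n_vars = len(gammas[0])
    selected = []
    for c in range(n_vars):
        # chi(gamma) = 1 if gamma != 0, else 0; the vote count is the 1-norm.
        votes = sum(1 for g in gammas if abs(g[c]) > tol)
        if votes >= n_runs * alpha:
            selected.append(c)
    return selected


# Toy example: 4 LASSO runs over 3 variables (made-up coefficients).
toy_gammas = [
    [0.5, 0.0, -0.2],
    [0.3, 0.0, 0.0],
    [0.0, 0.1, -0.4],
    [0.7, 0.0, -0.1],
]
print(select_by_majority_vote(toy_gammas, alpha=0.5))  # [0, 2]
```

In the toy data, variables 0 and 2 are each nonzero in three of the four runs and pass the threshold N·α = 2, while variable 1 is nonzero only once and is dropped.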

a coefficient α_t such that the sum training error E_t of the resulting t-stage boost classifier is minimized. A multilayer perceptron (MLP) is a feedforward artificial neural network (ANN) which consists of an input layer, an output layer and one or more hidden layers and utilizes backpropagation for training the data. The MLP commonly uses a nonlinear activation function which maps the weighted inputs to the output of each neurons in the hidden layers. In an MLP, the connection weights are changed based on the error between the generated output and expected result. Two of the most common activation functions are the rectified linear unit (ReLU), $f(x) = x^{+}$, and the hyperbolic tangent, $y(x_i) = \tanh(x_i)$.

Data augmentation demands attention in the context of data imbalance. Algorithms, such as random oversampling (ROS), synthetic minority over-sampling technique (SMOTE) (Chawla, Bowyer, Hall, and Kegelmeyer, 2002), and adaptive synthetic sampling (ADASYN) (He, Bai, Garcia, and Li, 2008) augment the minority class data by either replicating or synthesizing new data. One pertinent issue with regard to synthesized data is that, unlike images, the data in the context of biological factors (variables) may be implausible as it is difficult to verify the authenticity of newly augmented data, especially when both classes of data are closely spaced. In this paper, we provide comparative results by using the above algorithms. In addition, we provide the corresponding visualization of the augmented data via t-SNE. Please keep in mind that t-SNE is a non-convex algorithm, generating embedding that depends on the initialization low-dimensional embedding. We employ random undersampling strategy to select a subset of data for training the CNN. Similar to data augmentation, there are several data undersampling strategies. We compare our results with edited nearest neighbor (EDN) (Wilson, 1972), instance hardness threshold (IHT) (Smith, Martinez, and Giraud-Carrier, 2014) and three versions of near-miss (NM-v1, v2 and v3) (Mani, 2003) algorithms.
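As a rough illustration of random undersampling, the Python sketch below keeps every CHD example and draws a fixed number of Non-CHD examples at random. It is not the authors' code: the function name `random_undersample`, the fixed seed, and the choice to sample without replacement are assumptions. The class counts (1300 CHD, 35,779 Non-CHD) and the 1300:4000 sampling ratio are the figures reported in the paper.

```python
import random


def random_undersample(labels, n_majority, seed=0):
    """Return shuffled indices of a randomly undersampled training subset.

    labels:     sequence of class labels (1 = CHD, 0 = Non-CHD).
    n_majority: number of class-0 examples to keep.
    seed:       RNG seed, fixed here only for reproducibility (an assumption).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Draw majority-class indices without replacement (assumed here).
    kept = rng.sample(majority, n_majority)
    indices = minority + kept
    rng.shuffle(indices)
    return indices


# Class counts from Section 2; sampling ratio 1300:4000 from Section 4.
labels = [1] * 1300 + [0] * 35779
train_idx = random_undersample(labels, n_majority=4000)
print(len(train_idx))  # 5300
```

The remaining 31,779 examples (37,079 minus 5300) are then available as the held-out test cohort.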

4. Results

4.1. Summary statistics

For the purpose of variable selection in our classification model, we start out by investigating correlations among 30 continuous predictor variables. Correlation is found to be high (0.77) between serum alanine aminotransferase (ALT) and aspartate aminotransferase (AST), as shown in Fig. 3. However, AST is a major risk factor in the prediction of CHD, as has been reported in the literature (Shen et al., 2015). Jianying (2015) finds AST levels to be significantly higher in CHD patients than in the control group, and hence they can be used as biochemical markers to predict the severity of CHD. A high correlation of 0.89 was determined between body-mass-index and weight, which is expected, whereas the correlation between hemoglobin and red blood cells was 0.74. While research on the association of hemoglobin with clinically recognized CHD is limited (Chonchol, 2008), the role of red blood cells in CHD is well researched in the literature (Madjid, 2013). It has been reported that high blood glucose levels in non-diabetic patients can significantly increase the risk of developing CHD (Neilson, Lange, and Hadjokas, 2006), which is depicted in Fig. 3 by a high correlation value of 0.79 between glycohemoglobin and glucose. Further, we find a correlation of 0.46 between protein and albumin. Lower levels of serum albumin have been reported to be linked with increased levels of cardiovascular mortality as well as CHD (Shaper, Wannamethee, and Whincup, 2004), while a higher level of protein is reported to increase the risk of CHD (Clifton, 2011). Serum lactate dehydrogenase (LDH) is found to be correlated with AST (correlation coefficient of 0.41), consistent with previous studies which suggest that an increased value of LDH in the active population is associated with low risk of CHD (Kopel, Kivity, Morag-Koren, Segev, and Sidi, 2012). Because some of the correlated variables are important risk factors associated with CHD (as reported in existing literature), LASSO regression was performed to correctly determine the predictor variables for further analysis.

4.2. Model results

To identify variables that contribute to the variation in data, we apply LASSO to 100 instances of randomly sampled datasets, each containing 1300 examples of class CHD (positive class) and 1300 of class no-CHD (negative class). We set α in Eq. (2) as 6 and find that ALT, glucose, hemoglobin and body mass index fail to contribute significantly to the data variation, despite strong experimental evidence in the literature favoring those factors. From majority voting, some of the strong correlates are age, white blood cells, platelet count, red cell distribution width, cholesterol, LDH, uric acid, triglycerides, HDL, glycohemoglobin, gender, presence of diabetes, blood related stroke, and moderate and vigorous work.

To achieve the optimal number of features for training our CNN architecture, the threshold of the feature voting was kept in the range of 2 to 8. The highest accuracy obtained from training the network is 83.17%, with a training loss of 0.489, at a feature-voting threshold of 6, as shown in Fig. 4. The corresponding highest test accuracy obtained is 82.32%. LASSO regularization reduces to zero the coefficients of three of the continuous predictor variables (body-mass-index, glucose and ALT) and one categorical variable; these were found to be highly correlated, as seen in Section 4.1. With a threshold value of 6, the CNN architecture is trained separately on different subsampled data sets. Subsampling is performed in varying ratios, starting at 1300:13,000 (CHD: Non-CHD) and moving to 1300:4000, as shown in Table 2. The corresponding test accuracy is reported as 82.32%.

In each subsampled set, all CHD subjects are taken into consideration and the Non-CHD subjects are randomly chosen without replacement. For training purposes, the neural network is trained by varying the misclassification penalty from 10:1 (CHD: Non-CHD) to 3:1, as shown in Table 2. The maximum accuracy of 83.51% and minimum training loss of 0.489 are obtained while training with a sampling ratio of 1300:4000 (CHD: Non-CHD) and a misclassification penalty of 3:1 (CHD: Non-CHD). The trained network is tested on a cohort of the remaining 31,779 samples (85.70% of the whole dataset), and a test accuracy of 82.32% is obtained, as reported in Table 2. The optimal sampling ratio (1300:4000) of class 1 to class 0 is maintained for the final training of the network, as illustrated in Table 3.

Table 3 reports the final training of the CNN architecture with varying class (CHD: Non-CHD) weights to check for consistency of our results. As the class weights decrease, the training accuracy increases from 59.43% to 83.17%, at which point the difference between the training and test accuracies reaches its minimum, indicating a reasonable generalization error, as shown in Fig. 5. During training, the number of epochs, the number of neurons in each dense layer except the last one, and the number of filters in each convolution layer are varied to obtain the best fit of the model. We have fine-tuned several hyperparameters that are associated with our model to obtain consistent class-wise accuracy. The optimization is performed using Adam as the optimizer with a learning rate of 0.006 and 60 epochs, and no scheduling of the learning rate is used during the training. The best test accuracy obtained is 82.32%, where the penalty of misclassification of class 1 is set three times higher than that of class 0.

The performance of the proposed CNN classifier can be evaluated from the confusion matrix in Table 4. We specify the classification parameters as follows:

– TP: true positive classification cases (true predictions for class 1, i.e., CHD cases predicted as CHD),
– TN: true negative classification cases (true predictions for class 0, i.e., non-CHD cases predicted as non-CHD),
– FN: false negative classification cases (false predictions for class 1, i.e., CHD cases predicted as non-CHD),
– FP: false positive classification cases (false predictions for class 0, i.e., non-CHD cases predicted as CHD).

Two commonly applied performance rates are the true positive rate (TPR), the accuracy of predicting CHD (class 1), and the true negative rate (TNR), the accuracy of predicting non-CHD (class 0). The detailed values of TPR, TNR, train accuracy, test accuracy and training loss for all class weights are given in Table 3.

In the present study, it is our objective for our classifier to predict the presence of CHD with higher accuracy than in previous studies. The recall (sensitivity) for class 1 (presence of CHD) is 77%, while the specificity for class 0 (absence of CHD) is 81%. The CNN classifier has been tested on 31,779 subjects while maintaining almost the same TPR and TNR, in contrast to previously reported studies, which have considered significantly smaller samples.

Partitioning a highly imbalanced dataset poses many challenges and incurs unavoidable biases on a classifier's performance. While the state of the art advises keeping the same ratio of class-specific data in training and testing, in a highly imbalanced dataset it is often very challenging to get a high class 1 accuracy (true CHD prediction rate) as the testing data size increases. The present method confirms that our proposed CNN architecture correctly classifies 77% of the CHD cases in a test set that is 85.70% of the total dataset. This result suggests that the proposed architecture can be generalized to other studies in healthcare with a similar order of features and imbalances.
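The repeated-LASSO voting step of Section 4.2 can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`balanced_subsample`, `vote_features`), the toy coefficients and the rule "keep a variable if its coefficient is nonzero in at least `min_votes` runs" are hypothetical, since this section does not spell out the exact voting rule. The LASSO fits themselves are assumed to come from any standard solver and are represented here only by their coefficient vectors.

```python
import random
from collections import Counter


def balanced_subsample(chd_ids, non_chd_ids, n_per_class, rng):
    """Draw n_per_class ids from each class without replacement (1:1 ratio)."""
    return rng.sample(chd_ids, n_per_class), rng.sample(non_chd_ids, n_per_class)


def vote_features(coef_runs, feature_names, min_votes, tol=1e-8):
    """Keep features whose LASSO coefficient is nonzero in at least min_votes runs.

    coef_runs: one coefficient vector per LASSO run (assumed to be produced
    elsewhere by a LASSO solver fitted on a balanced subsample).
    """
    votes = Counter()
    for coefs in coef_runs:
        for name, c in zip(feature_names, coefs):
            if abs(c) > tol:
                votes[name] += 1
    return [name for name in feature_names if votes[name] >= min_votes]


# Toy illustration: three hypothetical runs over four variables.
names = ["age", "cholesterol", "glucose", "ALT"]
runs = [
    [0.8, 0.3, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
    [0.9, 0.2, 0.1, 0.0],
]
selected = vote_features(runs, names, min_votes=2)

# Balanced draw from hypothetical subject ids.
rng = random.Random(0)
chd_sample, non_chd_sample = balanced_subsample(
    list(range(10)), list(range(100, 150)), 5, rng
)
```

Counting votes across many balanced subsamples, rather than trusting a single fit, is what lets the selection stay stable even though each 1:1 subsample sees only a small part of the majority class.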

Fig. 4. Model accuracy as a function of majority voting threshold. The threshold value of majority voting affects the classification accuracy of CHD as the selection of this
value controls the number of variables that are to be channeled to our CNN model. The smaller is the threshold value, the larger is the set of variables. Based on the training
loss, training accuracy and test accuracy, the threshold value between 16.67 (100/6) – 20 (100/5) combining 100 instances of LASSO appears suitable for obtaining balanced
per class (CHD and Non-CHD) classification accuracy.

Table 2.
Training schedule for increasing class weight ratio and sampling for optimal threshold. A maximum training accuracy of 83.51% and minimum training loss of 0.489 is achieved with a misclassification penalty of 3:1 (CHD: Non-CHD) and a sampling ratio of 1300:4000 (CHD: Non-CHD).

Class Weight   Sampling      TPR     TNR     Train Acc (%)   Test Acc (%)   Training Loss
10:1           1300:13,000   0.690   0.812   78.00           81.09          0.684
8:1            1300:10,000   0.706   0.827   77.28           82.67          0.676
6:1            1300:8000     0.741   0.799   80.00           79.90          0.595
5:1            1300:6000     0.735   0.80    80.48           79.96          0.548
3:1            1300:4000     0.778   0.823   83.51           82.32          0.489
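The class weights in Table 2 act as misclassification penalties. The text does not state how the penalty enters the loss, so the following is a minimal sketch under the assumption that each sample's binary cross-entropy term is multiplied by the weight of its true class (a common convention); the function name `weighted_bce` and the example probabilities are hypothetical.

```python
import math


def weighted_bce(y_true, p_pred, w_pos=3.0, w_neg=1.0, eps=1e-12):
    """Mean binary cross-entropy with per-class weights (CHD = 1, Non-CHD = 0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / len(y_true)


# With a 3:1 weight, a missed CHD case costs three times as much as an
# equally confident mistake on a Non-CHD case.
loss_missed_chd = weighted_bce([1], [0.2])
loss_false_alarm = weighted_bce([0], [0.8])
```

Under this assumption, lowering the weight ratio from 10:1 to 3:1 reduces the pressure to flag CHD at any cost, which is consistent with the rise in TNR across the rows of the table.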

The performance of our binary classifier is calculated by computing the ROC curve (Yang, Zhang, Lu, Zhang, and Kalui, 2017). The area under the curve (AUC) value in the ROC curve is the probability that our proposed CNN classifier ranks a randomly chosen positive case (CHD) higher than a randomly chosen negative case (Non-CHD) (Tom 2005). Thus, the ROC curve behaves as a tool to select the possible optimal models and to reject the suboptimal ones, independently from the class distribution. It does so by plotting parametrically the true positive rate (TPR) vs the false positive rate (FPR) at various threshold settings as shown in Fig. 4 (right). The calculated AUC is 0.767 or 76.7%, which is comparable to previous studies related to CHD (Martinez, Schwarcz, Valdez, and Diaz, 2018). In highly imbalanced data sets balanced accuracy is often considered to be a more accurate metric than normal ac-

Table 4.
Confusion matrix for the CNN classifier for coronary heart disease. Out of 208 coronary heart disease cases in the sample cohort, 161 cases were predicted correctly by the classifier. The proposed classifier also correctly predicts 25,828 cases where the patient did not report coronary heart disease.

Total Cohort                        Predicted Condition
True Condition                      Presence of CHD               Absence of CHD
Presence of CHD                     True Positive (TP) = 161      False Negative (FN) = 47
Absence of CHD                      False Positive (FP) = 5743    True Negative (TN) = 25,828
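The rates discussed around Table 4 follow directly from its four counts. The sketch below recomputes them; the function name `confusion_metrics` is hypothetical, and values computed from the raw counts may differ in the last digit from the rounded figures quoted in the text.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from a binary confusion matrix."""
    tpr = tp / (tp + fn)            # recall / sensitivity
    tnr = tn / (tn + fp)            # specificity
    fpr = fp / (fp + tn)            # fall-out (Type-I error rate)
    fnr = fn / (fn + tp)            # miss rate (Type-II error rate)
    return {
        "tpr": tpr,
        "tnr": tnr,
        "fpr": fpr,
        "fnr": fnr,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (tpr + tnr) / 2,
        "lr_positive": tpr / fpr,   # positive likelihood ratio
        "lr_negative": fnr / tnr,   # negative likelihood ratio
    }


# Counts taken from Table 4.
m = confusion_metrics(tp=161, fn=47, fp=5743, tn=25828)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the test cohort holds 208 CHD cases against 31,571 non-CHD cases, plain accuracy is dominated by the TNR; the balanced accuracy and the two likelihood ratios are less sensitive to that imbalance.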

Fig. 5. Training and test accuracies with varying misclassification penalties for class 1 and 0. The minimum difference between training and test accuracies is obtained with
a class weight of 3:1 (CHD: Non-CHD) and a training loss = 0.489. The model is trained with a constant optimized learning rate of 0.006 and 60 epochs.

Table 3.
Training schedule for increasing class weight ratio and optimal sampling ratio attained from table I. The difference between the training (83.17%) and test (82.32%) accuracies attain the minimum when the misclassification penalty of class 1 is set three times higher than class 0.

Class Weight   TPR     TNR     Train Acc (%)   Test Acc (%)   Training Loss
50:1           0.980   0.383   59.43           44.48          1.591
25:1           0.923   0.568   67.06           58.04          1.236
12:1           0.860   0.664   71.98           66.60          0.965
8:1            0.836   0.686   75.08           69.76          0.765
6:1            0.817   0.740   80.13           74.12          0.620
4:1            0.788   0.770   81.00           77.10          0.550
3:1            0.773   0.818   83.17           82.32          0.489

Table 5.
Comparison of machine learning models for coronary heart disease prediction. As compared to traditional machine learning models, our proposed model attains a recall value of 0.77 which is comparable to the SVM classifier. However, specificity (0.81) and test accuracy (0.82) of our model are significantly higher than the SVM classifier.

Model                 Recall (%)   Specificity (%)   Test Acc (%)   AUC
Logistic Regression   51.44        91.15             90.89          71.29
SVM                   77.40        77.87             77.87          77.64
Random Forest         76.44        76.06             76.06          76.25
AdaBoost              52.88        90.36             90.12          71.63
MLP                   66.34        78.88             78.80          72.61
Our model             77.3         81.8              81.78          76.78

curacy itself. The balanced accuracy of the model is determined to be (TPR + TNR)/2 = 0.795 or 79.5%. The fall-out rate or Type-I error of the model is 5743/31,571 = 18.2%, and the miss rate or Type-II error of the model is 47/208 = 22.6%. The positive likelihood ratio of the predicted model is 4.27, indicating that there is almost a 30% increase in probability post diagnosis in prediction of the presence of CHD in patients. A negative likelihood ratio of 0.27 was calculated, which signifies that there is approximately a 30% decrease in probability post diagnosis in prediction of absence of CHD in patients.

4.3. Comparison of ML models

4.3.1. Comparison with state-of-the-art ML models

Machine learning models discussed in Section 3.3 are implemented and tested on our test cohort. The prediction results from these methods are then compared with the results of our proposed CNN architecture. All models are implemented with optimized parameters and then compared based on the true positive rate (recall) and true negative rate (specificity). Corresponding test accuracies and AUC values are also determined. Logistic regression and AdaBoost classification result in the highest test accuracies, but these classifiers suffer from low recall values, recall being the true positive rate for coronary heart disease detection. While the recall values obtained from SVM and random forest are comparable to that of our proposed CNN model, our model predicts the negative (Non-CHD) cases with higher accuracy, as shown in Table 5. The balanced accuracy of our model (79.5%) is also higher than the individual accuracies of the SVM or random forest classifiers. An optimized two-layer multilayer perceptron resulted in a low recall value of 66.34% when tested on our test cohort. Results in Table 5 show that SVM and random forest classifiers perform better than the logistic, AdaBoost and MLP classifiers, but their specificity and test accuracy are significantly lower than those of our designed CNN classifier. The CNN classifier results in high specificity and test accuracy along with high values of recall and AUC.

These results confirm that the CNN classifier outperforms all existing commonly used machine learning models for coronary heart disease prediction in terms of accuracy in prediction for both CHD and Non-CHD classes.

4.3.2. Our LASSO-CNN vs vanilla CNN

Our experimental set up is generalized in the sense that setting αN as 0 (or, equivalently α = ∞) will select all the variables into consideration. Thus, applying vanilla CNN is equivalent to applying the model LASSO (∞)-CNN. Using the same subsampled dataset with 3:1 ratio of class samples, vanilla CNN yields 79.42% test accuracy, which is approximately 2% less than the average test accuracy that we obtain by applying LASSO (6)-CNN. Although this seems a marginal improvement, the number of additional samples that are accurately labeled by our model is 635 (≈ 31,779 × 0.02).

4.3.3. Data oversampling strategies

We compare our LASSO-CNN with the state-of-the-art oversampling algorithms. Note that the size of the testing dataset is remarkably small in oversampling cases, where the minority class is oversampled to match the cardinality of the majority class. In case of ROS and SMOTE, each class of the training data contains 32,013 samples and the test data has 3709 samples (CHD: 151 samples, no-CHD: 3558 samples). In case of ADASYN, the training data size is 64,134 (CHD: 32,121 samples, no-CHD: 32,013 samples) and the test data size is identical to ROS.

In all the cases, the ratio of class-specific samples (no-CHD:CHD) in the test data is approximately 24:1. From Fig. 6, it can be seen that, in all the three algorithms, the test accuracy of the no-CHD class strongly follows the overall test accuracy, indicating the predominant effect of data imbalance in the test data. Except in ADASYN, it is observed that the test accuracy increases with our model being trained using more epochs. The increase in the test accuracy is favored by the increase in the no-CHD accuracy at the expense of compromising CHD accuracy.

Overall, the behavior of LASSO+CNN using SMOTE is erratic, whereas trends in train accuracy, test accuracy and train loss can be clearly observed in cases of ROS and ADASYN. Although the test accuracy is higher in case of SMOTE when compared with ROS and ADASYN, the class-specific accuracies are balanced in case of ADASYN, yielding 79%, 79.14% and 75.5% for test accuracy, no-CHD accuracy and CHD accuracy respectively, with the class score variance of 3.06. In terms of class score variance, ROS trails behind ADASYN, scoring 78.67%, 80% and 74% with 9 as the class score variance. Class score variance measures how close the CHD and no-CHD accuracies are.

4.3.4. Data undersampling strategies

We train and test our LASSO+CNN model on several datasets that are undersampled by NM (v1, v2 and v3), IHT and EDN, as discussed in Section 3.3. In case of undersampled datasets, the size of the test dataset is large, as the size of the majority class (no-CHD) is reduced against the minority class (CHD). After applying an undersampling technique, the samples that are not considered in the train dataset are added to the test dataset. In case of EDN, the test data contains 6376 samples (CHD: 152 samples, no-CHD: 6224 samples), whereas in case of IHT, the test data has 24,995 samples (CHD: 152 samples, no-CHD: 24,843). Three versions of near-miss (NM) are employed. Each of versions 1 and 2 contains 31,200 data samples (CHD: 152 samples, no-CHD: 31,048 samples), whereas version 3 has 32,557 samples (CHD: 152 samples, no-CHD: 32,405 samples). In all of the NM versions, there are 4523 no-CHD and 1357 CHD training data, maintaining an approximately 3.5:1 data ratio to enforce an experimental set up similar to the one in which we report the best test accuracy of 82.32% by our model in Table 2.

Fig. 7 provides t-SNE visualization of undersampled data using the algorithms mentioned above.

It is evident that random subsampling and near-miss (v3) encompass the span of the data uniformly and better than the other algorithms that we consider. In fact, near-miss (v3) yields 74% no-CHD accuracy and 87% CHD accuracy, with the overall test accuracy of 74.59%. Fig. 8 depicts the class-wise accuracies by the undersampling methods on the test data. Although the test data size differs considerably between over- and undersampled data, note that the no-CHD accuracy drops significantly in case of undersampling, except in random undersampling (Table 4). No-CHD is the majority class, and the reason behind such an accuracy drop might be two-fold. Firstly, the reduced data size insufficiently captures the true population variation, thereby failing to generalize on the test data. This is true for all the undersampling strategies. The second reason depends on the constraint of the algorithm. For example, to obtain 1:1 class data for training, near-miss (v3) first selects m neighbors for every minority class sample, which collectively form the set S. Later, among the majority-class samples in S, the algorithm selects a subset in which each majority-class sample has the largest distance to its k nearest neighbors. The "largest" distance ensures the maximal separation (local margin) of classes, foreshadowing a potential loophole for the classification of majority samples which fall in between. Random subsampling does not have such constraints. A better strategy would be, at first, the selection of the data ratio, which we found to be 3:1 (no-CHD: CHD). This value depends on the dataset. Later, an ensemble of LASSO+CNN classifiers can be built for a set of randomly subsampled datasets. Testing of an unknown sample can be performed using majority voting.

4.4. Validation on stroke data

It might appear that the sequentially arranged, multiple convolutional layers in our proposed model offer resistance to data imbalance only for the CHD data provided by NHANES. To check whether our model is resilient to other imbalanced datasets containing 1D measurement variables, we apply our network on a similar dataset on Stroke, which is also compiled by NHANES. The Stroke dataset contains 37,177 subjects and 36 mixed-type measurement variables. Out of 37,177 subjects, there are 1269 subjects who reported that they had strokes. After applying LASSO, we found 34 variables which are important for further processing. All the models are trained with data from 1169 patients having CHD and 4300 without CHD. Except for our model, we do not apply LASSO prior to training the models. While random forest and SVM yield smaller differences in accuracies between CHD and No-CHD, they fall short in overall test accuracy (74.44% and 74.74%, respectively) (see Table 6). Our model correctly labels almost 80% of the test cases.

Table 6.
Comparison of machine learning models for stroke prediction.

Model                 CHD acc (%)   No-CHD acc (%)   Test Acc (%)
Logistic Regression   74            76.8             76.79
SVM                   75            74.75            74.74
Random Forest         74            74.45            74.44
AdaBoost              40            90.1             90
Vanilla-CNN           71            77.04            77.01
Our model             74            79.86            79.85

4.5. Notes on the resilience to data imbalance

As stated in the introduction, we are primarily interested in problems that follow certain constraints: (1) The data is severely imbalanced due to the nature of its origin or the restriction on its acquisition; (2) Data augmentation techniques, especially those that involve under- and over-sampling algorithms and statistical distributions, are infeasible; (3) There are significant risks of misclassification. For simplicity, we consider binary classification of data containing mixed-type measurements as variables and investigate the surprising resilience to data imbalance, which is offered by convolution layers. For the assessment of 2D and multidimensional data,

Fig. 6. Results using three oversampling techniques for the data augmentation of the minority class. For each technique, the results provide training accuracy, test accuracy,
training loss, CHD accuracy (class-specific) and no-CHD accuracy (class-specific) over a number of epochs, added with t-SNE low-dimensional embedding for data visualiza-
tion in 3D. (a) t-SNE visualization of 90% of the original data used for training. 10% of the data is reserved for testing. (b) The results using random oversampling (ROS). Note
that we did not provide the t-SNE visualization for ROS as, in ROS, data samples from the minority class are randomly picked and added to the data, thereby maintaining
the same data with redundant samples. So, the visualization is same as the original data in (a). (c),(d) Results using SMOTE with visualization. (e),(f) Results using ADASYN
with visualization.
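The class score variance quoted in Section 4.3.3 is not given a formula in the text. As an assumption, the sketch below treats it as the population variance of the two class-specific accuracies, which reproduces the reported 9 for ROS (80%, 74%); the reported 3.06 for ADASYN is reproduced when the pair (79%, 75.5%) is used.

```python
from statistics import pvariance

# Assumed definition: population variance of the two class-wise accuracies (in %).
ros_variance = pvariance([80.0, 74.0])       # no-CHD, CHD accuracies for ROS
adasyn_variance = pvariance([79.0, 75.5])    # pair that matches the reported 3.06

print(ros_variance, adasyn_variance)
```

For two values a and b this quantity equals ((a - b) / 2) squared, so a smaller class score variance simply means the CHD and no-CHD accuracies lie closer together.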

Fig. 7. t-SNE visualization of the five undersampling techniques for the data reduction of the majority class. (a), (e) and (f) Near-miss using k-nearest neighbor (version-1,2
and 3). (b) Random subsampling with 3:1 as no-CHD: CHD data samples (one instance). (c) Edited nearest neighbor (EDN). (d) Instance hardness threshold (IHT).
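Section 4.3.4 suggests building an ensemble of LASSO+CNN classifiers on randomly subsampled datasets and labeling an unknown sample by majority voting. The sketch below illustrates only the voting step; the three threshold rules are hypothetical stand-ins for trained models, and the feature layout of `sample` is invented for the example.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Return the label predicted by most classifiers (use an odd count to avoid ties)."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for models trained on different random subsamples.
# Each maps a feature tuple (age, cholesterol) to 1 (CHD) or 0 (no-CHD).
classifiers = [
    lambda x: int(x[0] > 50),
    lambda x: int(x[1] > 200),
    lambda x: int(x[0] > 60),
]

label = majority_vote(classifiers, (55, 240))
```

Each ensemble member sees all minority samples but a different random slice of the majority class, so the vote can cover more of the majority-class variation than any single 3:1 subsample.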

Fig. 8. Results using the undersampling algorithms from Fig. 6(a)–(e). (a) CHD detection accuracies over epochs using the algorithms. (b) Detection accuracies of no-CHD test
data. Among all the competitive undersampling strategies that we compare our results with, near-miss (version 3) works better in improving both class-specific accuracies.

we need more refined approaches to address the resilience to data imbalance. It is because in 2D cases, for example image data, there exists spatial correlation among pixels that needs to be taken into account, whereas we are considering mixed-type 1D variables that may or may not be correlated at all.

The small number of samples in the minority class and the infeasibility of data augmentation prohibit us from designing deep networks. Therefore, we pay our attention to shallow networks in this context. The results of various sequential convolutional networks are enumerated in Table 7.

Table 7 suggests that, if properly trained, MLP indeed shows improvement in accuracy of the majority class, which unfortunately affects the accuracy of the minority class. The difference between classwise accuracies is 27.14% for MLP-I. MLP-II with an extra deep

Table 7.
Experiments with different shallow layers on CHD dataset. I = input, O = output. C2 = a convolution layer with 2 filters, C4 = a convolution layer with 4 filters, C8 = a convolution layer with 8 filters. 64, 128, 512 = dense layers.

Model                            No-CHD acc (%)   CHD acc (%)   Overall acc (%)   Acc difference (%)   No of parameters
MLP-I (I-512-128-O)              84.74            57.6          84.63             27.14                88,706
MLP-II (I-64-128-256-O)          81.63            75.9          81.60             5.73                 45,442
Conv-I (I-128-C4-C8-O)           78               75.96         77.97             2.04                 6306
Conv-II (I-128-C2-C4-C8-O)       79.4             77            79.38             2.4                  3104
Conv-III (I-128-C4-O)            77.42            75.35         77.4              0.07                 6426
Conv-IV (I-64-C2-128-C4-O)       81.53            76.9          81.43             4.63                 11,678
Our model (I-64-C2-C4-512-O)     83.17            77.88         82.3              5.29                 32,066

layer compared to MLP-I seems to have a decline in the accuracy difference (5.73%). However, this is achieved after careful training, and there is a significant chance of overfitting, as the number of trainable parameters of MLP-II is 45,442, which is approximately 9 times the amount of input data.

It can be noticed that the convolutional layers provide surprising resilience to class imbalance in terms of the difference between classwise accuracies. Conv-I, II and III yield 2.04%, 2.4%, and 0.07% accuracy differences. However, this comes at the cost of lower overall accuracy scores. Restricting ourselves to sequential design for simplicity, we investigate two further architectures, Conv-IV and the last one in Table 7. After rigorous training, it is observed that the sequential placement of C2-C4 yields better accuracy.

Note that the total number of parameters of our architecture is 32,066, which is significantly higher than Conv-I, II and III, but moderately lower than MLP-I and II. Such a large number of parameters is due to the presence of the dense layers (64 and 512). While C2-C4 attempts to minimize the difference between classwise accuracies, the dense layers try to improve the overall test accuracy.

Another point worth mentioning is the stability of the accuracy achieved by each individual model. While training MLP-I and II, it is observed that the training and test accuracies have the tendency to monotonically increase over epochs when we gradually decrease the weight ratio to 13:40 (see Table 3 for weight ratio) according to the previously mentioned training schedule. This is degenerative because after each epoch, the accuracies (train and test) tend to be higher than the ones at the previous epoch (destabilization) while the accuracy of the minority class starts plummeting. This degeneration is strikingly diminished after the inclusion of multiple convolution layers. In short, the accuracies that our proposed model yields so far are stable for a fairly large number of epochs.

5. Conclusion, limitations and future research

In this paper, we propose a multi-stage model to predict CHD using highly imbalanced clinical data containing both qualitative and quantitative attributes. For such clinical data, imbalance is an imminent challenge that exists due to the limited availability of data. Such data imbalance adversely affects the performance of any state-of-the-art clinical classification model. As a remedy to the imbalance problem, one cannot efficiently apply conventional techniques, such as a data augmentation strategy, due to the biological implausibility of replicating several attributes in the clinical data. By way of extensive experimentation and validation, we establish that a special-purpose, shallow convolutional neural network exhibits a considerable degree of resilience towards data imbalance, thereby producing classification accuracy superior to the existing machine learning models (Table 5). Our model is simple in concept, modular in design, and offers moderate resilience to data imbalance.

The proposed model starts with the application of LASSO regression in order to identify the contribution of significant variables or attributes in the data variation. Using multiple instances of randomly subsampled datasets, LASSO is performed repeatedly to check the consistency of the variable contribution, which is a crucial step in our algorithm to control the true-negatives of variable selection. Finally, a majority voting algorithm is applied to extract the significant variables of interest, a step that achieves dimensionality reduction by excising unimportant variables. We do not follow the conventional dimension reduction techniques, such as Local Linear Embedding (LLE) and Principal Component Analysis (PCA), because these methods generally provide dimensions that are linear or nonlinear combinations of the data variables, leading to a lack of interpretability of the derived dimensions. For example, a linear combination of BMI and alkaline phosphatase (ALP) is difficult to interpret. Rather, we explicitly use t-SNE for the visualization of under- and over-sampled data generated by applying state-of-the-art algorithms. As we utilize LASSO, a potential research avenue would be to test if LASSO reflects the true importance of a variable through its shrinkage, and if not, this would call for the construction of an appropriate optimization function.

Once we obtain the significant predictor variables with LASSO and feature voting, we feed them to our 1-D feedforward convolutional neural network. We substantiate that shallow convolutional layers provide adaptability to data imbalance through our results in Table 5, where, in contrast to Logistic Regression and AdaBoost, our model provides a balanced class-wise classification accuracy. In a cohort of 37,079 individuals with high imbalance between presence of CHD and Non-CHD, we show that it is possible to predict the CHD cases with 77.3% and Non-CHD cases with 81.8% accuracy, indicating that the prediction of the CHD class, which is deficient in the number of reported patient samples, does not suffer significantly from the data imbalance. The preprocessing stage, consisting of repeated LASSO and majority voting, is pivotal in filtering out highly correlated variables, setting flags only for the uncorrelated ones to be fed to our CNN model. This is clearly observed from Fig. 3. Each LASSO stage maintains a 1:1 ratio of CHD: Non-CHD data to avoid the adverse effect of data imbalance on the final shrinkage parameters γ. However, the 1:1 ratio does not encapsulate enough variation in Non-CHD classification, as this class contains a large number of reported cases in our dataset. We repeat LASSO with randomly sampled subsets and eventually apply majority voting to assess the importance of a variable. Once the dominant predictor variables are identified, we discard their LASSO values in this paper. In future work, instead of disregarding the weight values generated by LASSO and majority voting, one can integrate them into the subsequent CNN model as priors, and test if that leads to enhancement in the performance of the model.

A potential problem might arise while using LASSO due to the linear nature of the estimator. LASSO is a penalized regression technique, where sparsity of variables is enforced by the L1 penalty, which is convex but non-differentiable. Competitive methods, such as cross-correlation based variable selection, also support the linear map, though with a small difference. LASSO exploits partial correlation between the factors, which caters to the relevant prediction of output responses, whereas cross-correlation computes, in a sense, the marginal cor-
A. Dutta, T. Batabyal and M. Basu et al. / Expert Systems With Applications 159 (2020) 113408 15

relation between each pair of factors, which might not be mono- last two dense layers can be retrained for new data. Thus, a sig-
tonic and linear all the time. Nonetheless, the assumption of linear nificant future research direction would be to implement CNN for
relationship between the input factors and output labels may have predictions from similar clinical datasets where such imbalanced
some consequential limitations and the number of factors might number of positive and negative classifications exist.
be significantly greater than what is present in the current data. A
further refined approach, in this case, would be a two-step, non- Credit author statement
linear reduction of dimension, where, we can use techniques, such
as sure independence screening (SIS), conditional SIS or graphical Aniruddha Dutta: Conceptualization, Data curation, Investi-
LASSO to approximate the partial correlation/covariance among the gation, Formal analysis, Writing original draft; Tamal Batabyal:
factors. A suitable threshold would give a reduced dimension to Methodology, software, validation, Writing- original draft; Meheli
apply LASSO afterwards for further reduction in the dimension. Basu: Data curation, Investigation, Writing-Review & Editing; Scott
A possible future direction of this work is to consider nutrition Acton: Writing Review & Editing, Supervision. Aniruddha Dutta and
and dietary data recorded by NHANES as additional predictor vari- Tamal Batabyal are equal contributors.
ables for CHD prediction. Dietary factors play an important role
Declaration of Competing Interest
in CHD occurrence (Bhupathiraju, 2011; Masironi, 1970) and the
prediction accuracy of CHD by including additional dietary vari-
ables could be explored. For example, until very recently, several The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to
prospective studies concluded that total dietary fat was not sig-
influence the work reported in this paper. The authors declare the
nificantly associated with CHD mortality (Howard, Van Horn, and
Hsia, 2006; Skeaff, 2009). However, according to American Heart following financial interests/personal relationships which may be
considered as potential competing interests
Association (AHA), it is the quality of fat which determines CHD
risk (Lichtenstein, Appel, and Brands, 2006, USDA 2010). Individual
Acknowledgments
experiments performed with NHANES dietary data have discussed
the association of cholesterol, LDL, HDL, amino acids and dietary
The authors acknowledge the high-performance computational
supplements with CHD (references). However, individual consump-
support from The Center for Advanced Computing (CAC) at Queen’s
tion of nutrients takes place collectively in the form of meals con-
University, Canada and the Center for Research Computing at Uni-
sisting of combination of nutrients (Hu, 2002; Sacks, Obarzanek,
versity of Pittsburgh, USA. This research is not funded by any ex-
and Windhauser, 1995). This may lead to multi-collinearity among
ternal research grant.
factors and thus a more complex dietary pattern analysis, con-
trolling for multicollinearity of CHD associated significant nutri- References
ents could lead to a more comprehensive approach to CHD preven-
tion. Additionally, some of the clinical predictor variables included Ahmed, M. A., Yasmeen, A. A., Awadalla, H., Elmadhoun, W. M., Noor, S. K., & Al-
mobarak, A. O. (2017). Prevalence and trends of obesity among adult Sudanese
in the classification model of CHD prediction may themselves be individuals: Population based study. Diabetes & Metabolic Syndrome: Clinical Re-
impacted by certain dietary habits of patients. Thus, inclusion of search & Reviews, 11(2), 963–967. doi:10.1016/j.dsx.2017.07.023.
dietary data of patients along with clinical predictor variables, in Benjamin, E. J., Muntner, P., Alonso, A., Bittencourt, M. S., Callaway, C. W., Car-
son, A. P., et al. (2019). Heart disease and stroke statistics—2019 update: A re-
prediction of CHD, can also lead to potential endogeneity issues. port from the american heart association. American Heart Association, 139, 56–
However, with appropriate treatment of endogeneity, dietary data 528. doi:10.1161/CIR.0 0 0 0 0 0 0 0 0 0 0 0 0659.
inclusion is expected to provide further insights and improved ac- Bhupathiraju, S. N., & Tucker, K. L. (2011). Coronary heart disease prevention: Nu-
trients, foods, and dietary patterns. Clinica Chimica Acta, 412(17-18), 1493–1514.
curacy of CHD diagnosis.
doi:10.1016/j.cca.2011.04.038.
Finally, the preferred selection between data augmentation and Burchfiel, C. M., Tracy, R. E., Chyou, P., & Strong, J. P. (1997). Cardiovascular risk fac-
data subsampling is much debated and demands attention in this tors and hyalinization of renal arterioles at autopsy. Arteriosclerosis, Thrombosis,
and Vascular Biology, 17(4), 760–768. doi:10.1161/01.ATV.17.4.760.
section. Our argument in favor of subsampling is as follows: as
Burke, A. P., Farb, A., Malcom, G. T., Liang, Y. H., Smialek, J., & Virmani, R. (1997).
observed from the t-SNE figures in the result section, the CHD Coronary risk factors and plaque morphology in men with coronary disease
and no-CHD classes are densely interspersed. Moreover, the class- who died suddenly. The New England Journal of Medicine, 336(18), 1276–1282.
specific clusters are highly non-convex and extremely hard to sep- doi:10.1056/NEJM199705013361802.
Celermajer, D. S., Sorensen, K. E., Georgakopoulos, D., Bull, C., Thomas, O., Robin-
arate using naïve nonlinear classifiers. Synthetic data samples us- son, J., & Deanfield, J. E. (1993). Cigarette smoking is associated with dose-
ing strategies, such as a random sampling on the line connecting related and potentially reversible impairment of endothelium-dependent dila-
an arbitrary pair of data samples (used in SMOTE, ADASYN) might tion in healthy young adults. Circulation, 88(5), 2149–2155. doi:10.1161/01.CIR.
88.5.2149.
receive the wrong label. It is because of the fact that the newly Center for Nutrition Policy and Promotion. (2010). Dietary guidelines for Americans.
sampled data sample has the likelihood to be labeled as “0 (for US Department of Agriculture.
training) if the pair of data samples belongs to class “0 . How- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Syn-
thetic minority over-sampling technique. Journal of Artificial Intelligence Research,
ever, the data sample may be biologically implausible or, in case 16, 321–357. doi:10.1613/jair.953.
of potential plausibility, may actually be a sample from class “1 Chobanian, A. V., Bakris, G. L., Black, H. R., Cushman, W. C., Green, L. A., IzzoJr, J. L.,
as both the classes are densely mixed. Especially, when the data is Jones, D. W., Materson, B. J., Oparil, S., WrightJr, J. T., & Roccella, E. J.National
High Blood Pressure Education Program Coordinating Committee. (2003). Sev-
significantly imbalanced, such as in the case of our data, the num-
enth report of the joint national committee on prevention, detection, evalu-
ber of synthesized data samples of the minority class is large. A ation, and treatment of high blood pressure. Hypertension, 42(6), 1206–1252.
countable fraction of such newly synthesized, incorrectly labeled doi:10.1161/01.HYP.0 0 0 0107251.49515.c2.
Chonchol, M., & Nielson, C. (2008). Hemoglobin levels and coronary artery disease.
data imposes a large bias on the trained network and increases the
American Heart Journal, 155(3), 494–498. doi:10.1016/j.ahj.2007.10.031.
probability of misclassification. Therefore, we prefer to adopt the Clifton, P. M. (2011). Protein and coronary heart disease: The role of different
sub-sampling strategy, where the authenticity of data is preserved, protein sources. Current Atherosclerosis Reports, 13(6), 493–498. doi:10.1007/
barring the measurement and acquisition noise. It is an interesting s11883-011-0208-x.
Das, R., Turkoglu, I., & Sengur, A. (2009). Effective diagnosis of heart disease through
avenue to explore if the extension of shallow CNN models, in terms neural networks ensembles. Expert Systems with Applications, 36(4), 7675–7680.
of architecture and data sub-sampling, to implementation of neural doi:10.1016/j.eswa.2008.09.013.
net-based learning on similar clinical datasets, improves the pre- Fava, A., Plastino, M., Cristiano, D., Spanò, A., Cristofaro, S., Opipari, C., Chillà, A.,
Casalinuovo, F., Colica, C., Bartolo, M. D., Pirritano, D., & Bosco, D. (2013). Insulin
diction accuracy of the classification process. As explained earlier, resistance possible risk factor for cognitive impairment in fibromialgic patients.
our model can also be used as a transfer learning model and the Metabolic Brain Disease, 28(4), 619–627. doi:10.1007/s11011-013-9421-3.
Goldstein, B. A., Navar, A. M., & Carter, R. E. (2017). Moving beyond regression techniques in cardiovascular risk prediction: Applying machine learning to address analytic challenges. European Heart Journal, 38(23), 1805–1814. doi:10.1093/eurheartj/ehw302.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning (Vol. 1). Cambridge, MA: MIT Press.
Haq, A. U., Li, J. P., Memon, M. H., Nazir, S., & Sun, R. (2018). A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information Systems, 21 pages. doi:10.1155/2018/3860146.
Haskell, W. L., Alderman, E. L., Fair, J. M., Maron, D. J., Mackey, S. F., Superko, H. R., Williams, P. T., Johnstone, I. M., Champagne, M. A., & Krauss, R. M. (1994). Effects of intensive multiple risk factor reduction on coronary atherosclerosis and clinical cardiac events in men and women with coronary artery disease. The Stanford Coronary Risk Intervention Project (SCRIP). Circulation, 89(3), 975–990. doi:10.1161/01.CIR.89.3.975.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322–1328). IEEE. doi:10.1109/IJCNN.2008.4633969.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
Howard, B. V., Van Horn, L., Hsia, J., et al. (2006). Low-fat dietary pattern and risk of cardiovascular disease: The women’s health initiative randomized controlled dietary modification trial. JAMA, 295, 655–666. doi:10.1001/jama.295.6.655.
Hu, F. B. (2002). Dietary pattern analysis: A new direction in nutritional epidemiology. Current Opinion on Lipidology, 13, 3–9. doi:10.1097/00041433-200202000-00002.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv preprint.
Irie, F., Iso, H., Sairenchi, T., Fukasawa, N., Yamagishi, K., Ikehara, S., & Kanashiki, M. (2006). The relationships of proteinuria, serum creatinine, glomerular filtration rate with cardiovascular disease mortality in Japanese general population. Kidney International, 69(7), 1264–1271. doi:10.1038/sj.ki.5000284.
Jin, Z., Sun, Y., & Cheng, A. C. (2009). Predicting cardiovascular disease from real-time electrocardiographic monitoring: An adaptive machine learning approach on a cell phone. In Proceedings of international conference of the IEEE engineering in medicine and biology society (pp. 6889–6892). doi:10.1109/IEMBS.2009.5333610.
Kahramanli, H., & Allahverdi, N. (2008). Design of a hybrid system for the diabetes and heart diseases. Expert Systems with Applications, 35(1-2), 82–89. doi:10.1016/j.eswa.2007.06.004.
Kannel, W. B. (1996). Blood pressure as a cardiovascular risk factor: Prevention and treatment. Journal of the American Medical Association, 275(20), 1571–1576. doi:10.1001/jama.1996.03530440051036.
Kannel, W. B., Castelli, W. P., Gordon, T., & McNamara, P. M. (1971). Serum cholesterol, lipoproteins, and the risk of coronary heart disease. The Framingham study. Annals of Internal Medicine, 74(1), 1–12. doi:10.7326/0003-4819-74-1-1.
Kopel, E., Kivity, S., Morag-Koren, N., Segev, S., & Sidi, Y. (2012). Relation of serum lactate dehydrogenase to coronary artery disease. The American Journal of Cardiology, 110(12), 1717–1722. doi:10.1016/j.amjcard.2012.08.005.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th international conference on neural information processing systems (pp. 1097–1105).
Kurt, I., Ture, M., & Kurum, A. T. (2008). Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications, 34(1), 366–374. doi:10.1016/j.eswa.2006.09.004.
LeCun, Y., & Bengio, Y. (1995). Convolutional networks for images, speech, and time series. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 255–258). Cambridge, MA: MIT Press.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. doi:10.1038/nature14539.
Lichtenstein, A. H., Appel, L. J., Brands, M., et al. (2006). Diet and lifestyle recommendations revision 2006: A scientific statement from the American Heart Association Nutrition Committee. Circulation, 114, 82–96. doi:10.1161/CIRCULATIONAHA.106.176158.
Madani, A., Arnaout, R., Mofrad, M., & Arnaout, R. (2018). Fast and accurate view classification of echocardiograms using deep learning. npj Digital Medicine, 1, 6. doi:10.1038/s41746-017-0013-1.
Madjid, M., & Fatemi, O. (2013). Components of the complete blood count as risk predictors for coronary heart disease. Texas Heart Institute Journal, 40(1), 17–29.
Mani, I., & Zhang, I. (2003). kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of workshop on learning from imbalanced datasets: 126.
Marcus, G. (2018). Deep learning: A critical appraisal. arXiv preprint.
Martinez, F. L., Schwarcz, A., Valdez, E. R., & Diaz, V. G. (2018). Machine learning classification analysis for a hypertensive population as a function of several risk factors. Expert Systems with Applications, 110, 2016–2215. doi:10.1016/j.eswa.2018.06.006.
Masironi, R. (1970). Dietary factors and coronary heart disease. Bulletin of the World Health Organization, 42(1), 103–114.
Meigs, J. M., D’Agostino Sr, R. B., Wilson, P. W. F., Cupples, L. A., Nathan, D. A., & Singer, D. E. (1997). Risk variable clustering in the insulin resistance syndrome: The Framingham offspring study. Diabetes, 46(10), 1594–1600. doi:10.2337/diacare.46.10.1594.
Mordvintsev, A., Olah, C., & Tyka, M. (2015). Inceptionism: Going deeper into neural networks. Google AI Blog.
Nakanishi, R., Dey, D., Commandeur, F., Slomka, P., Betancur, J., Gransar, H., Dailing, C., Osawa, K., Berman, D., & Budoff, M. (2018). Machine learning in predicting coronary heart disease and cardiovascular disease events: Results from the multi-ethnic study of atherosclerosis (MESA). Journal of the American College of Cardiology, 71(11) Supplement. doi:10.1016/S0735-1097(18)32024-2.
National Center for Health Statistics. (2015). https://wwwn.cdc.gov/nchs/nhanes/ContinuousNhanes/Questionnaires.aspx?BeginYear=2015
Neilson, C., Lange, T., & Hadjokas, N. (2006). Blood glucose and coronary artery disease in nondiabetic patients. Diabetes Care, 29(5), 998–1001. doi:10.2337/dc05-1902.
Olaniyi, E. O., Oyedotun, O. K., & Khashman, A. (2015). Heart diseases diagnosis using neural networks arbitration. International Journal of Intelligent Systems and Applications, 7(12), 75–82. doi:10.5815/ijisa.2015.12.08.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint.
Roger, V. L. (2010). The heart failure epidemic. International Journal of Environmental Research and Public Health, 7(4), 1807–1830. doi:10.3390/ijerph7041807.
Sacks, F. M., Obarzanek, E., Windhauser, M. M., et al. (1995). Rationale and design of the dietary approaches to stop hypertension trial (DASH). A multicenter controlled-feeding study of dietary patterns to lower blood pressure. Annals of Epidemiology, 5, 108–118. doi:10.1016/1047-2797(94)00055-x.
Shaper, A. G., Wannamethee, S. G., & Whincup, P. H. (2004). Serum albumin and risk of stroke, coronary heart disease, and mortality: The role of cigarette smoking. Journal of Clinical Epidemiology, 57(2), 195–202. doi:10.1016/j.jclinepi.2003.07.001.
Shen, J., Zhang, J., Wen, J., Ming, Q., Zhang, J., & Xu, Y. (2015). Correlation of serum alanine aminotransferase and aspartate aminotransferase with coronary heart disease. International Journal of Clinical and Experimental Medicine, 8(3), 4399–4404.
Shen, Y., Yang, Y., Parish, S., Chen, Z., Clarke, R., & Clifton, D. A. (2016). Risk prediction for cardiovascular disease using ECG data in the China Kadoorie Biobank. In Proceedings of 38th annual international conference of the IEEE engineering in medicine and biology society (EMBC). doi:10.1109/EMBC.2016.7591218.
Shilaskara, S., & Ghatol, A. (2013). Feature selection for medical diagnosis: Evaluation for cardiovascular diseases. Expert Systems with Applications, 40(10), 4146–4153. doi:10.1016/j.eswa.2013.01.032.
Skeaff, C. M., & Miller, J. (2009). Dietary fat and coronary heart disease: Summary of evidence from prospective cohort and randomised controlled trials. Annals of Nutrition and Metabolism, 55, 173–201. doi:10.1159/000229002.
Smith, M. R., Martinez, T., & Giraud-Carrier, C. (2014). An instance level analysis of data complexity. Machine Learning, 95(2), 225–256. doi:10.1007/s10994-013-5422-z.
Stamler, J., Vaccaro, O., Neaton, J. D., & Wentworth, D. (1993). Diabetes, other risk factors, and 12-yr cardiovascular mortality for men screened in the multiple risk factor intervention trial. Diabetes Care, 16(2), 434–444. doi:10.2337/diacare.16.2.434.
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the thirty-first AAAI conference on artificial intelligence (pp. 4278–4284).
Uyar, K., & Ilhan, A. (2017). Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks. Procedia Computer Science, 120, 588–593. doi:10.1016/j.procs.2017.11.283.
Vasan, R. S., Larson, M. G., Leip, E. P., Evans, J. C., O’Donnell, C. J., Kannel, W. B., & Levy, D. (2001). Impact of high-normal blood pressure on the risk of cardiovascular disease. The New England Journal of Medicine, 345(18), 1291–1297. doi:10.1056/NEJMoa003417.
Venkatesh, B. A., et al. (2017). Cardiovascular event prediction by machine learning. The multi-ethnic study of atherosclerosis. Circulation Research, 121(9), 1092–1101. doi:10.1161/CIRCRESAHA.117.311312.
Wannamethee, S. G., Shaper, A. G., & Perry, I. J. (1997). Serum creatinine concentration and risk of cardiovascular disease: A possible marker for increased risk of stroke. Stroke, 28(3), 557–563. doi:10.1161/01.STR.28.3.557.
Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M., & Qureshi, N. (2017). Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS One, 12(4), e0174944. doi:10.1371/journal.pone.0174944.
Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, 3, 408–421. doi:10.1109/TSMC.1972.4309137.
Yang, Z., Zhang, T., Lu, J., Zhang, D., & Kalui, D. (2017). Optimizing area under the ROC curve via extreme learning machines. Knowledge-Based Systems, 130, 74–89. doi:10.1016/j.knosys.2017.05.013.
Zeiher, A. M., Drexler, H., Saurbier, B., & Just, H. (1993). Endothelium-mediated coronary blood flow modulation in humans. Effects of age, atherosclerosis, hypercholesterolemia, and hypertension. The Journal of Clinical Investigation, 92(2), 652–662. doi:10.1172/JCI116634.
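
As an illustration of the variable-selection step described in the discussion (LASSO repeated on randomly sampled subsets, followed by majority voting), the following is a minimal sketch and not the authors' implementation. The paper does not state the solver, the penalty strength, the number of repetitions, the subset size or the vote threshold, so the coordinate-descent solver, `alpha`, `n_rounds`, `subset_frac`, the "more than half of the rounds" rule and all function names below are assumptions made for illustration. The toy data are synthetic and are not NHANES data.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator; it is what produces exact zeros in LASSO."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100, tol=1e-6):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - Xw; equals y because w starts at zero
    for _ in range(n_iter):
        max_step = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            step = new_w - w[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                w[j] = new_w
                max_step = max(max_step, abs(step))
        if max_step < tol:
            break
    return w


def lasso_majority_vote(X, y, alpha, n_rounds=25, subset_frac=0.5, seed=0):
    """Fit LASSO on random subsets; keep variables that are nonzero in most rounds."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(2, int(subset_frac * n))
    votes = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), k)
        y_mean = sum(y[i] for i in idx) / k
        Xs = [X[i] for i in idx]
        ys = [y[i] - y_mean for i in idx]  # centring stands in for an intercept
        w = lasso_cd(Xs, ys, alpha)
        for j in range(p):
            if w[j] != 0.0:
                votes[j] += 1
    # Assumed voting rule: selected in strictly more than half of the rounds.
    selected = [j for j in range(p) if 2 * votes[j] > n_rounds]
    return selected, votes


def make_toy_data(n=300, p=5, seed=1):
    """Synthetic binary labels driven by features 0 and 1; the rest are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1.0 if row[0] - row[1] + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0.0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, votes = lasso_majority_vote(X, y, alpha=0.12)
    print("votes per variable:", votes)
    print("selected variables:", selected)
```

Only the indices of the selected variables are used afterwards; the fitted coefficients are thrown away, which mirrors the statement that the LASSO values are discarded once the dominant predictors are identified. Passing the vote counts forward as priors, as suggested for future work, would amount to keeping `votes` rather than only `selected`.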
