Article info

Article history:
Available online 2 August 2021

Keywords:
Data mining
Machine learning
Decision tree
Naïve Bayes
Support vector machine
Accuracy
Classification
Prediction

Abstract

Data mining for healthcare is an interdisciplinary field of study that originated in database statistics and is useful in examining the effectiveness of medical therapies. Machine learning and data visualization Diabetes-related heart disease is a kind of heart disease that affects diabetics. Diabetes is a chronic condition that occurs when the pancreas fails to produce enough insulin or when the body fails to properly use the insulin that is produced. Heart disease, often known as cardiovascular disease, refers to a set of conditions that affect the heart or blood vessels. Although various data mining classification algorithms exist for predicting heart disease, there is inadequate data for predicting heart disease in a diabetic individual. Because the decision tree model consistently outperformed the naive Bayes and support vector machine models, we fine-tuned it for best performance in forecasting the likelihood of heart disease in diabetic individuals.

© 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the scientific committee of the International Conference on Nanoelectronics, Nanophotonics, Nanomaterials, Nanobioscience & Nanotechnology.
1. Introduction

In terms of data collecting and processing, healthcare is one of the most worrisome industries. With the advent of the digital era and technological advancements, a vast quantity of multidimensional data on patients is created, including clinical factors, hospital resources, illness diagnostic information, patients’ records, and medical equipment. The enormous, dense, and complex data must be processed and evaluated in order to extract knowledge for effective decision making. Medical data mining offers a lot of potential for uncovering hidden patterns in medical data sets [1].

By identifying significant patterns and detecting correlations and relationships among many variables in huge databases, the use of various data mining tools and machine learning approaches has changed healthcare organizations [2,3]. It serves as an important instrument in the medical sector, providing and comparing existing data for the future course of action. This technology combines multiple analytic methodologies with modern and complex algorithms, allowing for the exploration of massive amounts of data [4]. It is used in healthcare to gather, organize, and analyze patient data in a systematic manner. It may be used to identify inherent inefficiencies and best practices for providing better services, which may lead to improved diagnosis, better medicine, and more successful treatment, as well as a platform for a deeper knowledge of the mechanisms in practically all elements of the medical domain. Overall, it assists in the early detection and prevention of disease epidemics by searching medical databases for pertinent information.

The process of determining a condition based on a person’s symptoms and indicators is known as medical diagnosis. In the diagnostic process, one or more diagnostic procedures, such as diagnostic tests, are performed. Diagnosis of chronic illnesses is a vital issue in the medical industry since it is based on many symptoms. It is a complex procedure that frequently leads to incorrect
tems evolve and new treatments become available, it becomes more difficult for physicians and doctors to keep up with the current innovations in clinical practice [6]. For effective therapy, medical practitioners and doctors must be well-versed in all pertinent diagnostic criteria, patient history, and combinations of medication therapy. However, mistakes are possible since they make judgments instinctively, based on information and experience gained from past patients. Their cognitive capacities are limited by factors such as multitasking, restricted analysis, and memory capacity [7]. As a result, it is difficult for a physician to make the right judgment consistently without the support of clinical tests and patient history information. Even experienced physicians can benefit from a computer-aided diagnostic system in making sound medical judgments [8]. Thus, medical professionals are very interested in automating the diagnosis process by integrating machine learning techniques with physician expertise [9]. Data mining and machine learning approaches aim to translate the available data intelligently into valuable information in order to improve the efficiency of the diagnostic process. Several studies have explored the diagnostic abilities of machine learning. It was found that, whereas the most experienced physician can diagnose with 79.97% accuracy, machine learning algorithms could diagnose with 91.1% accuracy [10]. Machine learning techniques are applied to disease datasets to extract features for optimal illness diagnosis, prediction, prevention, and

2. Related work

Bayesian classifiers use a structural model and a set of conditional probabilities. They assume that the contributions of all variables are independent. The classifier first calculates the prior probability of each class and then applies the occurrence of each variable value to an unknown case. A Bayesian network classifier is built on a Bayesian network, which represents a joint probability distribution over a set of categorical attributes.

The SVM method and the Naïve Bayes technique were used to predict kidney disease [11]. The authors attempted to categorize various stages of kidney disease using the suggested ANFIS algorithm. The study’s purpose was to design an effective classification algorithm, assessed with metrics such as accuracy and execution time. SVM provided higher classification accuracy, while Naïve Bayes produced results in less time. Overall, the results show that SVM outperforms Naïve Bayes in predicting kidney disease.

The fuzzy technique with a membership function was used to forecast cardiac disease [12]. Using the fuzzy KNN classifier, the authors attempted to eliminate ambiguity and uncertainty from the data. The 550-record dataset was separated into 25 classes of 22 items each, and then divided into two equal parts for training and testing. The fuzzy KNN methodology was applied after pre-processing. The technique was evaluated with several metrics, including accuracy, precision, and recall. The results showed that the fuzzy KNN classifier outperformed the KNN classifier in terms of accuracy.

For the prediction of cardiac disease, a novel technique based on the ANN algorithm was devised [13]. The researchers created an interactive prediction method based on classification with an artificial neural network, taking into account the thirteen most important clinical parameters. The suggested method predicted heart disease with an accuracy of 80% and can be very useful for healthcare practitioners.

Authors in [14] presented an automated approach for answering difficult queries for heart disease prediction. The Naive Bayes methodology was used to create this intelligent system in order to provide quicker, better, and more accurate outcomes. It might aid doctors in making clinical judgments about heart attacks. This system may be enhanced by including SMS functionality, building Android and iOS mobile applications, and including a pacemaker in the order.

Diabetes and breast cancer were diagnosed by incorporating an adaptivity characteristic into support vector machines [15]. The goal was to offer a rapid, automated, and adaptable diagnostic method using adaptive SVM. To achieve better results, the bias value of the conventional SVM was changed. The suggested classifier produced output in the form of ‘if-then’ rules. The proposed method was used to diagnose diabetes and breast cancer, and it achieved 100% correct classification rates for both conditions. Future research should focus on developing more efficient ways of changing the bias value in conventional SVM.

For the prediction of type 2 diabetes, a hybrid model based on clustering followed by classification was proposed [16]. For prediction, the suggested model uses K-means clustering and the C4.5 classification method with k-fold cross-validation. The hybrid technique achieved a classification accuracy of 88.38 percent, which might be highly useful for clinicians in making appropriate clinical choices related to diabetes.

3. Framework for multiple disease prediction

This framework uses three machine learning algorithms: support vector machine, naïve Bayes, and decision tree.

Naive Bayes classification [14] is a basic probabilistic classification method that applies Bayes’ theorem under strong independence assumptions. The presence or absence of a particular feature of a class does not depend on the presence or absence of any other feature. The method operates on conditional probabilities: Bayes’ theorem determines the probability that one event happens given that another event has happened. If B represents the dependent event and A represents the prior event, Bayes’ theorem may be stated as $P(B \mid A) = P(A \cap B) / P(A)$. That is, the number of cases in which A and B occur together is divided by the number of cases in which A occurs to obtain the probability of B given A. The Naive Bayes classifier needs only a small amount of training data to estimate its parameters (the means and variances of the variables). Because the variables are assumed to be independent, only the variances of the variables need to be computed for each class. The method is applicable to binary as well as multi-class problems.

SVM [15] is a kernel-based learning method often used to handle large prediction problems. Compared with other classifiers, the SVM classifier has shown better generalization and scales well to both linear and nonlinear data. In addition, the SVM classifier delivers very strong pattern recognition performance in conjunction with various frequently used approaches from statistical learning and optimisation theory. The main aim of SVM classification is to find a hyperplane that separates positive examples from negative examples with the largest margin.

When the data is linearly separable, it is easy to choose the optimum hyperplane splitting the two classes of data. For problems that are not linearly separable, on the other hand, SVM applies kernel functions to map the data nonlinearly into a higher-dimensional space. There are a number of kernel functions, including the Linear Kernel Function (LKF), Polynomial Kernel Function (PKF), Sigmoid Kernel Function (SKF), Exponential Radial Basis Kernel Function (ERBKF), and Gaussian Radial Basis Kernel Function (GRBKF). The Radial Basis Function (RBF) has been identified as the best kernel function among these.

Decision trees [16] are used extensively for classifying large datasets. Decision trees categorize data between the root node
and the leaf node. The resulting tree can be used to generate rules, and these rules are easy to understand. Many decision tree algorithms are available, including ID3, C4.5, and CART. C4.5 is a sophisticated decision tree algorithm for data mining that is based on the gain ratio. Its main benefits are that it works well with both categorical and continuous features, handles missing values correctly at run time, and uses less memory. Its drawback is that it can produce excessive and insignificant branches. ID3 is based on information gain; it works with discrete features and does not handle missing values. CART generates binary decision trees and is based on the Gini index.

In the framework, the Cleveland data set [17] is used as input. The data set is preprocessed to eliminate noise and make the data consistent. The clean, consistent data is then fed into the machine learning algorithms: SVM, Naïve Bayes, and Decision Tree C4.5. These algorithms classify the data they receive, and the classified data is then used as training data for the prediction task. When new patient data is introduced into the framework, the framework predicts whether the new patient’s data is normal or abnormal based on the learned data available in the classes. It also provides names for possible diseases. Figs. 1 and 2 exhibit the accuracy and error rate of the machine learning algorithms.

[Figure: error in % for the Naïve Bayes, SVM, and Decision Tree classifiers]

4. Conclusion

Data mining for healthcare is an interdisciplinary topic of research that evolved from database statistics and is valuable in assessing the efficacy of medical interventions. Data visualization with machine learning Diabetes-related heart disease is a kind of heart disease that occurs in diabetics. Diabetes is a chronic disease that arises when the pancreas fails to create enough insulin or when the body fails to utilize the insulin that is generated appropriately. Heart disease, often known as cardiovascular disease, is a group of disorders affecting the heart or blood vessels. Despite the existence of many data mining classification methods for predicting heart disease, there is insufficient data to predict heart disease in a diabetic individual. We fine-tuned the decision tree model for optimum performance in forecasting the chance of heart disease in diabetic patients since it consistently outperformed the naive Bayes and support vector machine models.
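
Since the decision tree is the model retained here, it may help to make concrete the split criteria that Section 3 attributes to ID3, C4.5, and CART (information gain, gain ratio, and Gini index). The following is a minimal Python sketch for a single categorical feature, not the implementation used in this work. The function names and the toy records are assumptions introduced for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels (the criterion used by CART)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the value of one categorical feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting on the feature (ID3)."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the split information (C4.5)."""
    split_info = entropy(values)  # entropy of the feature's own value distribution
    if split_info == 0.0:
        return 0.0  # a feature with a single value cannot split the data
    return information_gain(values, labels) / split_info


# Hypothetical toy records, invented for illustration (not the Cleveland data).
chest_pain = ["yes", "yes", "no", "no"]
diagnosis = ["abnormal", "abnormal", "normal", "normal"]

print("entropy    :", entropy(diagnosis))
print("gini       :", gini(diagnosis))
print("info gain  :", information_gain(chest_pain, diagnosis))
print("gain ratio :", gain_ratio(chest_pain, diagnosis))
```

In this toy case the feature separates the two classes perfectly, so the information gain equals the full entropy of the labels. The gain ratio additionally normalizes by the entropy of the feature values, which penalizes features that take many distinct values.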
[Figure: accuracy in % for the Naïve Bayes, SVM, and Decision Tree classifiers]
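
Accuracy and error rate, the two quantities reported for the three classifiers, are complementary percentages. The short Python sketch below shows how they could be computed from true and predicted labels. The function names and the example labels are hypothetical and do not reproduce the results of this work.

```python
def accuracy_percent(y_true, y_pred):
    """Share of predictions that match the true labels, in percent."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def error_percent(y_true, y_pred):
    """Share of misclassified cases, in percent (100 minus the accuracy)."""
    return 100.0 - accuracy_percent(y_true, y_pred)


# Hypothetical labels, invented for illustration only (not the paper's data).
y_true = ["abnormal", "normal", "abnormal", "normal", "normal"]
y_pred = ["abnormal", "normal", "normal", "normal", "normal"]

print("accuracy in %:", accuracy_percent(y_true, y_pred))
print("error in %   :", error_percent(y_true, y_pred))
```

Reporting both values is redundant in the binary normal/abnormal setting, since each determines the other, but plotting them separately makes differences between classifiers easier to read.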
