International Journal of Management and Data Analytics (IJMADA)

Int. J. Management and Data Analytics, Vol. 4(1), 43-54


ISSN: 2816-9395
Journal Homepage: http://ijmada.com
https://doi.org/10.5281/zenodo.13926972

Performance Analysis of Diabetes Detection Using Machine Learning Classifiers
Hung Vu Trung Huynh, Liu Hui, Ngoc Han Nguyen, Ruixuan Qiao
University Canada West, Vancouver, BC, Canada
Received: September, 2024, Published: Oct, 2024

ARTICLE INFO

Keywords: Classifiers; Diabetes Prediction and Diagnosis; Healthcare; Machine Learning

ABSTRACT

Diabetes is a chronic medical condition that has long posed severe public health challenges, not only in Canada but worldwide, affecting millions of people and putting pressure on healthcare resources. Conventional diagnostic procedures sometimes depend on few data points and are prone to mistakes, resulting in premature action. Additionally, the sluggish adoption of modern machine learning (ML) technologies in the healthcare industry might be due to a limited understanding of these systems’ decision-making procedures. This study seeks to fill that gap by applying various ML algorithms to the Pima Indians Diabetes Dataset provided by the National Institute of Diabetes and Digestive and Kidney Diseases, with the aim of improving the validity of diabetes prediction and diagnosis. Three types of machine learning classifiers are used: tree-based, function-based, and rule-based. The results show that Stochastic Gradient Descent (function), Logistic Regression (function), JRip (rules), and Random Forest (trees) are among the top-performing classifiers. They are judged on several metrics: accuracy, precision, recall, specificity, F1 score, MCC, and ROC area. Despite performing well on almost all metrics, SGD’s low recall score shows that it is not the optimal algorithm. Given that recall is prioritized in clinical diagnostics, Random Forest emerges as a strong candidate due to its balanced performance across key metrics.

1. INTRODUCTION

Diabetes is a chronic medical illness that affects millions of individuals globally, posing considerable dangers that could be fatal (Mousa et al., 2023). Furthermore, the condition is incurable, and patients can only manage its symptoms through expensive personalized treatment and therapy. Identifying a person’s susceptibility to diabetes is a huge undertaking. Early diagnosis and intervention can slow the progression of the disease while also helping patients save on their medical bills. Traditional data analysis methods do not cope efficiently with the complexity and volume of data, which can lead to low efficiency in disease diagnosis and patient management. In this respect, machine learning is a powerful technology that could help with decision-making, strengthen illness prediction, and thereby provide better medical treatment to patients (Davenport & Kalakota, 2019).

Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions using algorithms. It can identify hidden patterns that are not obvious to the human eye. The ability to parse large, complex data efficiently is one of the core factors helping AI systems surpass humans in assessing the likelihood of disease onset (Hounguè & Bigirimana, 2022). According to Iparraguirre-Villanueva et al. (2023), ML models have proven to be accurate forecasters of diabetes’ growth. These models can help predict the onset of diabetes or liver disorders by analyzing a patient’s medical history, demographics, lifestyle, and genetic composition. With this information, medical providers can take preventive measures, draw up treatment plans, and make better use of limited healthcare resources.

To enhance treatment options and make healthcare more accessible to the less fortunate, an automated computerized system is needed to diagnose diabetes. This research therefore examines approaches for identifying diabetes mellitus using three families of classification methods: tree-based, rule-based, and function-based. The paper further discusses the role of machine learning in overcoming challenges in healthcare, focusing on its application in diabetes treatment. By reviewing the existing literature and analyzing different machine learning algorithms, the study highlights how these technologies improve diagnostic accuracy and patient outcomes. In addition, we discuss the impact of these algorithms on healthcare workers, patients, and decision-makers, highlighting why it is crucial to apply these innovative technologies responsibly and ethically in everyday practice.

Contact: Hung Huynh, hunghuynhza@gmail.com



Many tree-based algorithms, namely Random Tree, J48 (the C4.5 algorithm), and Random Forest, can capture and handle complicated non-linear relationships in data. They can generate decision rules that are easy for medical practitioners to understand, which makes them especially suitable for clinical settings. For the same reasons, rule-based algorithms such as RIPPER, OneR, and PART aim to create human-readable rules from data. These algorithms are widely used in medical settings because they provide clear and actionable interpretations to guide clinical decisions, which is particularly valuable in environments that demand transparency and explainability. Lastly, function-based algorithms such as Logistic Regression, Stochastic Gradient Descent (SGD), and Multilayer Perceptron (MLP) are well known for their high accuracy and ability to identify critical patterns in large datasets. Their predictive power makes them highly effective in detecting subtle signs of disease that may be missed by simpler models.

By comparing these algorithms, we aim to determine which is the most effective at predicting diabetes from a given dataset. The comparison is based on key performance metrics such as accuracy, sensitivity/recall, specificity, precision, ROC Area, and MCC. In doing so, we seek to identify the ML algorithms with the best all-round performance in prediction and practical application so that they can be used effectively to correctly identify patients who test positive for diabetes.

The rest of the paper is structured as follows. Section two covers the literature review, providing an overview of existing ML applications in healthcare, particularly in diabetes. After that, a comprehensive analysis and discussion of the classifiers used in this study is presented, together with a comparison of classifier performance across various key performance metrics. Finally, section five summarizes our findings and recommendations for practical ML implementation, discusses the limitations of this study, and indicates how future studies could address these limitations.

2. LITERATURE REVIEW

A. AI in Healthcare Context

AI entails the techniques which enable computers to mimic human intelligence, allowing them to acquire knowledge from data and experiences while carrying out complex tasks that would normally call for human intellect (L. P. Nguyen et al., 2023). ML is a subset of AI techniques that focuses on using statistical methodologies to help computer systems improve with experience. Leveraging information from people’s regular physical assessments can assist in producing a preliminary evaluation, serving as an instrument for healthcare professionals in early diagnoses and predictions.

The continual evolution of today’s world has brought with it an exponential growth of data whose increased availability makes it readily accessible for building AI and machine learning (ML) algorithms. Within the context of healthcare, the growing availability of data, coupled with the ubiquity of smart devices and data-driven resources, has made it easier for healthcare providers to develop an accurate representation of a patient’s condition over time, while also offering novel perspectives on unprecedented cases (Kolasa et al., 2023).

As stated, modern technology and its automated processes facilitate the handling of huge volumes of data, making machine learning an ideal tool to help doctors and other medical personnel make informed decisions. Doctors can look into a patient’s illness using medical measurements that include body temperature and arterial pressure and prescribe remedies after having gone through a series of iterative analyses (Bhat et al., 2023). Moreover, Davenport and Kalakota (2019) asserted that ‘precision medicine’ is the most common application of machine learning in healthcare, as it predicts which form of treatment is most likely to be effective for a patient. This is evaluated based on a vast collection of patients’ attributes, their previous therapies, their family’s medical history, and the setting in which they are treated. In addition, although a healthcare practitioner and a machine learning algorithm can arrive at the same conclusion from the same information, the latter responds considerably faster, which enables intervention to take place sooner (Javaid et al., 2022). Javaid et al. (2022) further pointed out that ML techniques are favored because they reduce the possibility of human error.

B. ML/AI in Diabetes Diagnoses

For a chronic metabolic condition like diabetes, Artificial Intelligence (AI) technologies such as machine learning and deep learning play a crucial role in facilitating decision-making for healthcare professionals (Chang et al., 2022; Baadel et al., 2020). They also help clinicians monitor and manage their patients, aided by treatments customized to each patient’s health record. In practice, ML/AI approaches use genomic data to assess and predict diabetes risk and electronic health records (EHR) to diagnose diabetes (Salazar-Reyna et al., 2020).

Given the prevalence of diabetes, several machine learning algorithms, for instance Random Forest (RF), Decision Trees (DT), Neural Networks, Logistic Regression, and ensembles, have been proposed to detect the risk of the condition early on, using features such as age, blood glucose levels, body mass index (BMI), blood
pressure, and other pertinent variables (Bhat et al., 2023). Researchers have adopted different methodologies to diagnose and predict whether a patient has diabetes. For instance, Shetty et al. (2017) used KNN and Naive Bayes methods, entering patient records into a software program to determine whether the person in question has the condition. Similarly, Ahmed (2016) leveraged patient records and treatment plans to classify the disease using Naive Bayes, J48, and logistic regression algorithms. Other algorithms such as Random Forest and the function-based multilayer perceptron (MLP) have also been applied successfully after data preprocessing, following which a correlation-based feature selection process can be used to eliminate redundant features (Alam et al., 2019).

1. Tree-based Models

A machine learning model that classifies an individual’s condition based on the occurrence of diabetes can be built using decision trees. The algorithm follows a hierarchical structure made up of branches, indicating qualities that impact the end result, and nodes where various possibilities are considered. Two categories are derived from this method: classification and regression. According to Dudkina et al. (2021), classification trees are favored in medical diagnostics because they organize symptoms in accordance with the known target class. This form of learning with labeled training data (the known target class) is referred to as supervised learning. To categorize data, the tree separates it into categories and constructs sequences of "if... then..." rules (Dudkina et al., 2021). Traversing the branches allows the computer to match the symptoms to the target class, thereby predicting diabetes.

Decision-tree analysis was employed by Chang et al. (2023) to determine the relationships between risk variables that influence the levels of HbA1c, also known as glycated hemoglobin, in diabetic patients. The researchers further found depression to be a core component among type 2 diabetes patients. The decision tree algorithm used in the study indicated three pathways with risk features linked to poor blood glucose control in patients with diabetes mellitus.

To reduce the risk of overfitting relative to a single tree, Random Forest (RF) can be considered. Random Forest entails a training process in which multiple decision trees are aggregated to produce a single output. This machine learning algorithm is particularly well suited to medical diagnostics because of its ability to model complex, multi-feature data. There are other reasons why this algorithm has gained traction in the healthcare industry. Haripriya et al. (2021) stated that RFs are competent with both categorical and numerical data in predictive tasks. Its integrated cross-validation capacity allows features/predictors to be ranked according to their impact on the dependent variable. In the context of disease prediction, this could be advantageous for clinicians, as they can discover which predictor affects the outcome variable the most and promptly act on it.

On a related note, Pal et al. (2022) investigated several supervised learning algorithms for reliable diabetes prediction. The study found that the Random Forest model had the highest accuracy among those compared. Allen et al. (2022) similarly used Random Forest to diagnose type 2 diabetes mellitus and obtained a high accuracy score of 82%, outperforming traditional statistical methodologies.

2. Rule-based System

Natural Language Processing (NLP) applications in medical systems have been hailed as a critical component, helping health services to fulfill the demands of personalized healthcare. Berge et al. (2023) also mentioned that in the 1970s, rule-based NLP systems designed for retrieving structured clinical data from narrative texts were implemented and have since been effectively employed. That said, although this methodology has a good accuracy record, these systems may not perform well if terms appearing in the narrative cannot be found in their lexicon. Medical language contains a multitude of jargon, technical terms, and grammatical structures, not to mention abbreviations and incorrect spellings (Berge et al., 2023). Hence, according to Jonnalagadda et al. (2011), for these systems to be effective at information extraction, they must rely on clinical experts to verify the legitimacy of deterministic rules, as well as to develop and maintain the quality of medical lexical resources.

C. State of the Art Approaches in Diabetes Prediction

1. Deep Learning in Diabetic Retinopathy Detection

In detecting diabetic retinopathy (DR), deep learning techniques have proven to be accurate and effective, and considerably less error-prone than traditional manual methods. DR is a chronic medical condition that requires early identification to avoid grave consequences (Das et al., 2022). Furthermore, according to Das et al. (2022), many studies have shown that deep learning models surpass conventional machine learning algorithms when dealing with large and complex datasets, generalizing better for DR. Convolutional neural networks (CNN) and ensemble learning methods are some of the more popular approaches in this regard that use retinal images to make early predictions. Recent developments in CNNs have rendered them a popular method in image classification using a hierarchy of features (Xu et al., 2017). In the study by Xu et al. (2017), the efficacy of this method was investigated based
on real retina data, with results showing 94.5% accuracy, higher than earlier classifiers based on hand-crafted features.

In short, CNNs have a considerable advantage over manual feature selection approaches since they can extract hierarchical features from raw images. They may identify intricate patterns in data that human specialists or simpler models may overlook. CNN architectures, including ResNet and VGGNet, are increasingly employed in medical image analysis (Mall et al., 2023). Early identification of diabetic retinopathy can avert serious complications including blindness. In this regard, CNNs’ ability to detect minuscule abnormalities in retinal scans far more quickly than standard approaches makes them a valuable tool in diabetes control.

2. Hybridization of Deep Learning and Traditional Methods

Hybrid models can be used to further enhance predictive performance and interpretability in diagnosing diabetes by combining deep learning frameworks (for example, CNNs) with classic machine learning methods such as Random Forest or Logistic Regression. For instance, Simaiya et al. (2022) proposed a novel multistage ensemble technique that blends neural networks and decision trees for diabetes prediction, demonstrating greater accuracy and recall than standalone models.

Such a hybridized approach can leverage deep learning’s capacity to detect complicated patterns in unstructured data (scans, images, genomic data, and so forth) while utilizing more interpretable models for structured data, offering a balanced mix of predictive accuracy and clinical interpretability. In clinical settings, these hybrid models can offer medical practitioners a comprehensive yet easy-to-understand decision framework while harnessing deep learning’s predictive power to the fullest.

D. AI-powered Patient Care

The introduction of chatbots in healthcare services has opened new avenues in medical communication, fostering relationships between clinics and patients. The adoption of AI, deep learning, and artificial neural networks in chatbot technology helps the bots deliver their services in a sensible and empathetic manner (Siddique & Chow, 2021). For instance, chatbot systems (for example, Mandy and Nurse Chatbot) use AI-powered natural language processing (NLP) to support patient intake, offer assistance 24/7, and provide personalized care through intelligent dialogue systems that replicate human conversation.

AI and ML have become prevalent in radiology and radiotherapy, most notably through chatbots and virtual support networks (Siddique & Chow, 2021). NLP is implemented in these systems to cluster patient data for further analysis of patient behavior and clinical parameters. Radiotherapy uses cloud computing and machine learning algorithms, including Monte Carlo simulations, to enhance treatment planning and optimize dosage computation, especially in demanding instances like breast cancer.

E. Implications and Future Development

Within the context of diabetes prediction, early detection of those at risk is critical for successful intervention strategies. However, widespread diabetes testing would be expensive, laborious, and stressful for medical personnel (Chowdhury et al., 2024). Perhaps one of the more pressing concerns related to the use of AI/ML in healthcare is transparency. For instance, the application of deep learning in image processing is complex and difficult to interpret fully. If a patient is notified that a scan has led to a severe diagnosis, they will want to know how this conclusion was reached. Even healthcare professionals who are accustomed to the workings of deep learning may struggle to explain how the models work (Davenport & Kalakota, 2019). AI/ML is not error-free and will likely make mistakes in diagnosis and prediction, yet holding such systems accountable would be a challenging task. Therefore, it is important that medical institutions develop a coherent structure to monitor risks and set up proper governance to prevent adverse outcomes from materializing.

The most urgent matter in this regard is not so much determining whether the algorithms would be effective as validating their adoption in routine clinical settings. For ML to be widely accepted and thrive in medical operations, the systems must be approved by governing bodies, integrated with EHR systems, standardized, taught to healthcare workers, and updated constantly (Davenport & Kalakota, 2019).

3. METHODOLOGY

The dataset in this study is the Pima Indians Diabetes Dataset, which is administered by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset consists of several diagnostic indicators, featuring 768 instances across 9 attributes (eight predictors plus the class label). The predictors are as follows:
1. Preg - The number of times the subject had been pregnant
2. Plas - Plasma glucose concentration measured 2 hours after consuming a glucose solution
3. Pres - Diastolic blood pressure (mm Hg)
4. Skin - Triceps skinfold thickness (mm)
5. Insu - 2-hour serum insulin
6. Mass - Body mass index
7. Pedi - Diabetes pedigree function
8. Age - Age of the subject measured in years

The aim of the study is to use all of the above-mentioned attributes to predict the ‘Class’ of the patient. ‘Class’ is the dependent variable, which takes the values ‘tested positive’ and ‘tested negative’. To carry out the analysis, the data was loaded into the Waikato Environment for Knowledge Analysis software, otherwise known as ‘WEKA’. All the attributes are summarized in Figure 1 below.

Figure 1: Dataset Summary Captured on WEKA

The findings were evaluated using seven metrics: accuracy, precision, sensitivity, specificity, F-score, ROC Area, and MCC (Matthews Correlation Coefficient). These metrics are derived from the confusion matrix, which tabulates the actual against the predicted classes on the testing set. The formulas for the metrics, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{4}$$

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$

3.1 Data Pre-processing

Prior to training the models, data processing techniques are used to significantly increase the overall quality of the training. Data preprocessing is an important phase in the information discovery process since quality data is the cornerstone for improving decision making. This step involved several strategies such as data normalization, transformation, and reduction (see Feature Engineering).

From the dataset in Figure 1, one can see an imbalanced distribution between the positive and negative classes, with the former significantly outnumbered by the latter (268 versus 500). Such an imbalance might bias the classifier toward the majority class, resulting in better predictions for the majority than for the minority class (Jadhav et al., 2022). The data preparation process could be helpful in improving model performance with imbalanced data. Data normalization helps standardize feature magnitudes, ensuring that features are on a comparable scale to minimize the impact of the majority class on the training process.

Moreover, given that the dataset’s attributes are on different scales, they must be normalized in order to achieve a better outcome when executing the models. The data was normalized by rescaling the attributes to the range of 0 to 1. Normalization is particularly apt when the data distribution is unknown or not Gaussian (a bell curve), which is the case in this study (see Figure 1). The process was carried out in WEKA by applying the ‘Normalize’ filter.

Following normalization, the dataset was evaluated through 10-fold cross validation (the default setting) with nine algorithms drawn from tree-based, function-based, and rule-based classifiers. Tree classifiers include ‘Random Forest’, ‘J48’, and ‘Random Tree’; function classifiers consist of ‘Logistic Regression’ (Log Reg), ‘Multilayer Perceptron’ (MLP), and ‘Stochastic Gradient Descent’ (SGD); whereas rule-based classifiers encompass ‘Java Repeated Incremental Pruning’ (JRip), ‘One Rule’ (OneR), and ‘Partial Decision Tree’ (PART). Each algorithm was tested 10 times (10 runs of 10-fold cross validation) using various random number seeds. This yielded 10 slightly varied outcomes for each assessed algorithm, a small sample that may be examined by statistical methods later.

4. ANALYSIS AND DISCUSSION

This section provides a comprehensive look into how different machine learning algorithms perform on the dataset. The classifiers are categorized into three types: Trees, Function, and Rules. These classifiers’ performances were recorded in tabular form, using metrics such as accuracy, sensitivity/recall, specificity, precision, F-score, ROC, and MCC to assess class-label prediction.
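
As a minimal sketch of how the confusion-matrix metrics of Equations (1) to (6) can be computed, the Python function below takes the four counts TP, TN, FP, and FN. The function name `confusion_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are illustrative assumptions; they are not code or results from the study, which was carried out in WEKA.

```python
from math import sqrt


def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity, F1 and MCC
    from the four cells of a binary confusion matrix."""

    def ratio(num, den):
        # Assumption: report 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)          # Eq. (1)
    precision = ratio(tp, tp + fp)                        # Eq. (2)
    recall = ratio(tp, tp + fn)                           # Eq. (3)
    specificity = ratio(tn, tn + fp)                      # Eq. (4)
    f1 = ratio(2 * precision * recall, precision + recall)  # Eq. (5)
    mcc = ratio(
        tp * tn - fp * fn,
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )                                                     # Eq. (6)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only so that positives (tp + fn = 268)
    # and negatives (tn + fp = 500) match the class sizes of the dataset.
    example = confusion_metrics(tp=160, tn=420, fp=80, fn=108)
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
```

The sketch treats ‘tested positive’ as the positive class, which is an assumption. WEKA can also report per-class and weighted-average figures, so values produced this way need not match a WEKA summary exactly.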

Table 1. Performance of Tree-based Algorithms

              Tree-Based Method
Metric        Random Forest   J48     Random Tree
Accuracy      0.758           0.738   0.681
Specificity   0.836           0.814   0.746
Precision     0.754           0.735   0.684
Recall        0.612           0.632   0.560
F1-Score      0.755           0.736   0.682
MCC           0.458           0.417   0.303
ROC Area      0.820           0.751   0.653

Note: J48 - C4.5 Algorithm; MCC - Matthews Correlation Coefficient; ROC - Receiver Operating Characteristic

Table 2. Performance of Function-based Algorithms

              Function-based Method
Metric        Log     MLP     SGD
Accuracy      0.772   0.754   0.78
Specificity   0.718   0.832   0.896
Precision     0.767   0.75    0.776
Recall        0.571   0.608   0.56
F1-Score      0.765   0.751   0.771
MCC           0.48    0.449   0.497
ROC Area      0.832   0.793   0.73

Note: Log - Logistic Regression; MLP - Multilayer Perceptron; SGD - Stochastic Gradient Descent; MCC - Matthews Correlation Coefficient; ROC - Receiver Operating Characteristic.

Table 1 looks at the performance of the Random Forest, J48, and Random Tree classifiers. Random Forest is an ensemble method that aggregates the outputs of multiple decision trees to enhance the model’s accuracy while also reducing overfitting. J48, an implementation of Quinlan’s C4.5 algorithm, is noted for its ability to simplify decision trees and make them more interpretable (Chang et al., 2022). Interpretability is a crucial factor in explaining how a prediction or diagnosis is reached, and for this reason J48 was considered for this study. Random Tree is another straightforward algorithm used in this study, in which each tree is constructed by selecting random features. The simplicity of the model makes it less computationally demanding than Random Forest, which might be suitable for cases where quick decision making is required.

Overall, Random Forest emerges as the strongest performer among the tree-based algorithms, ranking ahead of J48 and Random Tree on every metric except Recall, where it is bested by J48 (0.632 versus 0.612). Furthermore, its ability to work with large feature spaces, as well as its resistance to overfitting, makes it the preferable tree-based model for this task.

However, a comparable study by Bhat et al. (2023) found that XGBoost and LightGBM, both ensemble learning techniques, outperform standard Random Forest due to their superior boosting techniques. Using XGBoost, their diabetes prediction score was as high as 85% (Bhat et al., 2023). These models are intended to handle big, imbalanced datasets more effectively and provide quicker training times than RF.

After the tree-based algorithms, the function classifiers were examined: Logistic Regression, Multilayer Perceptron, and Stochastic Gradient Descent (Table 2). Logistic Regression (LR) is a highly interpretable model employing a linear combination of input features to compute the final binary outcome, for instance, having or not having diabetes. Conversely, Multilayer Perceptron models complex, non-linear relationships between input features and output, making it better suited to detecting intricate details that logistic regression might miss. Finally, Stochastic Gradient Descent (SGD) was considered due to its computational efficiency, as it is capable of handling large datasets and high-dimensional data (Fjellström & Nyström, 2022). The algorithm’s high scalability makes it applicable to a wide range of models, from linear models to neural networks.

Based on the results in Table 2, SGD is the top-performing function-based algorithm for diabetes prediction. Its performance is relatively high across almost all metrics, scoring particularly well in Precision and Specificity, two elements that are crucial in chronic health diagnosis (Grabler et al., 2017). That said, Logistic Regression remains a good model due to its simplicity and interpretability, achieving the highest ROC Area score. Although MLP’s scores are the lowest among the function classifiers on most metrics, the algorithm still shows promise thanks to its relatively high recall score compared with the other two classifiers. Its low performance in other aspects might suggest that this dataset is not large enough to make full use of its capabilities.

As observed, the function-based methods appear to have performed better than the tree-based ones. These
three function classifiers, especially SGD, can deal with high dimensional data. However, when compared with deep learning models, they are less effective. This can be seen in a study by Xu et al. (2017), who reported an accuracy score of over 94% when implementing CNNs to diagnose diabetic retinopathy from retinal images. Nevertheless, scalability and speed remain among SGD's advantages. In contrast, deep learning models, while more powerful, typically require more computing resources and are less interpretable.

Table 3. Performance of Rule-based Algorithms

Rules-based Method
Metric        JRip    OneR    PART
Accuracy      0.76    0.715   0.753
Specificity   0.856   0.866   0.844
Precision     0.755   0.703   0.747
Recall        0.582   0.433   0.582
F1-Score      0.755   0.699   0.748
MCC           0.457   0.334   0.441
ROC Area      0.739   0.649   0.794

Note: JRip - Java Repeated Incremental Pruning to Produce Error Reduction; OneR - One Rule; PART - Partial Decision Trees; MCC - Matthews Correlation Coefficient; ROC - Receiver Operating Characteristic.

Table 3 above shows the performance of three rules-based algorithms: JRip, OneR, and PART. These algorithms are particularly beneficial in situations where interpretability and simplicity are essential. They generate simple decision rules for healthcare predictions in various contexts. The first among them is JRip, an efficient rule-learning algorithm that builds sets of rules on association principles and reduced error pruning (Simaiya et al., 2022). It strikes a balance between the model's complexity and predictive accuracy, a critical aspect in medical diagnostics where accuracy and interpretability are equally vital. The second rule classifier is OneR, which creates a single rule from the attribute with the highest classification accuracy. It evaluates each attribute individually and picks the one that produces the best results. This method is highly interpretable, and its simplicity makes it easy to apply. Lastly, PART, short for 'Partial Decision Tree', is an algorithm that incorporates components of decision trees and rule-based classifiers. As its name suggests, PART develops partial decision trees, which are subsequently converted into a set of rules.

Based on the results indicated in Table 3, JRip performs the best in most metrics, achieving the highest scores among all the rules-based classifiers in accuracy, precision, F-1 measure, and recall (same score as PART), making it a fine choice for diabetes diagnosis where balanced performance is deemed necessary. OneR has the highest specificity score but falls short in other metrics. Meanwhile, PART scores the highest on the ROC Area, which highlights its ability to distinguish between positive and negative cases. From this observation, it can be deduced that the strong performances of JRip and PART are attributable to their more complex algorithms compared to OneR. JRip's iterative pruning enables it to generalize better, as evidenced by its balance of accuracy and precision, whereas PART's hybridization of partial decision trees and rules-based learning provides enough flexibility to deal with complex data.

Table 4. Comparative Performance of All Algorithms

In terms of the three methods' overall performance, function-based classifiers, such as Logistic Regression and SGD, slightly edge ahead of the tree-based and rule-based methods, offering high accuracy and F-1 scores that signify a good balance between precision and recall. The rule-based method stands out with respect to interpretability. JRip and OneR create rules that are easy to understand but also have enough complexity to handle intricate patterns. This interpretability is of great value in clinical diagnostics, where understanding the basis for a prediction is critical. Lastly, with reference to robustness, Random Forest, the best performing classifier among the tree-based methods, can also be considered given its adeptness at handling complex datasets with minimal risk of overfitting.

As for comparing each individual classifier across all methods, Stochastic Gradient Descent (SGD) emerges as the most effective algorithm based on a comprehensive overview of the metrics. It achieves the highest scores in nearly all metrics, except for Recall and ROC Area. The algorithm's high specificity and precision minimize the likelihood of false positives, but
its sensitivity/recall remains doubtful compared to other algorithms.

After SGD, Logistic Regression is another excellent algorithm in the function-based method, especially given its large ROC Area, which indicates its effectiveness in distinguishing diabetic from non-diabetic patients (that is, in trading off true positives against false positives). It also has a higher recall score than SGD, and recall is deemed the most important metric in medical diagnosis.

In instances where the simplicity of rules is valued, JRip proves to be the best alternative. Within the context of healthcare, particularly when one must account for the degree to which the model will be applied in practice, JRip is most likely the best choice because of its interpretability, which is extremely useful for medical decision-making. However, if predictive performance is prioritized, Random Forest should also be considered for its higher Recall and its robustness, exemplified by its larger ROC Area.

Figure 2: Accuracy & Precision Performance

In terms of accuracy, SGD, Logistic Regression, and JRip are the top three performers, scoring 0.78, 0.772, and 0.76, respectively. This means that these three algorithms predict the correct outcome most often. However, accuracy alone can be misleading, especially within the context of medical diagnosis where there are many other crucial factors that need to be considered, such as Recall, Specificity, F-1, and Precision. For precision, these three algorithms remain the top performers, indicating that when a case is predicted as diabetic, the prediction is likely correct.

J48 and Random Forest are most effective at identifying positive diabetic cases, considering their high recall scores. Surprisingly, despite performing well in accuracy and precision, SGD's recall is lower than expected, which suggests that this algorithm might misclassify many diabetic cases (false negatives). Recall is particularly important in disease prediction because a high recall minimizes the probability of false negatives, in which a diabetic patient is mistakenly identified as non-diabetic, possibly leaving the condition untreated (Government of Canada, 2023).

Figure 4: F1-score Performance

SGD once again comes out on top with the highest F1-score, followed by Logistic Regression and then JRip. This implies that these algorithms maintain a balance between precision and recall. In addition, as the ROC Area performance in Figure 5 shows, Logistic Regression demonstrates a strong performance with the largest ROC Area (0.832), showing its capability to separate true positive cases from false positive cases.
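The metrics compared across the classifiers (accuracy, precision, recall, specificity, F1-score, and MCC) can all be derived from the four cells of a binary confusion matrix. The following Python sketch is illustrative only: the function name `binary_metrics` and the counts are hypothetical and are not taken from the study, whose reported figures may be averaged differently by the tool that produced them. It assumes every denominator is non-zero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Metrics for the positive (diabetic) class from confusion-matrix counts.

    tp/fn: diabetic patients predicted diabetic / non-diabetic.
    tn/fp: non-diabetic patients predicted non-diabetic / diabetic.
    Assumes no denominator is zero (illustrative sketch only).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)        # how many predicted positives are real
    recall = tp / (tp + fn)           # sensitivity: how many real positives are found
    specificity = tn / (tn + fp)      # how many real negatives are found
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 100 diabetic and 100 non-diabetic patients.
example = binary_metrics(tp=60, fp=20, tn=80, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.70 while recall is only 0.60: 40 of the 100 diabetic patients are missed, which is the kind of gap that accuracy alone does not reveal.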

Figure 3: Recall Performance

Figure 5: ROC Area Performance

4.1 Feature Engineering

Feature engineering is the process of choosing, transforming, and producing useful input features from raw data for supervised machine learning (Patel, 2024). For evaluating whether the machine learning model could be improved by feature selection, ‘InfoGainAttributeEval’
was chosen as the evaluator to rank the usefulness of the features.

The following features were ranked in order: plas, mass, age, insu, skin, preg, pedi, and pres.

The attribute score above shows that plasma glucose concentration level is the most critical feature in diagnosing diabetes. Based on research, type 2 diabetes can develop as early as four to seven years before a clinical diagnosis (Gurung et al., 2024). The plasma glucose level can be used as the basis for monitoring the disease and predicting complications among Type 2 diabetic patients. On a related note, the American Diabetes Association (n.d.) has established the fasting plasma glucose limits for prediabetes (100-125 mg/dL) and diabetes (126 mg/dL or higher). For this reason, it is not unexpected that this attribute is placed first, as it serves as the foundation for diabetes diagnosis.

The second most important feature is ‘Body Mass Index’. This is unsurprising, since obesity is often deemed a key risk factor for Type 2 diabetes (Daley & Yashi, 2023). BMI is an indicator of body fat based on height and weight. A high BMI score is closely associated with an increased risk of being diagnosed with diabetes. Obesity elevates insulin resistance, impairing the ability of the body’s cells to respond to insulin and resulting in increased blood sugar levels. Research by Ganz et al. (2014) shows that people with a BMI above 30 are much more likely to develop Type 2 diabetes.

Ranked behind BMI is ‘Age’. Research has shown that the likelihood of developing diabetes grows with age (Mordarska & Godziejewska-Zawada, 2017). Clinical studies have shown that people over the age of 45 have a high risk of developing diabetes (Flores et al., 2020). This is attributable to the insulin secretion deficit and insulin resistance that develop with age and changes in body composition.

Next is the ‘Insu’ feature, which stands for 2-hour Serum Insulin and measures an individual’s insulin levels following a 2-hour glucose tolerance test, showing how the body might react to glucose. Insulin levels in insulin-resistant individuals are usually high because the pancreas has to overwork to produce more insulin to overcome the resistance (Wilcox, 2005). However, insulin production may decrease over time as the pancreas loses effectiveness. Tracking this change can be vital for detecting prediabetes and Type 2 diabetes.

Ranked fifth is the ‘Skin’ feature, which uses skinfold thickness as a baseline for measuring body fat percentage. Its lower ranking may be because skinfold thickness is not as direct or effective a predictor of diabetes as BMI, which provides a more comprehensive picture of a person's weight in relation to their height. Skinfold thickness may give supplementary information, although it is not usually a primary factor in diabetes risk assessment.

The last three attributes, ‘Preg’ (pregnancy), ‘Pedi’ (diabetes pedigree function), and ‘Pres’ (diastolic blood pressure), are the least significant in predicting diabetes. Although their attribute scores are low, they add context that helps fine-tune the prediction, especially for certain subgroups such as older individuals and women with a history of gestational diabetes.

Overall, the ‘Info Gain Evaluator’ facilitates better understanding of the features’ influence on the outcome variable. Judging from the evaluator, feature ‘Plas’ (plasma glucose concentration level) is ranked first with the highest score, whereas feature ‘Pres’ (diastolic blood pressure) comes last. The Random Forest classifier was used as the basis for testing whether removing the lowest-scoring feature would improve the model’s performance.

4.2 Feature Reduction Analysis

Figure 6: RF’s Performance Before Feature Removal

Figure 7: RF’s Performance After Feature Removal

As evident from the above figures, the performance of Random Forest slightly improves across almost all fronts. The recall score also increases, signifying the model’s stability in fighting against misclassification. Although the ROC Area decreases marginally from 0.820 to 0.814, the results nevertheless suggest that feature reduction does help streamline the model without compromising its predictive ability. After this was done, other algorithms were tested to see if they would have the same outcome as Random Forest.
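The ranking-then-removal procedure described in this section can be sketched in a few lines of Python. This is a simplified illustration, not the evaluator used in the study: it scores each numeric feature by information gain after a single median split (an assumption made here for brevity), and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one numeric feature after a median split.

    Assumption: a single threshold at the (upper) median stands in for
    whatever discretization a real evaluator would apply.
    """
    threshold = sorted(values)[len(values) // 2]
    low = [y for x, y in zip(values, labels) if x < threshold]
    high = [y for x, y in zip(values, labels) if x >= threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (feature name, information gain) pairs, best first."""
    columns = list(zip(*rows))
    scores = {name: info_gain(col, labels) for name, col in zip(names, columns)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: (plas, pres) per patient, label 1 = diabetic.
names = ["plas", "pres"]
rows = [(85, 70), (90, 80), (95, 72), (100, 82),
        (150, 71), (160, 81), (170, 73), (180, 83)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ranking = rank_features(rows, labels, names)
kept = [name for name, _ in ranking[:-1]]  # drop the lowest-ranked feature
print(ranking)
print("features kept for retraining:", kept)
```

In this toy data the glucose column separates the two classes perfectly while the blood pressure column carries no information, so `pres` is the feature dropped before a classifier would be retrained and compared against the full-feature baseline.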

Aside from Random Forest, feature reduction has proven advantageous to J48 and Logistic Regression, with slight improvements in key areas. Conversely, SGD and JRip do not gain from the process but rather experience a marginal drop in performance. This might be due to their reliance on the complete feature set in order to perform optimally.

From this one can see that rule-based classifiers, such as JRip and OneR, are most impacted by feature engineering because they rely on simple, interpretable rules, making them very sensitive to feature significance. Tree-based classifiers, Random Forest being a prime example, see an improvement partly because they can take advantage of feature significance within their frameworks. Function-based classifiers, such as Logistic Regression, benefit from feature reduction since it simplifies the model and helps prevent overfitting, whereas SGD and MLP are more resistant to feature engineering, although feature reduction still improves performance.

5. CONCLUSION

In conclusion, this research has investigated the performance of various machine learning algorithms in predicting and diagnosing diabetes. The primary goal was to investigate three machine learning methods (tree-based, rule-based, and function-based) to find the most suitable algorithm for clinical use. While SGD, a function-based algorithm, scores high in almost all the metrics, its low recall score is a major weakness, since misclassifying diabetic patients (that is, producing false negatives) poses grave danger in a healthcare context, resulting in the disease being left untreated. Logistic Regression and Random Forest may be strong candidates in this case because of their more balanced performance. Logistic Regression, with its large ROC area, high precision, accuracy, and F-1 measure, is good for accurately identifying diabetic and non-diabetic patients across different thresholds. However, Random Forest’s robustness and well-balanced metrics, and most importantly its high recall score, make it an optimal classifier in a case like this, where false negatives must be reduced at all costs for early intervention.

Key features such as plasma glucose concentration, body mass index, and age were the top three most influential predictors in the dataset, thereby demonstrating their usefulness in assessing the risk of diabetes. Feature engineering was also conducted by assessing the significance of each feature on the outcome variable. This process has shown that by modifying or removing a feature of low significance, the model’s performance can be slightly improved. The ‘Info Gain Attribute Eval’ method was employed and proved effective in using feature reduction to stabilize classifiers’ performances through relevant attributes.

The findings in this study have great implications for clinical diagnostics. Machine Learning classifiers, as shown throughout this study, can help identify diabetic patients. With early identification, doctors can provide these patients with personalized treatment and improve their quality of life. However, despite these findings, there are a few limitations that should be noted. To start, this dataset might not accurately reflect the general population since it is drawn from a specific group of people (Pima Indians). This shortcoming may impair the model's ability to generalize to other groups of patients. Moreover, although the dataset is comprehensive, it might have overlooked other attributes that are more applicable to other demographics.

Therefore, future studies must address the above-mentioned constraints to improve the machine learning classifiers’ credibility and make them applicable to real-world healthcare practice. For instance, other health-related datasets with features comparable to those in this study can be used and assessed to broaden its scope. On the same note, further research on this subject should attempt to incorporate as many data points as possible from the same group of patients over time. Doing so may culminate in the building of a sophisticated prediction and diagnosis model for diabetes intervention. According to Agliata et al. (2023), such an advanced model could be accomplished through complex pattern and context identification, as well as neural network technologies, for example, Long Short-Term Memory models.

REFERENCES
[1] Agliata, A., Giordano, D., Bardozzo, F., Bottiglieri, S., Facchiano, A., & Tagliaferri, R. (2023). Machine learning as a support for the diagnosis of Type 2 diabetes. International Journal of Molecular Sciences, 24(7), 6775. https://doi.org/10.3390/ijms24076775
[2] Ahmed, T. M. (2016). Using data mining to develop models for classifying diabetic patient control level based on historical medical records. Journal of Theoretical and Applied Information Technology, 87(2), 316.
[3] Alam, T. M., Iqbal, M. A., Ali, Y., Wahab, A., Ijaz, S., Baig, T. I., Hussain, A., Malik, M. A., Raza, M. M., Ibrar, S., & Abbas, Z. (2019). A model for early prediction of diabetes. Informatics in Medicine Unlocked, 16, 100204. https://doi.org/10.1016/j.imu.2019.100204
[4] Allen, A., Iqbal, Z., Green-Saxena, A., Hurtado, M., Hoffman, J., Mao, Q., & Das, R. (2022). Prediction of diabetic kidney disease with machine learning algorithms, upon the initial diagnosis of type 2 diabetes mellitus. BMJ Open Diabetes Research & Care, 10(1), e002560. https://doi.org/10.1136/bmjdrc-2021-002560
[5] American Diabetes Association. (n.d.). Diabetes Diagnosis & Tests. https://diabetes.org/about-diabetes/diagnosis
[6] Baadel, S., Thabtah, F., & Lu, J. (2020). A clustering approach for autistic trait classification. Informatics for Health and Social Care, 45(3), 309–326.
[7] Berge, G. T., Granmo, O., Tveit, T. O., Ruthjersen, A. L., & Sharma, J. (2023). Combining unsupervised, supervised and rule-based learning: the case of detecting patient allergies in electronic
health records. BMC Medical Informatics and Decision Making, 23(1). https://doi.org/10.1186/s12911-023-02271-8
[8] Bhat, S. S., Banu, M., Ansari, G. A., & Selvam, V. (2023). A risk assessment and prediction framework for diabetes mellitus using machine learning algorithms. Healthcare Analytics, 4, 100273. https://doi.org/10.1016/j.health.2023.100273
[9] Chang, V., Bailey, J., Xu, Q. A., & Sun, Z. (2022). Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Computing and Applications, 35(22), 16157–16173. https://doi.org/10.1007/s00521-022-07049-z
[10] Chowdhury, M. M., Ayon, R. S., & Hossain, M. S. (2024). An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset. Healthcare Analytics, 5, 100297. https://doi.org/10.1016/j.health.2023.100297
[11] Daley, S., & Yashi, K. (2023). Obesity and Type 2 Diabetes. https://www.ncbi.nlm.nih.gov/books/NBK592412/
[12] Das, D., Biswas, S. K., & Bandyopadhyay, S. (2022). Detection of Diabetic Retinopathy using Convolutional Neural Networks for Feature Extraction and Classification (DRFEC). Multimedia Tools and Applications, 82(19), 29943–30001. https://doi.org/10.1007/s11042-022-14165-4
[13] Davenport, T., & Kalakota, R. (2019). The potential for artificial intelligence in healthcare. Future Healthcare Journal, 6(2), 94–98. https://doi.org/10.7861/futurehosp.6-2-94
[14] Dudkina, T., Meniailov, I., Bazilevych, K., Krivtsov, S., & Tkachenko, A. (2021). Classification and Prediction of Diabetes Disease using Decision Tree Method. Symposium on Information Technologies & Applied Sciences, 163–172. http://ceur-ws.org/Vol-2824/paper16.pdf
[15] Fjellström, C., & Nyström, K. (2022). Deep learning, stochastic gradient descent and diffusion maps. Journal of Computational Mathematics and Data Science, 4, 100054. https://doi.org/10.1016/j.jcmds.2022.100054
[16] Flores, Y. N., Toth, S., Crespi, C. M., Ramírez-Palacios, P., McCarthy, W. J., Briseño-Pérez, A., Granados-García, V., & Salmerón, J. (2020). Risk of developing pre-diabetes or diabetes over time in a cohort of Mexican health workers. PLoS ONE, 15(3), e0229403. https://doi.org/10.1371/journal.pone.0229403
[17] Ganz, M. L., Wintfeld, N., Li, Q., Alas, V., Langer, J., & Hammer, M. (2014). The association of body mass index with the risk of type 2 diabetes: a case-control study nested in an electronic health records system in the United States. Diabetology & Metabolic Syndrome, 6(1), 50. https://doi.org/10.1186/1758-5996-6-50
[18] Government of Canada. (2023, October 23). Care during pregnancy: Family-centred maternity and newborn care national guidelines. Canada.ca. https://www.canada.ca/en/public-health/services/publications/healthy-living/maternity-newborn-care-guidelines-chapter-3.html
[19] Grabler, P., Sighoko, D., Wang, L., Allgood, K., & Ansell, D. (2017). Recall and cancer detection rates for screening mammography: finding the sweet spot. American Journal of Roentgenology, 208(1), 208–213. https://doi.org/10.2214/ajr.15.15987
[20] Gurung, P., Zubair, M., & Jialal, I. (2024, February 27). Plasma glucose. StatPearls - NCBI Bookshelf. https://www.ncbi.nlm.nih.gov/books/NBK541081/
[21] Haripriya, G., Abinaya, K., Aarthi, N., & Kumar, P. (2021). Random Forest Algorithms in Health Care Sectors: A Review of Applications.
[22] Hounguè, P., & Bigirimana, A. G. (2022). Leveraging PIMA Dataset to diabetes Prediction: case study of deep neural networks. Journal of Computer and Communications, 10(11), 15–28. https://doi.org/10.4236/jcc.2022.1011002
[23] Iparraguirre-Villanueva, O., Espinola-Linares, K., Castañeda, R. O. F., & Cabanillas-Carbonell, M. (2023). Application of machine learning models for early detection and accurate classification of Type 2 diabetes. Diagnostics, 13(14), 2383. https://doi.org/10.3390/diagnostics13142383
[24] Jadhav, A., Mostafa, S. M. M., Elmannai, H., & Karim, F. K. (2022). An empirical assessment of performance of data balancing techniques in classification task. Applied Sciences, 12(8), 3928. https://doi.org/10.3390/app12083928
[25] Javaid, M., Haleem, A., Singh, R. P., Suman, R., & Rab, S. (2022). Significance of machine learning in healthcare: Features, pillars and applications. International Journal of Intelligent Networks, 3, 58–73. https://doi.org/10.1016/j.ijin.2022.05.002
[26] Jonnalagadda, S., Cohen, T., Wu, S., & Gonzalez, G. (2012). Enhancing clinical concept extraction with distributional semantics. Journal of Biomedical Informatics, 45(1), 129–140. https://doi.org/10.1016/j.jbi.2011.10.007
[27] Kolasa, K., Admassu, B., Hołownia-Voloskova, M., Kędzior, K. J., Poirrier, J. E., & Perni, S. (2024). Systematic reviews of machine learning in healthcare: a literature review. Expert Review of Pharmacoeconomics & Outcomes Research, 24(1), 63–115. https://doi.org/10.1080/14737167.2023.2279107
[28] Mall, P. K., Singh, P. K., Srivastav, S., Narayan, V., Paprzycki, M., Jaworska, T., & Ganzha, M. (2023). A comprehensive review of deep neural networks for medical image processing: Recent developments and future opportunities. Healthcare Analytics, 4, 100216. https://doi.org/10.1016/j.health.2023.100216
[29] Mordarska, K., & Godziejewska-Zawada, M. (2017). Diabetes in the elderly. Menopause Review/Przegląd Menopauzalny, 16(2), 38–43. https://doi.org/10.5114/pm.2017.68589
[30] Mousa, A., Mustafa, W., Marqas, R. B., & Mohammed, S. H. M. (2023). A comparative study of diabetes detection using the PIMA Indian Diabetes Database. The Journal of the University of Duhok, 26(2), 277–288. https://doi.org/10.26682/sjuod.2023.26.2.24
[31] Nguyen, L. P., Tung, D. D., Nguyen, D. T., Le, H. N., Tran, T. Q., Van Binh, T., & Pham, D. T. N. (2023). The utilization of machine learning algorithms for assisting physicians in the diagnosis of diabetes. Diagnostics, 13(12), 2087. https://doi.org/10.3390/diagnostics13122087
[32] Pal, S., Mishra, N., Bhushan, M., Kholiya, P. S., Rana, M., & Negi, A. (2022). Deep learning techniques for prediction and diagnosis of diabetes mellitus. 2022 International Mobile and Embedded Technology Conference (MECON). https://doi.org/10.1109/mecon53876.2022.9752176
[33] Patel, H. (2024, April 29). Feature Engineering explained. Built In. https://builtin.com/articles/feature-engineering
[34] Salazar-Reyna, R., Gonzalez-Aleu, F., Granda-Gutierrez, E. M., Diaz-Ramirez, J., Garza-Reyes, J. A., & Kumar, A. (2020). A systematic literature review of data science, data analytics and machine learning applied to healthcare engineering systems. Management Decision, 60(2), 300–319. https://doi.org/10.1108/md-01-2020-0035
[35] Shetty, D., Rit, K., Shaikh, S., & Patil, N. (2017). Diabetes disease prediction using data mining. 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS). https://doi.org/10.1109/iciiecs.2017.8276012
[36] Siddique, S., & Chow, J. C. L. (2021). Machine learning in healthcare communication. Encyclopedia, 1(1), 220–239. https://doi.org/10.3390/encyclopedia1010021
[37] Simaiya, S., Kaur, R., Sandhu, J. K., Alsafyani, M., Alroobaea, R., Alsekait, D. M., Margala, M., & Chakrabarti, P. (2022). A novel
multistage ensemble approach for prediction and classification of diabetes. Frontiers in Physiology, 13. https://doi.org/10.3389/fphys.2022.1085240
[38] Sonia, J. J., Jayachandran, P., Quadir, A., MD, Mohan, S., Sivaraman, A. K., & Tee, K. F. (2023). Machine-Learning-Based diabetes mellitus risk prediction using Multi-Layer Neural Network No-PROP algorithm. Diagnostics, 13(4), 723. https://doi.org/10.3390/diagnostics13040723
[39] Wilcox, G. (2005). Insulin and insulin resistance. The Clinical Biochemist Reviews, 26(2), 19–39.
[40] Xu, K., Feng, D., & Mi, H. (2017). Deep Convolutional Neural Network-Based Early Automated detection of diabetic retinopathy using Fundus Image. Molecules, 22(12), 2054. https://doi.org/10.3390/molecules22122054
