IDP FINAL BATCH-16 SEC-G
Nagi Reddy Annapureddy
Department of CSE
Vignan’s Foundation for Science, Technology and Research
Guntur, India
naagireddy@gmail.com

Dr. Deva Kumar S
Associate Professor, Department of CSE
Vignan’s Foundation for Science, Technology and Research
Guntur, India
drsdk@gmail.com
III. METHODOLOGY

A CKD prediction system based on machine learning models is proposed in this research. The system is intended to use a range of machine learning methods, including KNN, SVM, NB, and MLP, to concentrate on the early diagnosis of CKD. To enhance precision, sensitivity, and specificity, an ensemble approach will also be developed to aggregate predictions from the various models. The system first preprocesses the patient data so that the records, including those with CKD status, can be used for training, as shown in Figure 1.

1. Data Gathering and Preparation
Data Source: Public clinical databases, the UCI CKD dataset, or hospital databases that hold patient records, including demographics, test results, and diagnoses, will serve as the data sources.

2. Model Selection
Predicting Chronic Kidney Disease (CKD) is the primary focus of the model-building stage, which comes after preprocessing the data. Once missing values have been addressed, categorical variables encoded, and numerical variables scaled, the prepared data will be used to build machine learning models that make predictions. Typical models include SVM, gradient boosting, logistic regression, decision trees, random forests, and MLPs. A portion of the data will be used to train each model in order to identify the links and patterns present in the features.

a. K-Nearest Neighbours (KNN)
Data points are categorized by the non-parametric KNN algorithm according to how close they are to other points in
the feature space. The model will be trained on a subset of the dataset, and its performance will be reviewed using metrics such as accuracy, precision, recall, and the F1 score. The optimal number of neighbours (k) for accurate predictions will be found using cross-validation.

b. Support Vector Machines (SVM)
SVM is an important classification technique that chooses the most effective hyperplane for class division. The model’s performance will be estimated based on its capability to generalise to new data. To find the best configuration for the CKD dataset, a range of kernel functions will be assessed, with a focus on accuracy and other classification criteria.

c. Naive Bayes (NB)
Based on Bayes’ theorem, the Naive Bayes classifier assumes that the predictors are independent of one another. The model will be trained on the dataset, and its predictive performance will be evaluated using the same criteria. Given the simplicity and effectiveness of Naive Bayes, it serves as a strong baseline for comparison against more complex models.

d. Multi-Layer Perceptron (MLP)
One kind of artificial neural network that works well for modelling intricate relationships in data is the MLP. It is made up of several interconnected layers of neurons. The model will be trained and evaluated using standard metrics, with an emphasis on performance optimisation through the tuning of hyperparameters such as the learning rate and the number of hidden layers.

3. Dimensionality Reduction
Dimensionality reduction lowers a high-dimensional data space to a more compact representation, with emphasis on the most informative features, which helps to:
Accelerate computations
Lower the chances of overfitting
Enhance visualization
Remove noise
Make the data easier to interpret

Following individual model evaluation, PCA will be used to lower the dataset’s dimensionality. Principal components are a new collection of features that PCA creates from the original feature space in order to capture the maximum variance in the data. This phase mitigates the curse of dimensionality, which speeds up model training and improves overall performance. PCA will be fitted on the training data, and the same transformation will be applied to the test data. The number of components kept will preserve a sizable portion of the data variance, say 95%. The models discussed above will then be run once more to observe the impact of dimensionality reduction on their predictive capabilities.

4. Model Assessment
Performance Metrics: The effectiveness of all the models will be evaluated using accuracy, precision, recall, F1-score, and the area under the ROC curve (ROC-AUC). These metrics are selected because they not only measure the model’s success in identifying CKD patients but also account for false positives and false negatives.
Accuracy: tests the overall efficiency of the model in yielding correct predictions.
Precision and Recall: performance measures that speak to the balance between capturing true positives and avoiding false negatives, which is of particular importance in CKD detection.
ROC-AUC: captures the trade-off between true positives and false positives across multiple decision thresholds.
Comparison of Models: The models will be compared according to these performance metrics. NB is taken as the benchmark; consistent with previous studies, more complex combinations, such as SVM together with KNN and NB, are expected to yield higher accuracy than each individual model.

5. Ensemble Methods
To increase prediction accuracy, we shall employ ensemble methods. A hybrid model of KNN, SVM, and NB will be used, with Majority Voting or Stacking techniques aggregating the individual predictions.

IV. RESULTS AND ANALYSIS
Several machine learning models, including SVM, NB, KNN, and MLP, were trained during the experimentation stage. A number of performance indicators are employed to substantiate the findings for these models. To determine whether each model can accurately predict CKD, performance is evaluated using accuracy, precision, recall, and F1 score. The results indicate that while each model has its advantages, the ensemble model outperforms the others due to its utilization of SVM, NB, KNN, and MLP predictions. The ensemble model’s success stems from its capacity to leverage the strengths of the various classifiers, which enhances robustness against overfitting and enables the capture of more intricate patterns in the dataset.

Confusion matrices were used to show the performance of each model in terms of false positives, false negatives, true positives, and true negatives. This graphical representation was essential for illustrating incorrect classifications and demonstrated a distinct separation between CKD and non-CKD patients.

We compared four classifiers, Support Vector Classifier (SVC), Gaussian Naive Bayes (GaussianNB), K-Nearest Neighbours (KNN), and Multi-Layer Perceptron (MLP), before and after applying Principal Component Analysis (PCA) in order to assess the effect of dimensionality reduction on model performance.
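The Majority Voting aggregation behind the hybrid model can be sketched in plain Python; the per-classifier outputs below are made-up stand-ins for illustration, not predictions from this study:

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate one prediction per base classifier into a single label.

    Ties are broken in favour of the label seen first among the inputs.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-patient outputs from KNN, SVM, and NB (1 = CKD, 0 = not CKD).
knn_preds = [1, 0, 1, 1]
svm_preds = [1, 0, 0, 1]
nb_preds  = [1, 1, 1, 1]

# Vote patient by patient across the three base classifiers.
ensemble = [majority_vote(p) for p in zip(knn_preds, svm_preds, nb_preds)]
print(ensemble)  # [1, 0, 1, 1]
```

Stacking would instead feed the base classifiers' outputs into a second-level learner rather than counting votes.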
Table 1: Performance Metrics

Model                    Accuracy  Precision  Recall  F1 Score
Naïve Bayes              0.95      0.96       0.95    0.95
Support Vector Machine   0.85      0.85       0.85    0.85
MLP                      0.62      0.39       0.62    0.48
KNN                      0.93      0.93       0.93    0.92

Table 2: Hybrid Model Performance (SVM, KNN, NB)

Metric     Value
Precision  0.98
Recall     0.98
F1-Score   0.98
Accuracy   0.98
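For reference, the four metrics tabulated above follow directly from confusion-matrix counts; a minimal sketch with hypothetical counts, not taken from these experiments:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # how many flagged cases were real CKD
    recall = tp / (tp + fn)               # how many real CKD cases were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 45 true CKD hits, 5 false alarms, 2 missed cases, 48 true negatives.
acc, prec, rec, f1 = classification_metrics(tp=45, fp=5, fn=2, tn=48)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))  # 0.93 0.9 0.957 0.928
```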
The accuracy measurements showed that PCA's dimensionality reduction had variable effects on model performance, as shown in Table 1. Notably, after PCA, the accuracies of the SVC and MLP classifiers increased from 85% and 62% to 88% and 89%, respectively, a modest improvement. On the other hand, GaussianNB and KNN continued to perform similarly, demonstrating the resilience of their decision boundaries in both high-dimensional and low-dimensional spaces. This investigation demonstrates how PCA can improve classifier efficiency, especially for models that stand to gain from smaller feature spaces, by reducing computational complexity and overfitting.

V. CONCLUSION

In this study, we explored the capabilities of various models in predicting chronic kidney disease. We began by analyzing each model separately, from SVM to Naive Bayes, KNN, and MLP, using accuracy, precision, recall, and F1-score metrics to evaluate their predictive effectiveness for CKD. In the subsequent step, PCA was applied to reduce the dimensionality of the data, minimizing the number of features while preserving the essential information for more efficient preprocessing. This preprocessing step is expected to enhance model performance by reducing overfitting and improving generalization to new data.
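The 95%-variance criterion used when applying PCA amounts to keeping the smallest number of leading components whose explained-variance ratios reach the threshold; a minimal sketch with illustrative ratios, not values computed from this dataset:

```python
def components_for_variance(explained_ratios, threshold=0.95):
    """Return how many leading principal components are needed so that
    their cumulative explained-variance ratio reaches the threshold.

    explained_ratios must be sorted in descending order, as PCA returns them.
    """
    total = 0.0
    for i, ratio in enumerate(explained_ratios, start=1):
        total += ratio
        if total >= threshold:
            return i
    return len(explained_ratios)  # keep everything if the threshold is never met

# Hypothetical explained-variance ratios for a 5-feature dataset.
ratios = [0.60, 0.20, 0.10, 0.06, 0.04]
print(components_for_variance(ratios))  # 4
```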