

INTERNATIONAL SCHOOL OF
MANAGEMENT AND TECHNOLOGY

FACULTY OF COMPUTING

ASSIGNMENT COVER SHEET


This form is to be completed by students submitting assignments at level 4 and level 5. Students
are required to complete all sections and attach the form to their assignment.

STUDENT DETAILS

STUDENT NAME Kaushlendra Yadav

STUDENT ID Kit22e.05kau@ismt.edu.np

UNIT AND ASSIGNMENT DETAILS

UNIT TITLE Unit 25: Machine Learning

UNIT NUMBER H/618/7438

ASSIGNMENT TITLE

ISSUE DATE July 10, 2024 DUE DATE September 11, 2024

ASSESSOR NAME Rajad Shakya

ESTIMATED WORD LENGTH 5884


SUBMISSION

HAND IN DATE September 11, 2024

DECLARATION AND ACKNOWLEDGEMENT

When submitting assignments, each student must sign a declaration confirming that the work
is their own.

Plagiarism and Collusion


Plagiarism: to use or pass off as one’s own, the writings or ideas of another without
acknowledging or crediting the source from which the ideas are taken.

Collusion: submitting an assignment, project or report completed by another person and passing
it off as one's own.

In accordance with the Academic Integrity and Plagiarism Policy:

1. I declare that:
a) this assignment is entirely my own work, except where I have included fully-
documented references to the work of others,
b) the material contained in this assignment has not previously been submitted for any
other subject at the University or any other educational institution, except as
otherwise permitted,
c) no part of this assignment or product has been submitted by me in another (previous
or current) assessment, except where appropriately referenced, and with prior
permission from the Lecturer / Tutor / Unit Coordinator for this unit.

2. I acknowledge that:
a) if required to do so, I will provide an electronic copy of this assignment to the
assessor;
b) the assessor of this assignment may, for the purpose of assessing this assignment:


I. reproduce this assignment and provide a copy to another member of academic staff;
II. communicate a copy of this assignment to a plagiarism checking service such
as Plagiarism Check (which may then retain a copy of this assignment on its
database for the purpose of future plagiarism checking).

I am aware of and understand that any breaches to the Academic Code of Conduct will be
investigated and sanctioned in accordance with the College Policy.

SIGNATURE DATE

Contents

1. Introduction
2. Machine Learning Overview
3. Features of Machine Learning
4. Examining Learning Problems
4.1 Supervised Learning
4.2 Unsupervised Learning
4.3 Reinforcement Learning
5. Classification of Algorithms for Machine Learning
5.1 Classification by Learning Style
5.2 Classification by Model Type
6. Preparing Data
7. Model Choice
8. Model Application
9. Hyperparameter Tuning
10. Model Assessment
11. Underfitting and Overfitting
12. Interpretation of the Results
13. Model Performance Comparison
14. Prediction on New Data
15. Machine Learning Analytics in the Healthcare System
16. Challenges in Forecasting Patient Readmissions
17. Final Thoughts
18. Upcoming Projects and Enhancements
19. Study Limitations

Activity 2
1. Introduction
2. Data Preparation
2.1 Data Cleaning and Preprocessing
2.2 Encoding Categorical Variables
2.3 Scaling Numerical Features
2.4 Splitting the Data into Training and Test Sets
3. Model Implementation
3.1 Logistic Regression
3.2 Random Forest
3.3 Support Vector Machines (SVM)
4. Model Evaluation
4.1 Confusion Matrix
4.2 Accuracy, Precision, Recall, F1-Score
5. Prediction for New Patients
5.1 Function for Predicting New Patients
6. Evaluation and Discussion of Results
6.1 Balanced, Underfitting, or Overfitting
6.2 Effectiveness of Algorithms
6.3 Strengths and Weaknesses
7. Conclusion

1. Introduction
The healthcare industry faces many issues, particularly around patient readmissions, which strain
expenditure and hospital capacity. By predicting readmissions, healthcare providers have the chance
to intervene early, which is expected to improve patient outcomes and reduce unnecessary
hospitalizations. This project applies machine learning to demographic, medical and lifestyle data to
estimate the probability of patient readmission. The dataset contains 1000 patient records with 19
attributes, including age, gender, history of heart disease and lifestyle factors such as smoking.
Logistic Regression, Decision Tree and Random Forest models will be used to predict readmissions
and thereby support healthcare management.

2. Machine Learning Overview

Machine learning (ML) is the branch of AI that enables computers to learn from data and make
inferences or predictions without being explicitly programmed. In the healthcare sector, machine
learning is used to enhance treatment plans, predict patient outcomes and increase organizational
efficiency. By using machine learning models to predict outcomes such as readmission, practitioners
can manage resources and offer preventive care. The ability to analyse large volumes of data and
identify trends is particularly useful in healthcare, where numerous intertwined factors influence
outcomes.

3. Features of Machine Learning

Several important characteristics make machine learning especially useful for applications in the
healthcare industry:

Learning from Data:

Machine learning models do not need to be hand-programmed; they learn from large amounts of
data. In this project, the model learns from historical patient records in order to predict future
readmissions.

Predictive Modeling:

Machine learning makes it possible to model future events from historical records. The objective of
this project is to estimate the probability that a patient will be readmitted after discharge, so that
health professionals can intervene early with high-risk patients.

Pattern Recognition:

At its core, machine learning excels at finding patterns in large and complex data. This is
particularly valuable in healthcare, where many interrelated factors such as lifestyle and medical
history influence a patient's overall health and may not be easily analysed with traditional
statistical methods.

Continuous Learning:

As new data are fed into these models, their performance improves. A model can be retrained to
adapt to emerging trends and to a better understanding of patient health data as it becomes
available.

4. Examining Learning Problems

Machine learning problems can be grouped into three general categories: supervised learning,
unsupervised learning and reinforcement learning. Each category addresses different kinds of
problems and requires different approaches.

4.1 Supervised Learning
In supervised learning, the algorithm learns from input data that is paired with known outcomes.
To predict whether a patient will be readmitted to hospital, this study trains models on labelled
patient records. Supervised learning is ideal for classification problems (readmitted or not) and
regression problems (for example, estimating the length of a hospital stay).

Typical Uses in Healthcare:

• Predicting the likelihood of a patient's readmission from clinical, demographic and treatment
variables.

• Making medical diagnoses from a patient's physical symptoms and medical images.

• Recommending medications based on a patient's ailments and conditions.

4.2 Unsupervised Learning
Where the data have no labels, unsupervised learning is used to identify patterns or structures. In
the healthcare industry, unsupervised learning can help cluster patients when exploring the causes
of readmissions. However, because this project predicts a labelled outcome (readmission),
unsupervised learning is not used in this research.

Common Uses in Healthcare:

• Segmenting patient populations based on shared traits associated with a particular condition.
• Clustering genes and classifying them into functional categories based on their data.
• Detecting subtle deviations in patients' physiological parameters that may portend future
health issues.

4.3 Reinforcement Learning

In reinforcement learning, an agent interacts with its environment, accumulates decision-making
experience and learns from the outcomes in the form of rewards or penalties. This kind of learning
is not directly applicable to predicting readmission probability, but it can be useful in areas such as
personalized care and treatment management, where continuous interaction between the patient
and the system may produce better outcomes.

Possible Uses in Healthcare:

• Training robots to help with therapeutic tasks or post-surgical recovery.

• Adapting treatment regimens in response to the patient's reaction to medication.

• Creating chatbots that can offer individualized healthcare information.

5. Classification of Algorithms for Machine Learning

Although there is no single definitive taxonomy, machine learning algorithms can be classified by
learning style and by model type. Such a classification helps in selecting the right algorithm for the
problem at hand.

5.1 Classification by Learning Style

Supervised Learning:

Supervised learning models are trained on labelled data and are used for regression and
classification tasks, such as predicting patient readmission. The three algorithms employed in this
project, Logistic Regression, Decision Tree and Random Forest, are all examples of supervised
learning.

Unsupervised Learning:

Unsupervised techniques such as clustering detect patterns in unlabelled data. Unsupervised
learning is not applied in this experiment, since the intention was not to group patients by
similarity.

Reinforcement Learning:

As noted earlier, reinforcement learning may find application in optimizing healthcare treatment
programmes and in sequential decision-making tasks.

5.2 Classification by Model Type

Regression Algorithms: used to predict continuous outcomes. For example, given a patient's
initial symptoms, a regression model could predict how long the patient is likely to stay in
hospital.

Classification Algorithms: used when the target variable is categorical and the goal is to predict
its class label for a patient, as with readmission (Yes/No). This project uses three classification
algorithms: Random Forest, Decision Tree and Logistic Regression.

Clustering Techniques:

These unsupervised learning techniques group data points that share similar patterns. In
medicine, clustering may be used to identify groups of patients with similar health risks.

6. Preparing Data
Preparing the data before feeding it to the machine learning models is one of the most important
steps in ensuring that the models perform as well as possible. The feature list for this project
contains a mix of numerical and categorical features, which required different preprocessing
techniques.

6.1 Handling Missing Data
The results produced by machine learning models depend heavily on the data provided to them,
and gaps in the data can degrade performance. In this project, missing values in categorical
variables were imputed with the most frequent value, and missing values in numerical features
were imputed with the median of each feature. This avoids introducing bias from missing values
while keeping the dataset consistent. A minimal sketch of this step is shown below.
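The report does not reproduce the preprocessing code itself; the snippet below is a minimal sketch of median and most-frequent imputation with scikit-learn, where the file name and column names are placeholders rather than the dataset's real identifiers.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("patients.csv")  # hypothetical file name for the 1000-record dataset

# Hypothetical column names standing in for the dataset's 19 attributes.
numeric_cols = ["age", "bmi"]
categorical_cols = ["gender", "smoking"]

# Median imputation for numeric features, most-frequent-value imputation for categorical ones.
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])
df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols])
```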
6.2 Encoding Categorical Variables and Feature Engineering
Categorical attributes such as gender and exercise habits cannot be used by machine learning
models in their raw form; they must first be converted to numbers. This project used two different
encoding techniques, sketched in the example below:

Label Encoding:

This technique was applied to ordinal data (such as disease severity), where the categories have a
natural ranking.

One-Hot Encoding:

This method is used where the categories have no inherent order, such as gender, which is a
nominal variable.
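Continuing the previous sketch, one way to apply both encodings (column and category names are hypothetical; the explicit mapping is used so the ordinal ranking is preserved):

```python
import pandas as pd

# Label encoding via an explicit mapping, preserving the ordinal ranking.
severity_map = {"Mild": 0, "Moderate": 1, "Severe": 2}  # hypothetical categories
df["disease_severity"] = df["disease_severity"].map(severity_map)

# One-hot encoding for nominal variables such as gender (no inherent order).
df = pd.get_dummies(df, columns=["gender"], drop_first=True)
```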
6.3 Scaling Numerical Features
Numerical characteristics such as age, BMI and alcohol consumption were standardized with
StandardScaler. Scaling prevents features with large value ranges from dominating the algorithms
and ensures that all features have a comparable influence on the model, as illustrated in the sketch
below.
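A minimal sketch of the scaling step, reusing the placeholder column names from the earlier snippets:

```python
from sklearn.preprocessing import StandardScaler

# Standardize numerical features to zero mean and unit variance.
# In practice the scaler is fitted on the training split only and reused on the test split.
numeric_cols = ["age", "bmi", "alcohol_consumption"]  # placeholder names
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
```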

7. Model Choice
A key factor in the project's success is the proper selection of machine learning models. Here,
three algorithms, Logistic Regression, Decision Tree and Random Forest, were selected for
predicting patient readmissions.

7.1 Logistic Regression
Logistic regression is a simple and easily interpretable model, used especially for binary
classification tasks. It estimates the probability that an input, expressed as a linear combination of
its features, belongs to a particular category. Logistic regression is easy to apply, although it is not
effective at identifying complex, non-linear relationships between variables.

7.2 Decision Trees
A decision tree is a non-parametric model that splits the data on feature values to produce
predictions. It is well suited to this task because it is easily interpretable and can capture
non-linear interaction effects among the inputs. However, decision trees may perform poorly on
new data, as they are sensitive to overfitting.

7.3 Random Forest
Random Forest is an ensemble learning technique in which several decision trees are combined to
improve the resulting predictions. Random Forest generalizes better because predictions are
aggregated across many trees, which reduces overfitting. This model was chosen because it can
handle complex data and the probability of a correct prediction is much higher than with a single
Decision Tree.

8. Model Application
Once the data were ready, the machine learning models were implemented using the training
dataset. The implementation covered splitting the data, building the models and assessing the
results they produced, as sketched in the example after Section 8.2.

8.1 Splitting the Data
In this study the dataset was split into a training set and a testing set, with eighty percent of the
data used for training the models and twenty percent reserved for testing. This ensures that the
models are evaluated on unseen data, giving a better estimate of how they would perform in the
real world.

8.2 Training the Models
The training dataset was used to train the Random Forest, Decision Tree and Logistic Regression
models. Based on the input features, each model built its own predictive structure by analysing
the patterns found in the data.
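The training code is not shown in the report; the following sketch shows how the 80/20 split and model training might look with scikit-learn, assuming the preprocessed DataFrame from the earlier snippets and a placeholder target column named "readmitted".

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# X holds the preprocessed features, y the readmission labels (0/1).
X = df.drop(columns=["readmitted"])   # "readmitted" is a placeholder target name
y = df["readmitted"]

# 80% of the records for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```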
8.3 Evaluating the Models
After training, the models were tested on the testing dataset in order to evaluate their
performance. The goal of this test is to check the reliability of the models on unseen data, which
is essential for making practical predictions of patient readmissions.

9. Hyperparameter Tuning
Fine-tuning was then performed to get the best out of the models above. This means adjusting the
settings that govern each model's learning algorithm. For example, for the Random Forest,
GridSearchCV, which searches for the parameter combination giving the best accuracy, was used
to choose the number of trees in the forest and the maximum depth of those trees. A minimal
sketch of such a search is given below.
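A sketch of the grid search described above, with an illustrative parameter grid (the report's actual values are not given):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid over the two parameters mentioned in the text.
param_grid = {
    "n_estimators": [100, 200, 500],   # number of trees in the forest
    "max_depth": [None, 5, 10],        # maximum depth of each tree
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
best_rf = grid.best_estimator_   # tuned model reused in the later sketches
```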

10. Model Assessment

Several metrics were used to assess the performance of the models and to develop a clear
understanding of their strengths and weaknesses.

10.1 Accuracy
Accuracy expresses how well the model performs classification as the ratio of correct predictions
to the total number of predictions made. Accuracy alone can be misleading, however, especially
with skewed data such as readmission rates, where far more patients are not readmitted than
readmitted.

10.2 Precision and Recall
Precision:

Precision measures how many of the patients predicted to be readmitted really are readmitted. A
high precision means a low false positive rate, i.e. the model rarely raises a false alarm.

Recall:

Recall, also called sensitivity, represents the true positive rate: the proportion of readmitted
patients that were correctly predicted. A high recall rate means that the model captures most of
the real cases of readmission. These two quantities are defined formally below.
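Formally, in terms of the confusion-matrix counts (true positives TP, false positives FP, false negatives FN):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
\]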
10.3 Confusion Matrix
As the name suggests, the confusion matrix, which shows true positives, true negatives, false
positives and false negatives, gives a finer-grained view of the model's performance. Visualizing
how each model splits the patient records into readmitted and non-readmitted classes makes the
results easier to interpret. From the confusion matrix perspective, the false negatives were
relatively few for the Random Forest model compared to the Logistic Regression model. A sketch
of how such a matrix can be computed is shown below.
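A sketch of how the matrix and the related per-class report could be produced, assuming the split and the tuned model from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = best_rf.predict(X_test)

# Rows correspond to the true classes, columns to the predicted classes.
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall and F1 in a single report.
print(classification_report(y_test, y_pred, target_names=["Not readmitted", "Readmitted"]))
```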

11. Underfitting and Overfitting

A model is said to be overfitted if it has high accuracy on the training data but low accuracy on
the test data. In that case the model has memorized features and noise specific to the training data
instead of learning the underlying pattern. Low accuracy on both the training and test sets
indicates underfitting: the learned model is too simple to capture the underlying patterns present
in the data.

Decision Tree Overfitting:

The Decision Tree model tends to overfit because, when it is not pruned or otherwise constrained,
the rules it formulates become very complex and specific to the training set.

Overfitting is Reduced by Random Forest:

The Random Forest model mitigates this by combining the results of a number of decision trees,
each trained on a different subset of the data.

Logistic Regression:

Logistic regression may fail to model the relationship between the input features and the outcome
(readmission) when that relationship is non-linear, which leads to underfitting. Regularization
techniques such as L2 regularization penalize large coefficients and so control model complexity
and overfitting; underfitting caused by non-linearity is better addressed with richer features or
more flexible models.

12. Interpretation of the Results

The results of the machine learning models indicate that, of the three, Random Forest performed
best at predicting patient readmissions, ahead of the Logistic Regression and Decision Tree
models. The Random Forest model achieved a fairly balanced precision and recall of around 80%,
indicating that it was reasonably accurate. These results suggest that Random Forest is better at
handling the complex interactions among lifestyle, medical history and patient demographics.

Interpretation of Precision and Recall:

Compared with Logistic Regression, the Random Forest model showed higher precision together
with high recall, meaning it separated actual readmissions from non-readmissions with fewer
false alarms.

Confusion Matrix Analysis:

The confusion matrix for Random Forest shows fewer misclassifications, which means this model
captured the majority of the patients who were actually readmitted.
13. Model Performance Comparison
Comparing the three models, Logistic Regression, Decision Tree and Random Forest, highlighted
the advantages and disadvantages of each.

Logistic Regression:

Although easy to interpret, logistic regression can only accommodate linear relationships between
the variables in a dataset. Because of its weaker recall, it failed to pick up a number of real
readmissions.

Decision Tree:

The decision tree handled the non-linear structure of the data well but, in contrast to Random
Forest, was less accurate on the test set because it overfitted the training set.

Random Forest achieved the highest accuracy and recall. It is the most credible model for
predicting readmissions, as it dealt with the challenges in the data without oversimplifying them.

14. Prediction on New Data
The Random Forest model was then used to predict patient readmissions on new, unseen data that
had not been used to train the model. No additional transformations were needed, because the
same preprocessing steps applied to the original dataset were applied here: handling missing
values, encoding categorical attributes and scaling numerical variables.

Example of a New Patient:

Given a new patient record, the model produced a prognosis of the likelihood that the patient
would be readmitted. The model also provided a probability score to help doctors and other
healthcare practitioners judge the level of risk attached to the patient's readmission, as in the
sketch below.
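The actual record and scoring code are not shown in the report; as an illustration, scoring a single new patient might look like this, assuming the record has already been passed through the same imputation, encoding and scaling steps and that the objects from the earlier sketches exist:

```python
import pandas as pd

# Start from an all-zero row with the same columns as the training data, then
# fill in a few placeholder (already standardized / encoded) values.
new_patient = pd.DataFrame([dict.fromkeys(X_train.columns, 0.0)])
new_patient.loc[0, ["age", "bmi"]] = [0.45, -0.20]   # hypothetical values

risk = best_rf.predict_proba(new_patient)[0, 1]      # probability of the "readmitted" class
print(f"Estimated readmission risk: {risk:.2f}")
```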
Practical Implications:

This predictive capability allows early interventions to be offered to high-risk patients, and
readmissions may be prevented as a result.

15. Machine Learning Analytics in the Healthcare System
By analysing data for patients and healthcare practitioners and by optimizing the performance of
healthcare organisations, machine learning may revolutionise the healthcare sector. Using
machine learning models, hospitals can plan the allocation of resources, identify target patient
populations and design specific treatment regimens based on patients' profiles and risk levels.

Predictive Analytics for Readmission:

Using key clinical, lifestyle and demographic attributes of a patient, the Random Forest model can
predict the patient's likelihood of readmission so that appropriate interventions can be made in
good time.

Operational Benefits:

Readmission predictions help healthcare facilities allocate their resources by identifying high-risk
patients and avoiding unnecessary readmissions.

16. Challenges in Forecasting Patient Readmissions
Despite the encouraging results, predicting patient readmissions presents a number of difficulties:

Unbalanced Data:

As in other studies, the distribution of the data is skewed: most patients are not readmitted, which
limits the prediction of readmission. Because of this, models can struggle to make correct
predictions for the minority class of readmitted patients.

Complexity of Health Data:

Numerous complex factors, including lifestyle, medical history and socioeconomic status, may
influence a patient's outcomes. Sophisticated modelling methodologies are therefore required to
capture the relationships among these components.

Data Security and Privacy:

Using patient data in machine learning raises privacy concerns. In healthcare applications, patient
identity should be masked and data should be stored securely.

17. Final Thoughts

In conclusion, this project used a set of demographic, health and lifestyle variables to build
machine learning models that successfully predict patient readmissions. Random Forest, with an
accuracy of about 80%, was found to outperform the other tested models, Logistic Regression and
Decision Tree. Random Forest is the better model for readmission prediction because of its ability
to handle complex interactions in the data and its resistance to overfitting. Predictive analytics in
the hospital can help triage patients who are about to be discharged or who are likely to be
readmitted, thereby improving patient outcomes and decreasing the cost of healthcare services.

18. Upcoming Projects and Enhancements

A number of changes could be made to broaden the model's use and boost its performance:

Including Extra Features:

Adding more detailed patient information, such as chronic diseases and follow-up treatment
recommendations, should help increase the predictive accuracy of the model.

More Complex Methods:

Moving to more advanced techniques, such as deep learning models or Gradient Boosting
Machines (GBM), is likely to offer better performance.

Continuous Learning:

Learning mechanisms should feed the model with new patient data on a regular basis so that its
forecasts remain relevant over time.

Handling Unbalanced Data:

To reduce the imbalance in the dataset and improve the model's performance on the minority
class (readmitted patients), SMOTE (Synthetic Minority Over-sampling Technique) could be
applied, as sketched below.
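A sketch of how SMOTE could be applied with the imbalanced-learn library (not used in the current project), resampling only the training split and reusing the objects from the earlier sketches:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority (readmitted) class in the training split only,
# so the test set remains an unbiased sample.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Any of the models can then be retrained on the balanced data.
best_rf.fit(X_train_bal, y_train_bal)
```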

19. Study Limitations

A few limitations of the study need to be noted:

Dataset Size:

The project is based on a limited sample of 1000 records, so the model's applicability to larger
and more diverse groups may be restricted.

Feature Selection:

Several other features that could have informed the investigation of readmissions, such as
socioeconomic data and patient follow-up data, were missing from the dataset.

Bias in the Data:

The model's predictions may be skewed by biases in how the dataset was collected and in the
types of patients included. If some categories of the population are over-represented in the
dataset, the model will not generalize well to other categories.

Activity 2

1. Introduction:
Using the dataset provided in filter.csv, machine learning models are implemented in this section
of the project to predict healthcare outcomes. The dataset includes numerous demographic and
health-related characteristics, including past medical history of chronic conditions (heart disease,
diabetes, arthritis), dietary practices and exercise regimens. The objective is to create models that,
using patient data, can forecast the likelihood of specific medical conditions or other outcomes,
such as heart disease.
1.1 This task will address:
• Data preparation and preprocessing.
• Application of three distinct machine learning models: Support Vector Machines (SVM),
Random Forest and Logistic Regression.
• Evaluation of the models with a variety of performance criteria.
• A comparison of the models' advantages and disadvantages.

2. Data Preparation:
To create a machine learning model that is reliable and accurate, careful data preparation is
necessary. In this step we clean the dataset, handle missing values, encode categorical variables
and scale numerical features.

2.1. Data Cleaning and Preprocessing

The dataset includes several columns, such as Checkup, Exercise, General_Health and indicators
of chronic diseases (e.g., Diabetes, Heart_Disease). The first task is to deal with any
discrepancies, missing data or anomalies in these columns.

Missing Values in the filter.csv Dataset:

If some rows contain missing values for features such as BMI or Heart_Disease, we can either
drop the rows in question or impute the missing values, using the mean or median for continuous
data and the mode for categorical data, as in the sketch below.
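A minimal pandas sketch of this choice, using the file name and column names mentioned in the text:

```python
import pandas as pd

df = pd.read_csv("filter.csv")

# Continuous feature: fill gaps with the median.
df["BMI"] = df["BMI"].fillna(df["BMI"].median())

# Categorical feature: fill gaps with the most frequent value (mode).
df["Heart_Disease"] = df["Heart_Disease"].fillna(df["Heart_Disease"].mode()[0])

# Alternatively, the affected rows could simply be dropped:
# df = df.dropna(subset=["BMI", "Heart_Disease"])
```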


2.2. Encoding Categorical Variables

Categorical data must be transformed into a numerical representation before a machine learning
model can be constructed. One-Hot Encoding and Label Encoding are two common encoding
techniques, illustrated in the sketch below.
2.2.1. Label Encoding:
With label encoding, every category is given a distinct integer. For instance, the column Sex can
be transformed into 1 for Male and 0 for Female.
2.2.2. One-Hot Encoding:
One-hot encoding generates binary columns for every category of a multi-class categorical
feature such as General_Health (which includes categories like "Poor," "Good," and "Very Good").
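A sketch of both encodings on the columns named above, continuing from the previous snippet:

```python
import pandas as pd

# Label encoding: map Sex to integers (1 = Male, 0 = Female).
df["Sex"] = df["Sex"].map({"Male": 1, "Female": 0})

# One-hot encoding: one binary column per General_Health category.
df = pd.get_dummies(df, columns=["General_Health"])
```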

2.3. Scaling Numerical Features:

Scaling numerical features such as Height_(cm), Weight_(kg) and BMI is necessary so that every
feature contributes on a comparable scale to the model. Both normalization (scaling features to a
[0, 1] range) and standardization (rescaling to a mean of 0 and standard deviation of 1) are widely
used approaches.


2.4. Splitting the Data into Training and Test Sets:


The training set, which makes up 80% of the dataset, is used to train the models, and the test set,
which makes up 20% of the dataset, is used to assess the models.

3. Model Implementation:
Three distinct machine learning models, Logistic Regression, Random Forest and Support Vector
Machines (SVM), are implemented and compared in this section. Each model has its own
strengths and weaknesses and suits different data types and prediction tasks.

3.1. Logistic Regression:

Logistic regression is a straightforward yet effective model for binary classification tasks. It
forecasts the probability of a categorical outcome, such as the presence or absence of heart
disease. It works by estimating the parameters of a logistic (sigmoid) function, which maps any
real number to a probability between 0 and 1.

Implementation Steps (a minimal sketch follows):
• Set up the Logistic Regression model.
• Train the model on the training data.
• Assess the model on the test data.
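A minimal sketch of these steps, assuming the preprocessed DataFrame from Section 2, a binary 0/1 target (the Heart_Disease column is used here as a placeholder) and the 80/20 split described in Section 2.4:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: preprocessed features, y: binary target, assumed to be already encoded as 0/1.
X = df.drop(columns=["Heart_Disease"])
y = df["Heart_Disease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_reg = LogisticRegression(max_iter=1000)   # 1. set up the model
log_reg.fit(X_train, y_train)                 # 2. train on the training data

y_pred = log_reg.predict(X_test)              # 3. assess on the test data
print("Accuracy:", accuracy_score(y_test, y_pred))
```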


3.2. Random Forest:

Random Forest is an ensemble learning technique that builds several decision trees and combines
them to produce predictions that are more reliable and accurate. It is particularly good at handling
categorical features and missing data.

Implementation Steps (sketched below):

• Initialize the Random Forest classifier.
• Train the model on the training data.
• Examine how well it performs.
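A corresponding sketch for Random Forest, reusing the split from the previous snippet:

```python
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)  # 1. initialize
rf_clf.fit(X_train, y_train)                                        # 2. train
print("Test accuracy:", rf_clf.score(X_test, y_test))               # 3. evaluate
```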


3.3. Support Vector Machines (SVM):

SVM is a powerful classification technique that separates the data into classes by finding the
hyperplane that best divides them. It works particularly well in high-dimensional spaces and when
there are more features than samples.

Implementation Steps (sketched below):

• Set up the SVM model.
• Train the model on the training data.
• Examine how well it performs.
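A corresponding sketch for the SVM; scaled inputs from Section 2.3 are assumed, since SVMs are sensitive to feature scale:

```python
from sklearn.svm import SVC

svm_clf = SVC(kernel="rbf", probability=True, random_state=42)  # 1. set up the model
svm_clf.fit(X_train, y_train)                                   # 2. train
print("Test accuracy:", svm_clf.score(X_test, y_test))          # 3. evaluate
```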

4. Model Evaluation:
Model evaluation is essential for understanding how well the models work on unseen data. To
assess the performance of the three models, we use metrics such as the confusion matrix,
accuracy, precision, recall and F1-score.

4.1. Confusion Matrix:

The confusion matrix gives an overview of the model's correct and incorrect predictions.

4.2. Accuracy, Precision, Recall, F1-Score:

These measurements shed light on how well each model performs (a computation sketch follows
the list):

• Accuracy: the proportion of correctly predicted cases out of all instances.
• Precision: the ratio of true positives to the sum of true positives and false positives.
• Recall: the ratio of true positives to the sum of true positives and false negatives.
• F1-Score: the harmonic mean of precision and recall.
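These quantities can be computed for each fitted model with scikit-learn, as sketched here for the Random Forest predictions (a binary 0/1 target is assumed):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_pred = rf_clf.predict(X_test)   # the same pattern applies to log_reg and svm_clf

print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```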


5. Prediction for New Patients:

Once trained, the models can predict outcomes for new patients from their health data. For
instance, based on characteristics such as age, smoking history and BMI, we can forecast the
likelihood that a new patient will develop heart disease.

5.1 Function for Predicting New Patients:
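The report's own prediction function is not reproduced in the text; the helper below is a hypothetical sketch, assuming the fitted rf_clf from Section 3.2 and a dictionary of already-encoded, already-scaled feature values.

```python
import pandas as pd

def predict_new_patient(model, patient_features, feature_order):
    """Return the predicted probability of the positive class (e.g. heart disease)
    for one patient whose features are already encoded and scaled."""
    # Build a one-row frame and enforce the training column order; any column the
    # caller omits (e.g. unused one-hot categories) is filled with 0.
    row = pd.DataFrame([patient_features]).reindex(columns=feature_order, fill_value=0)
    return model.predict_proba(row)[0, 1]

# Hypothetical usage with placeholder feature names and values:
# risk = predict_new_patient(rf_clf,
#                            {"Age": 0.7, "BMI": 1.2, "Smoking_History": 1},
#                            feature_order=list(X_train.columns))
```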

6. Evaluation and Discussion of Results:


6.1. Balanced, Underfitting, or Overfitting:

• Underfitting occurs when a model is too simple to identify the underlying patterns in the data.
For instance, poor performance of Logistic Regression may point to underfitting.
• Overfitting occurs when a model performs well on training data but poorly on test data. If not
tuned correctly, Random Forest is more likely to overfit.
• Well-balanced: a well-balanced model generalizes well to fresh data. SVM, for instance, might
perform consistently on the training and test sets.

6.2. Effectiveness of Algorithms:

• Logistic Regression: simple, straightforward and useful for binary classification, but it may
miss intricate correlations in the data.
• Random Forest: a robust algorithm that handles missing data and categorical features well,
although it may overfit if not properly tuned.
• SVM: performs well with small datasets and is effective in high-dimensional spaces;
nevertheless, it may be slow on large datasets.

6.3. Strengths and Weaknesses:


Every model has benefits and drawbacks. Random Forest offers greater accuracy for complicated
datasets than Logistic Regression, despite the latter's simplicity and interpretability. Although
SVM provides a robust boundary separation, it might need more processing capacity.

7. Conclusion:
Each of the three models compared here, Logistic Regression, Decision Tree and Random Forest,
offers distinct benefits and drawbacks depending on the characteristics of the data and the
objectives of the research. Logistic regression is a good fit for straightforward, linearly separable
data and produces a probabilistic result that is easy to understand, but it struggles with complex or
nonlinear patterns. Decision trees provide a high degree of interpretability and flexibility in
capturing nonlinear interactions, although they are vulnerable to instability and overfitting.
Random Forest, which averages many decision trees to improve accuracy and robustness, is a
strong tool for complicated datasets, although it has higher computational requirements and is less
interpretable. The best model for a given application depends on the trade-offs between
simplicity, interpretability and predictive power.

