Assignment Cover Sheet - IT Machine Learning
INTERNATIONAL SCHOOL OF
MANAGEMENT AND TECHNOLOGY
FACULTY OF COMPUTING
STUDENT DETAILS
STUDENT ID Kit22e.05kau@ismt.edu.np
ASSIGNMENT TITLE
ISSUE DATE July 10, 2024 DUE DATE September 11, 2024
ESTIMATED WORD LENGTH 5884
SUBMISSION
When submitting assignments, each student must sign a declaration confirming that the work
is their own.
1. I declare that:
a) this assignment is entirely my own work, except where I have included fully-
documented references to the work of others,
b) the material contained in this assignment has not previously been submitted for any
other subject at the University or any other educational institution, except as
otherwise permitted,
c) no part of this assignment or product has been submitted by me in another (previous
or current) assessment, except where appropriately referenced, and with prior
permission from the Lecturer / Tutor / Unit Coordinator for this unit.
2. I acknowledge that:
a) if required to do so, I will provide an electronic copy of this assignment to the
assessor;
b) the assessor of this assignment may, for the purpose of assessing this assignment:
I am aware of and understand that any breaches to the Academic Code of Conduct will be
investigated and sanctioned in accordance with the College Policy.
SIGNATURE DATE
Contents
1. Introduction
4.1 Supervised Learning
4.2 Unsupervised Learning
Reinforcement Learning
6. Preparing Data
7. Model Choice
8. Model Application
Logistic Regression
14. Forecasting Readmissions from New Data
16. Challenges in Forecasting Patient Readmissions
Activity 2
1. Introduction
2. Data Preparation
3. Model Implementation
Implementation Steps
4. Model Evaluation
7. Conclusion
1. Introduction
The healthcare industry faces many issues, particularly around patient readmissions, which strain expenditure and hospital capacity. By being able to predict readmissions, healthcare providers have the chance to change course early, which is expected to improve patient outcomes and reduce unnecessary hospitalizations. This project applies data mining concepts: machine learning algorithms analyse demographic, medical and lifestyle data to determine the probability of patient readmission. The dataset contains 1000 patient records with 19 attributes, including age, gender, history of heart disease and lifestyle factors such as smoking. Logistic Regression, Decision Tree and Random Forest models will be used to predict readmissions and thereby support healthcare management.

In the healthcare sector, machine learning is used to enhance treatment plans, predict patient outcomes and increase organizational efficiency. By using machine learning models to predict important factors such as readmission, practitioners can manage resources and offer preventive care. The ability to analyse large volumes of data and identify trends makes machine learning particularly useful in healthcare, where numerous intertwined factors influence outcomes.
Several factors make machine learning especially useful for the healthcare industry, and these include:
Learning from Data:
Machine learning models learn directly from massive amounts of data rather than from explicitly programmed rules. In this project, for instance, models trained on historical patient data are used to predict future readmissions.
Predictive Modeling:
Machine learning makes it possible to model future events from historical records. The objective of this project is to develop an algorithm that estimates the probability that a patient will be readmitted after discharge, so that health professionals can intervene early for high-risk patients.
Pattern Recognition:
At its simplest, what machine learning does best is find patterns in large and complex data. This is especially valuable in the health sector, where multiple interrelated factors, such as lifestyle and medical history, influence a patient's overall health and are not easily analysed with traditional statistical methods.
Continuous Learning:
As new data is fed into these models, their performance improves. The models can also be retrained to adapt to emerging trends and to an improved understanding of patient health data as it becomes available.
4.1 Supervised Learning
In supervised learning, the algorithm learns from input data paired with known outcomes. To predict whether a patient will be readmitted to hospital, this study trains models on labelled patient records. Supervised learning is well suited to classification problems (readmitted or not) and to regression problems such as estimating the length of a hospital stay.
• A method of making medical diagnoses based on a patient's overall physical symptoms and
medical images.
4.2 Unsupervised Learning
When the data has no labels, unsupervised learning is applied with the purpose of identifying patterns or structures. In healthcare, unsupervised learning can help cluster patients when investigating the causes of readmission. However, because this project predicts a labelled outcome (readmission), unsupervised learning is not used in this research.
• Advancing treatment regimens more quickly in accordance with the patient's response to
medicine.
Supervised Learning:
Supervised learning models are used for most practical tasks, including regression and classification, for instance predicting patient readmission. The three algorithms employed in this project, Random Forest, Decision Tree and Logistic Regression, are all trained with supervised learning.
Unsupervised Learning:
Unsupervised learning techniques such as clustering detect patterns in unlabelled data. They are not applied in this experiment, since the intention was not to group patients by similarity.
Reinforcement Learning:
As noted earlier, reinforcement learning may find application in optimizing healthcare treatment programmes and in sequential decision-making tasks, but it is not used in this project.
5.2 Grouping According to Model Type
Regression algorithms: used to predict continuous outcomes. For example, given a patient's initial symptoms, a regression model could predict how long the patient will stay in hospital.
Classification algorithms: used when the target variable is categorical and a class label must be predicted for each patient, as in the readmission example (Yes/No). This project uses three classification algorithms: Random Forest, Decision Tree and Logistic Regression.
Clustering techniques:
These unsupervised learning techniques group data points that share similar patterns. In medicine, clustering may be used to identify groups of patients with similar health risks.
6. Preparing Data
Preparing the data before feeding it to the machine learning models is perhaps the most important step in ensuring that the models perform as well as possible. The feature list of this project contains a mix of numerical and categorical features, which required different preprocessing techniques.
6.1 Handling Missing Values
The results generated by machine learning models depend heavily on the data provided to them, and gaps in the data can affect model quality. In this project, missing values of categorical variables were imputed with the most frequent value, and missing values of numerical features with the median of each feature. This prevents missing values from introducing bias while keeping the dataset consistent.
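A minimal sketch of this imputation step, assuming a pandas DataFrame and hypothetical column names such as Age, BMI, Gender and Exercise_Habits (the real attribute names may differ):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("patients.csv")  # hypothetical file name

num_cols = ["Age", "BMI"]                 # assumed numerical features
cat_cols = ["Gender", "Exercise_Habits"]  # assumed categorical features

# Median for numerical gaps, most frequent value for categorical gaps
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```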
6.2 Categorical Variable Encoding and Feature Engineering
Categorical features such as gender and exercise habits cannot be used by the machine learning models in their raw form; they must first be converted to numerical form. This project used two different encoding techniques:
Label Encoding:
This technique was applied to ordinal data, in particular disease severity, because its categories are ranked.
One-Hot Encoding:
This method is used where the categories have no inherent order, such as gender, which is a nominal variable.
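A sketch of both encoding steps, continuing from the DataFrame above; the severity labels and column names are assumptions, since the exact categories are not listed in the text:

```python
import pandas as pd

# Ordinal feature: map severity levels explicitly so their ranking is preserved
severity_order = {"Mild": 0, "Moderate": 1, "Severe": 2}   # assumed labels
df["Disease_Severity"] = df["Disease_Severity"].map(severity_order)

# Nominal feature: one-hot encode gender, which has no inherent order
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)
```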
6.3 Numerical Feature Scaling
StandardScaler was applied to the numerical features such as age, BMI and alcohol consumption. Scaling prevents features with large value ranges from dominating the model and ensures that all features contribute on a comparable scale.
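A sketch of this scaling step with assumed column names; in practice the scaler would be fitted on the training split only and then applied to the test split:

```python
from sklearn.preprocessing import StandardScaler

num_cols = ["Age", "BMI", "Alcohol_Consumption"]   # assumed column names
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```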
7. Model Choice
A key factor in the project's success is the proper selection of machine learning models. Three algorithms, Logistic Regression, Decision Tree and Random Forest, were selected for predicting patient readmissions.
7.1 Logistic Regression
Logistic regression is a simple and easily interpretable model, especially for binary classification tasks. It estimates the probability that an input, expressed as a linear combination of its features, belongs to a particular category. Logistic regression is easy to apply, but it is not effective at capturing complex, non-linear relationships between the variables.
7.2 Decision Trees
A decision tree is a non-parametric model that splits the data on feature values to produce predictions. It is well suited to this problem because it is easily interpretable and can capture non-linear interactions between the inputs. However, decision trees may perform worse on new data because they are sensitive to over-fitting.
7.3 Random Forest
Random Forest is an ensemble learning technique in which several decision trees are combined to improve the resulting predictions. It generalizes better because the prediction is formed from many trees at once, which reduces over-fitting. This model was chosen because it can handle complex data and its chance of making a correct prediction is much higher than that of a single Decision Tree.
8. Model Application
Once the data was ready, the machine learning models were trained on the training dataset. The implementation process covered splitting the data, building the models, and assessing the results they generated.
8.1 Splitting the Data
In this study, eighty percent of the data was used for training the models and twenty percent for testing them. The dataset was therefore split into a training set and a testing set. This ensures that the models are evaluated on unseen data, giving a better indication of how they will perform in the real world.
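A sketch of this 80/20 split, assuming the target column is named Readmitted; stratifying keeps the readmitted/non-readmitted ratio similar in both splits:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Readmitted"])   # target column name is an assumption
y = df["Readmitted"]

# 80 % of the records for training, 20 % held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```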
8.2 Training the Models
The training dataset was used to train the Random Forest, Decision Tree and Logistic Regression models. Each model analysed the patterns in the training data and learnt its own mapping from the input features to the readmission outcome.
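A minimal sketch of training all three models on the training split; the hyperparameter values shown are illustrative defaults rather than the project's actual settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Each model learns its own mapping from the input features to readmission
for name, model in models.items():
    model.fit(X_train, y_train)
```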
8.3 Evaluating the Models
After training, the models were evaluated on the testing dataset to assess their performance. The goal of this evaluation is to check how reliable the models are on unseen data, which is essential when trying to make practical predictions of patient readmissions.
9. Adjusting Hyperparameters
Fine-tuning was then carried out to get the most out of the models above. This means adjusting the settings that define each model's learning algorithm. For the Random Forest, for example, GridSearchCV, which selects the parameter combination that achieves the highest accuracy, was used to choose the number of trees in the forest and the maximum depth of those trees.
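A sketch of such a grid search; the parameter values in the grid are illustrative assumptions, not the values actually used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the number of trees and the maximum tree depth
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 5, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                  # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_   # tuned Random Forest, reused below
```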
Precision measures how many of the patients predicted to be readmitted are truly positive cases. A high precision means a low false-positive rate, i.e. the model rarely flags patients who will not actually be readmitted.
Recall:
Sensitivity, also called recall, is the true-positive rate: the proportion of actually readmitted patients that the model correctly identified. A high recall therefore means that the model captures most of the real readmission cases.
10.3 Confusion Matrix
As the name suggests, the confusion matrix, which shows true positives, true negatives, false positives and false negatives, gives a more detailed view of model performance. Visualizing how each model splits the patient records into readmitted and non-readmitted classes makes the results easier to interpret. From the confusion matrix, the false-negative counts were relatively small for the Random Forest model compared to the Logistic Regression model.
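A sketch of how these metrics could be computed for the tuned Random Forest from the grid search above, assuming the readmission label is encoded as 0/1:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_pred = best_rf.predict(X_test)

# Rows are the true classes, columns the predicted classes
print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
```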
The Random Forest model achieves this by combining the results of many decision trees, each learnt on a different subset of the given data.
Logistic Regression:
A linear model such as logistic regression may not capture the relationship between the input features and the outcome (readmission) well when that relationship is non-linear, which leads to underfitting. Regularization techniques such as L2 regularization penalize large coefficients; they mainly help control model complexity and overfitting rather than underfitting.
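For illustration, in scikit-learn the strength of the L2 penalty for logistic regression is controlled through the C parameter (smaller C means a stronger penalty); the value below is only an example:

```python
from sklearn.linear_model import LogisticRegression

# L2 is the default penalty; a smaller C shrinks the coefficients more strongly
log_reg_l2 = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
log_reg_l2.fit(X_train, y_train)
```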
Random Forest, by contrast, is better able to capture the interactions that present themselves between lifestyle, medical history and patient demographics.
The confusion matrix for Random Forest shows fewer misclassifications, which means this model was able to capture the majority of the patients who were actually readmitted.
13. Model Performance Comparison
A comparison of the three models, Logistic Regression, Decision Tree and Random Forest, highlighted the advantages and disadvantages of each.
Logistic Regression:
Easy to interpret, but it can only accommodate linear relationships between the variables in the dataset. Owing to its lower recall, it failed to identify many of the real readmissions.
Decision Tree:
It handled the non-linear structure of the data well, but, in contrast to Random Forest, it was less accurate on the test set because it overfitted the training set.
Random Forest:
In terms of both accuracy and recall, Random Forest achieved the highest values. It is the most credible model for readmission prediction because it dealt with the data-related challenges without oversimplifying.
14. Forecasting Readmissions from New Data
The Random Forest model was then used to predict patient readmissions on new, unseen data that had not been used to train the model. No additional transformations were needed because the same preprocessing techniques used on the initial dataset were applied: handling missing values, encoding categorical attributes, and scaling numerical variables.
Example of New Patient:
After processing a new patient record, the model generated a prediction of the likelihood of that patient being readmitted. The model also provided a probability score that helps doctors and other healthcare practitioners judge the level of risk attached to the patient's readmission.
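A sketch of scoring a single new record with the tuned Random Forest; the feature names and values are hypothetical, and the record must be preprocessed in exactly the same way as the training data:

```python
import pandas as pd

# Hypothetical, already preprocessed new patient record (values after scaling/encoding)
new_patient = pd.DataFrame([{
    "Age": 0.8, "BMI": 1.1, "Disease_Severity": 2, "Gender_Male": 1,
    # ... remaining preprocessed features would follow here
}]).reindex(columns=X_train.columns, fill_value=0)

prob = best_rf.predict_proba(new_patient)[0, 1]   # probability of readmission
print(f"Estimated readmission probability: {prob:.2f}")
```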
Practical Implications:
This predictive capacity enables early interventions for high-risk patients and may prevent some readmissions.
15. Application of Analytical Methods in the Context of Machine Learning in the Healthcare System
By analysing data for patients and healthcare practitioners and by optimizing the performance of healthcare organisations, machine learning may revolutionise the healthcare sector. Using machine learning models, hospitals will be able to plan the allocation of resources, identify target patient populations, and design specific treatment regimens based on patients' profiles and risk levels.
Predictive analytics in readmission:
Using key clinical, lifestyle and demographic attributes of a patient, the Random Forest model, for instance, can predict that patient's likelihood of readmission so that suitable interventions can be made in good time.
Operational Benefits:
Readmission predictions help healthcare facilities allocate their resources by identifying high-risk patients and by avoiding some unnecessary readmissions.
16. Challenges in Forecasting Patient Readmissions
Despite the encouraging results, predicting patient readmissions presents a number of difficulties:
Unbalanced Data:
As in other studies, the distribution of the data is skewed: most patients are not readmitted, which limits the prediction of patient readmission. Because of this, models can struggle to make correct predictions for the minority class of readmitted patients.
Complexity of Health Data:
Numerous complex factors, including lifestyle, medical history and socioeconomic status, may influence a patient's outcomes. Sophisticated modelling methodologies are therefore required to capture the relationships among these components.
Data Security and Privacy:
Using patient data for machine learning raises privacy concerns. In healthcare applications, patient identities should be masked and the data should be stored securely.
Adding more detailed patient information, such as chronic diseases and follow-up treatment recommendations, could help increase the predictive accuracy of the model.
More Complex Methods:
A shift to more advanced techniques such as deep learning models or Gradient Boosting Machines (GBM) is likely to offer better performance.
Continuous Learning:
There should be learning mechanisms that feed the model with new patient data on a regular basis so that the forecasts remain relevant over time.
Handling Unbalanced Data:
To reduce the imbalance of the dataset and to improve the model's performance on the minority class, that is, the readmitted patients, SMOTE (Synthetic Minority Over-sampling Technique) could be applied, as sketched below.
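A sketch of this approach, assuming the imbalanced-learn package is available; oversampling is applied to the training split only, never to the test split:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Synthesize extra minority-class (readmitted) samples in the training data
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf_balanced = RandomForestClassifier(random_state=42).fit(X_res, y_res)
```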
Because the project is based on a limited sample pool of 1000 records, the model's conditions and applicability may also be restricted for larger and more diverse patient groups.
Feature Selection:
Several other features that could have helped investigate readmissions, such as socioeconomic data and patient follow-up data, were missing from the dataset.
Bias in the Data:
The model's predictions may be skewed by biases in how the dataset was collected and in the types of patients included. If some population groups are over-represented in the dataset, the model will not generalize well to other groups.
Activity 2
1. Introduction:
Using the dataset provided in filter.csv, machine learning models are implemented in this
section of the project to predict healthcare outcomes. Numerous demographic and health-
related characteristics are included in this dataset, including past medical history of
chronic conditions (heart disease, diabetes, arthritis), dietary practices, and exercise
regimens. The objective is to create models that, using patient data, can forecast the
chance of specific medical illnesses or other outcomes, like heart disease.
1.1 This task will address:
Data preparation and preprocessing.
Application of three distinct machine learning models: Support Vector Machines (SVM),
Random Forest, and Logistic Regression.
A variety of performance criteria are used to evaluate the models.
A comparison of the models' advantages and disadvantages.
2. Data Preparation:
To create a machine learning model that is reliable and accurate, data preparation is necessary.
We handle missing values, scale numerical features, encode categorical variables, and clean the
dataset in this step.
If some rows contain missing values for features such as BMI or Heart_Disease, we can either drop those rows or estimate the missing values, using the mean or median for continuous features and the mode for categorical features, as in the sketch below.
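A hedged sketch of this preparation step for filter.csv; the target column name (Heart_Disease) and the automatic column-type detection are assumptions that may need adjusting for the real file:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("filter.csv")

target = "Heart_Disease"   # assumed prediction target

num_cols = df.drop(columns=[target]).select_dtypes(include="number").columns
cat_cols = df.drop(columns=[target]).select_dtypes(exclude="number").columns

# Median for continuous gaps (e.g. BMI), mode for categorical gaps
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# One-hot encode the categorical feature columns for the models used later
df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)
```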
3. Model Implementation:
Three distinct machine learning models, Logistic Regression, Random Forest, and Support Vector Machines (SVM), will be used and compared in this research. Since each model has different strengths and weaknesses, each is suited to different data types and prediction tasks.
Implementation Steps:
Set up the Logistic Regression model.
Train the model using the training data.
Assess the model using the test data (a sketch of these steps follows below).
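A hedged sketch of these three steps, continuing from the prepared DataFrame above and assuming Heart_Disease is the prediction target:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop(columns=["Heart_Disease"])   # assumed target column
y = df["Heart_Disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=1000)   # 1. set up the model
log_reg.fit(X_train, y_train)                 # 2. train on the training data
print("Accuracy:", accuracy_score(y_test, log_reg.predict(X_test)))  # 3. assess
```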
4. Model Evaluation:
Model evaluation is essential to understand how well the models work with unseen data. To assess the performance of the three models, we will use metrics such as the confusion matrix, accuracy, precision, recall, and F1-score.
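A sketch of computing these metrics for all three models on the test split, reusing the log_reg model and the data split from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

fitted_models = {
    "Logistic Regression": log_reg,
    "Random Forest": RandomForestClassifier(random_state=42).fit(X_train, y_train),
    "SVM": SVC(random_state=42).fit(X_train, y_train),
}

for name, model in fitted_models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    # classification_report covers accuracy, precision, recall and F1-score
    print(classification_report(y_test, y_pred))
```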
Random Forest: If not properly adjusted, this robust algorithm may overfit and struggle to
handle missing data and categorical features.
SVM: Performs well with small datasets and is effective in high-dimensional spaces;
nevertheless, it may be sluggish in large datasets.
7. Conclusion:
Based on the characteristics of the data and the objectives of the research, each of the three models in this comparison, Logistic Regression, Decision Tree, and Random Forest, offers unique benefits and drawbacks. Straightforward, linearly separable data are a good fit for Logistic Regression, which produces a probabilistic result that is easy to understand but struggles with complex or nonlinear patterns. Decision Trees provide a high degree of interpretability and flexibility in capturing nonlinear interactions, although they are vulnerable to instability and overfitting. Random Forest, while less interpretable and more computationally demanding, is a strong tool for complicated datasets since it averages many decision trees to improve accuracy and robustness. The best model for a given application depends on the trade-offs between simplicity, interpretability, and predictive power.