DEPARTMENT OF
ELECTRICAL AND ELECTRONICS ENGINEERING
2020-2024
Approved by AICTE,
NH-5, Bypass Road, Gudur – 524101.
Tirupati (DT), Andhra Pradesh.
CERTIFICATE
This is to certify that the full semester internship project report entitled "MACHINE LEARNING USING PYTHON" is being submitted by KRISTIPATI BHAVYA NANDINI (20G21A0239) in partial fulfilment of the requirements for the award of Bachelor of Technology during the year 2021-2024. The results embodied in this project report have not been submitted to any other university or institution for the award of any degree.
I, KRISTIPATI BHAVYA NANDINI, REGD. NO: 20G21A0239, hereby declare that the internship report work entitled "Machine Learning Using Python" was done by me under the esteemed guidance of Mr. J. Suresh, M.Tech., (Ph.D.), Associate Professor & HOD, Department of Electrical and Electronics Engineering, and Mr. R. Vikas Reddy, Managing Partner, Technotran. The Full Semester Internship report is submitted in partial fulfillment of the requirements for the award of the Bachelor of Technology in Electrical and Electronics Engineering. This work is independent, and any help taken from other people has been mentioned in the acknowledgement. No part of this report, nor the report as a whole, has been submitted to any other university or academic institution.
Date:
Place:
(20G21A0239)
ACKNOWLEDGEMENT
The satisfaction and elation that accompany the successful completion of any task would be incomplete without mention of the people who made it possible. It is my great privilege to express my gratitude and respect to all those who guided and inspired me. The Full Semester Internship has been a period of various challenges that led to a great deal of learning and professional growth. Making it through would not have been possible without the help and support of family and friends.
First and foremost, I would like to extend my heartfelt gratitude to AICTE, India for providing the platform through which I found an internship opportunity at Technotran.
I would like to express my deep and sincere thanks to Technotran for giving me the opportunity to do an internship within the organization.
I express my sincere gratitude and thanks to our honorable Chairman Dr. VANKI PENCHALAIAH, M.A., M.L., Ph.D., for providing facilities and necessary encouragement during the Full Semester Internship Program. I am highly indebted to Director Dr. A. MOHAN BABU, Ph.D., and Principal Prof. K. DHANUNJAYA, M.Tech., (Ph.D.), for the facilities provided to accomplish this internship.
I would like to thank my Head of the Department and my project guide Prof. J. SURESH, M.Tech., (Ph.D.), for his constant support and guidance throughout my internship.
I would like to thank Mr. G. Ratnaiah, M.Tech., (Ph.D.), who, as Internship Coordinator of the Department of EEE, supported me in securing and completing the internship at the above organization.
I would like to convey my heartfelt gratitude to my external supervisor and mentor, Mr. R. VIKAS REDDY, for having accepted me as a Full Semester Internship student and for providing unconditional support and guidance regarding all aspects of the internship and my career.
I also would like to thank all the people who worked along with me at Technotran, Nellore; with their patience and openness, they created an enjoyable, learning-oriented ambience online. It is indeed with a great sense of pleasure and immense gratitude that I acknowledge the help of these individuals. I am extremely grateful to my department staff members and friends who helped me in the successful completion of this internship.
(20G21A0239)
PROFILE OF THE COMPANY
Technotran (an ISO 9001:2015 certified company) was founded in 2013 by R. Vikas Reddy and specializes in Embedded Systems, Robotics, IoT, and AI. Technotran is based in Nellore, Andhra Pradesh, with a business unit functioning from Hyderabad, and imparts training programs across India to its client universities and colleges to cater to the needs of various sections of students from engineering backgrounds.
Technotran has conducted more than 100 workshops on various technologies and trained over 10,000 students in 100+ colleges across India. Apart from that, Technotran also offers customized robotic chassis design, electronic circuit design, PCB design, Embedded Systems, IoT, Artificial Intelligence, and Machine Learning through online training. Technotran serves universities, colleges, and schools nationwide, delivering tailored services from DIY robotic kit design to electronic product development and prototyping, all aimed at empowering students and institutions with cutting-edge technology solutions.
Technotran's mission is to provide accessible and exceptional technology education that sparks curiosity, fosters creativity, and cultivates essential skills for the future. We are committed to empowering learners of all ages, equipping them with the knowledge and confidence to navigate and contribute to a rapidly evolving technological landscape. Through hands-on experiences and engaging programs, we strive to inspire a lifelong love for learning and innovation, preparing individuals to thrive in an ever-changing world.
INTERNSHIP CERTIFICATE
ABSTRACT
Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lie the tools and methods that are driving it, from processing the massive piles of data generated each day to learning from that data and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. Machine learning is concerned with developing algorithms that learn from experience, build models of the environment from the acquired knowledge, and use these models for prediction. Machine learning is usually taught as a collection of methods that can solve a variety of problems. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.
Accurately forecasting the spread of COVID-19 is crucial for effective public health response and resource
management. In this study, we employed machine learning techniques, specifically Support Vector Machines, Decision
Trees, and Long Short-Term Memory (LSTM) networks, to model and predict the transmission of COVID-19. Using
Python-based tools like scikit-learn and TensorFlow, we analyzed epidemiological and demographic data to understand
and predict the pandemic's dynamics. Among the tested algorithms, the LSTM network stood out for its superior
performance in handling time-series data, providing the most accurate forecasts for new case numbers and potential
hotspots. The findings of this study assist in optimizing health interventions and planning. Future efforts will focus on
incorporating more detailed datasets, such as mobility patterns and viral genetics, to refine our predictive capabilities
further.
INDEX
1 Introduction
8 Testing
9 Result
10 Conclusion
11 Reference
1. INTRODUCTION
The novel coronavirus (COVID-19) pandemic, identified in late 2019, quickly escalated into a
global health crisis, affecting millions of people worldwide. The rapid spread and the severe impact of
the virus necessitated the urgent need for effective tools and strategies to diagnose and manage the
disease. In many regions, healthcare systems were overwhelmed, with facilities struggling to manage
both testing demands and patient care. In this critical context, the ability to predict potential COVID-19
infections swiftly using early clinical symptoms can significantly streamline the process of testing and
treatment, thereby enhancing the overall management of the pandemic.
Machine Learning (ML), as a field within Artificial Intelligence (AI), has demonstrated substantial
promise in addressing complex problems across various domains, including healthcare. By leveraging
data-driven models, ML provides an opportunity to enhance diagnostic accuracy and efficiency.
Specifically, predictive models can analyze clinical data in real-time to estimate the likelihood of a
patient having COVID-19, thus aiding in early intervention and better resource allocation.
Machine learning (ML) is the study of computer algorithms that improve automatically through
experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms
build a model based on sample data, known as "training data", in order to make predictions or decisions
without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of
applications, such as in medicine, email filtering, and computer vision, where it is difficult or unfeasible to
develop conventional algorithms to perform the needed tasks.
A subset of machine learning is closely related to computational statistics, which focuses on making
predictions using computers; but not all machine learning is statistical learning. The study of mathematical
optimization delivers methods, theory and application domains to the field of machine learning. Data mining
is a related field of study, focusing on exploratory data analysis through unsupervised learning.
In its application across business problems, machine learning is also referred to as predictive analytics.
Machine learning plays an important role in cybersecurity and online fraud detection. Because of growing monetary online fraud, companies like PayPal have started using machine learning techniques for protection against money laundering. The prediction problem of the model for fraud detection can be divided into two types: classification and regression. Some of the most used machine learning approaches for this type of prediction problem are Logistic Regression, Decision Trees, Random Forests, and Neural Networks.
Modern-day machine learning has two objectives: one is to classify data based on models which have been developed; the other is to make predictions for future outcomes based on these models. A hypothetical algorithm specific to classifying data may use computer vision of moles coupled with supervised learning to train it to classify cancerous moles, whereas a machine learning algorithm for stock trading may inform the trader of future potential predictions.
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies that one would have ever come across. As is evident from the name, it gives the computer the quality that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
Machine learning involves computers discovering how they can perform tasks without being explicitly
programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to
execute all steps required to solve the problem at hand; on the computer's part, no learning is needed.
The machine learning field is continuously evolving, and along with evolution comes a rise in demand and importance. There is one crucial reason why data scientists need machine learning: high-value predictions that can guide better decisions and smart actions in real time without human intervention. Machine learning as a technology helps analyze large chunks of data, easing the tasks of data scientists in an automated process, and is gaining a lot of prominence and recognition. Machine learning has changed the way data extraction and interpretation work by involving automatic sets of generic methods that have replaced traditional statistical techniques.
For more advanced tasks, it can be challenging for a human to manually create the needed algorithms.
In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than
having human programmers specify every needed step.
Figure 1.1
Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system:
Supervised learning:
Supervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs. The data is known as training data and consists of a set of training examples. Each training example has one or more inputs and the desired output, also known as a supervisory signal. In the mathematical model, each training example is represented by an array or vector, sometimes called a feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output associated with new inputs. An optimal function will allow the algorithm to correctly determine the output for inputs that were not a part of the training data. An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task. Supervised learning algorithms fall into two main types: classification and regression. Classification algorithms are used when the outputs are restricted to a limited set of values, and regression algorithms are used when the outputs may have any numerical value within a range. As an example, for a classification algorithm that filters emails, the input would be an incoming email, and the output would be the name of the folder in which to file the email.
Similarity learning is an area of supervised machine learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification.
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data that is well labelled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labelled data.
Supervised learning is where there are input variables (X) and an output variable (Y), and an algorithm is used to learn the mapping function from the input to the output:
Y = f(X)
The goal is to approximate the mapping function so well that, when there is new input data (x), the output variable (Y) for that data can be predicted easily.
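To make the mapping Y = f(X) concrete, the following is a minimal sketch using scikit-learn (a library introduced later in this report); the tiny symptom-style dataset is invented purely for illustration.
# Minimal supervised-learning sketch: approximate Y = f(X) from labelled examples.
# The tiny symptom-style dataset below is invented for illustration only.
from sklearn.linear_model import LogisticRegression

# Inputs X: feature vectors [fever, cough, cold] (1 = present, 0 = absent).
X_train = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
# Desired outputs Y (the supervisory signal): 1 = infected, 0 = not infected.
y_train = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()        # learns an approximation of f
model.fit(X_train, y_train)         # iterative optimization of an objective function

# Predict the output for an input that was not part of the training data.
print(model.predict([[1, 0, 1]]))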
Figure 1.2
Unsupervised learning:
Unsupervised learning algorithms take a set of data that contains only inputs and find structure in the data, like grouping or clustering of data points. The algorithms therefore learn from test data that has not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. A central application of unsupervised learning is in the field of density estimation in statistics, such as finding the probability density function, though unsupervised learning encompasses other domains involving summarizing and explaining data features.
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data. Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore, the machine is restricted to finding the hidden structure in unlabeled data by itself.
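As an illustration, the short sketch below, assuming scikit-learn, clusters a handful of unlabeled points; the numbers are invented.
# Unsupervised-learning sketch: grouping unlabeled inputs by similarity.
# No desired outputs are supplied; structure is found in the data alone.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [8.0, 9.0], [7.8, 9.2], [8.1, 8.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster assignment for each point

print(labels)                       # two groups discovered without any labels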
Figure 1.3
Semi-supervised Learning:
Semi-supervised learning falls between unsupervised learning (without any labeled training data) and
supervised learning (with completely labeled training data). Some of the training examples are missing
training labels, yet many machine-learning researchers have found that unlabeled data, when used in
conjunction with a small amount of labeled data, can produce a considerable improvement in learning
accuracy.
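A minimal sketch of this idea, assuming scikit-learn's SelfTrainingClassifier (one of several semi-supervised approaches, not necessarily the one a given study would use); unlabeled examples are marked with -1.
# Semi-supervised sketch: a few labeled examples plus unlabeled ones (label -1).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.0], [0.2], [0.9], [1.1], [0.1], [1.0]])
y = np.array([0, 0, 1, 1, -1, -1])    # -1 marks a missing training label

# Confidently pseudo-labeled points are added to the training set and reused.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y)
print(model.predict([[0.05], [0.95]]))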
Reinforcement learning:
Reinforcement learning is an area of machine learning concerned with how software agents ought to
take actions in an environment so as to maximize some notion of cumulative reward. Due to its generality,
the field is studied in many other disciplines, such as game theory, control theory, operations research,
information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and
genetic algorithms. In machine learning, the environment is typically represented as a Markov decision
process (MDP). Many reinforcement learning algorithms use dynamic programming techniques.
Reinforcement learning algorithms do not assume knowledge of an exact mathematical model of the MDP,
and are used when exact models are infeasible. Reinforcement learning algorithms are used in autonomous
vehicles or in learning to play a game against a human opponent.
A computer program interacts with a dynamic environment in which it must achieve a certain goal. As it navigates its problem space, the program is provided feedback analogous to rewards, which it tries to maximize.
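As a toy illustration of reward-driven learning (not part of this project), the sketch below runs tabular Q-learning on a five-state corridor where only the rightmost state pays a reward.
# Reinforcement-learning sketch: tabular Q-learning on a toy 5-state corridor.
# The agent starts at state 0; reaching state 4 yields reward 1, otherwise 0.
import numpy as np

n_states, n_actions = 5, 2                    # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1         # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != 4:                             # episode ends at the goal state
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0
        # Move Q(s, a) toward the reward plus the discounted best future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))                       # learned best action per state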
Figure 1.4
Feature Learning:
Several learning algorithms aim at discovering better representations of the inputs provided during
training. Classic examples include principal components analysis and cluster analysis. Feature learning
algorithms, also called representation learning algorithms, often attempt to preserve the information in their
input but also transform it in a way that makes it useful, often as a pre-processing step before performing
classification or prediction. This technique allows reconstruction of the inputs coming from the unknown
data-generating distribution, while not being necessarily faithful to configurations that are implausible under
that distribution. This replaces manual feature engineering, and allows a machine to both learn the features
and use them to perform a specific task.
Feature learning can be either supervised or unsupervised. In supervised feature learning, features are
learned using labeled input data. Examples include artificial neural networks, multilayer perceptrons, and
supervised dictionary learning. In unsupervised feature learning, features are learned with unlabeled input
data. Examples include dictionary learning, independent component analysis, autoencoders, matrix
factorization and various forms of clustering.
Manifold learning algorithms attempt to learn such representations under the constraint that the learned representation is low-
dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation
is sparse, meaning that the mathematical model has many zeros. Multilinear subspace learning algorithms
aim to learn low-dimensional representations directly from tensor representations for multidimensional data,
without reshaping them into higher-dimensional vectors. Deep learning algorithms discover multiple levels
of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or
generating) lower-level features. It has been argued that an intelligent machine is one that learns a
representation that disentangles the underlying factors of variation that explain the observed data.
Feature learning is motivated by the fact that machine learning tasks such as classification often require
input that is mathematically and computationally convenient to process. However, real-world data such as
images, video, and sensory data has not yielded to attempts to algorithmically define specific features. An
alternative is to discover such features or representations through examination, without relying on explicit
algorithms.
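A short sketch of feature learning used as pre-processing, assuming scikit-learn's PCA (one of the classic examples named above) on synthetic data:
# Feature-learning sketch: PCA finds a lower-dimensional representation
# that preserves most of the information (variance) in the inputs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples with 10 raw features

pca = PCA(n_components=3)             # learn a 3-dimensional representation
X_reduced = pca.fit_transform(X)      # usable as input to a later classifier

print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # share of variance kept per component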
Data: The foundational element in ML. Data can be in various forms such as images, text, numbers, or even
video. The quality and quantity of data significantly influence the performance of an ML model.
Algorithms: These are the methods used to learn from data. Algorithms can be as simple as linear regression or as complex as deep neural networks.
Model: A model is what an algorithm builds from the data. After training, a model can make predictions or
decisions based on new data.
Training: The process by which an ML algorithm learns from data. This usually involves feeding the
algorithm a large amount of data and allowing it to adjust itself to make accurate predictions.
Inference: The phase, after training, in which the model makes predictions on new data. The efficiency and accuracy of inference can be critical in many applications.
Machine Learning is widely used across many fields including medicine, finance, education, and more:
Healthcare: From disease prediction and diagnosis to robotic surgeries, ML is revolutionizing how medical
care is delivered.
Problem Background
Since the outbreak of COVID-19, the world has faced unprecedented health challenges. Effective
management of the disease relies heavily on the ability to quickly and accurately diagnose potential cases.
The symptoms of COVID-19, such as cough, fever, and cold, are common to many other illnesses, making
it difficult to diagnose based solely on initial clinical assessments.
Problem Definition
The main challenge addressed by this project is to develop a predictive model using machine learning
that can estimate the probability of COVID-19 infection based on clinical symptoms. This model aims to
support healthcare systems by providing a tool that can:
• Enhance the screening process, thereby reducing the burden on testing facilities.
• Identify high-risk patients early, allowing for timely medical intervention.
• Allocate medical resources more efficiently by prioritizing individuals based on the likelihood of infection.
The primary aim of this project is to develop and validate a Machine Learning (ML) model capable of
predicting the probability of COVID-19 infection based on commonly observed clinical symptoms such as
cough, fever, and cold. This ML model will serve as an invaluable tool for healthcare providers to make
informed decisions quickly, thereby optimizing patient outcomes and resource management during
pandemic situations.
• To develop a reliable model that can predict the likelihood of COVID-19 infection from clinical
symptoms.
• To validate the model using real-world data, ensuring its accuracy and efficacy.
• To deploy the model as a practical tool for healthcare providers to use in preliminary assessments of potential COVID-19 cases.
The "Corona Virus Infection Probability Using Machine Learning" project is designed to develop a
predictive tool using machine learning to estimate the likelihood of COVID-19 infection based on clinical
symptoms. Utilizing Python along with key libraries such as Pandas, Scikit-learn, and Matplotlib/Seaborn,
the project will analyze a large dataset of patient information within the Jupyter Notebook environment. The
focus is to determine the most effective machine learning model for accurately predicting infection
probabilities. Once developed, this model will be integrated into existing healthcare IT systems, aiding
healthcare professionals in efficiently assessing potential COVID-19 cases for better testing and resource
management. Starting with specific regional data, the project is structured to be scalable and adaptable,
potentially expanding its reach to a broader demographic and geographic audience as it evolves. This
initiative aims to enhance current healthcare responses to the pandemic by providing a reliable and
actionable tool for early diagnosis and management.
2. LITERATURE SURVEY
2.1 Introduction
The rapid spread of COVID-19 and its significant impact on global health have spurred researchers to
investigate how AI and ML can be harnessed to combat this pandemic. Several studies have focused on
using machine learning for predicting infection risks, diagnosing the disease from medical imaging, and
forecasting outbreak dynamics.
Research such as that conducted by [Author(s), Year] employs logistic regression and support vector
machines to predict the likelihood of COVID-19 infection based on symptoms and travel history, providing
a useful yet basic approach for initial screening.
Studies like those by [Author(s), Year] have successfully applied convolutional neural networks (CNNs) to distinguish COVID-19 cases from other types of pneumonia using chest X-rays and CT scans, showcasing high levels of accuracy.
Epidemiological Forecasting:
Comprehensive models integrating machine learning with statistical techniques have been developed to predict the spread of the virus, as seen in work by [Author(s), Year]. These models are crucial for planning and resource allocation in pandemic responses.
ML is used in various fields, including medicine, to predict disease and forecast its outcome. In medicine, the right diagnosis at the right time is the key to successful treatment; if treatment has a high error rate, it may cause several deaths. Therefore, researchers have started using artificial intelligence applications for medical treatment. The task is complicated because the researchers have to choose the right tool: it is a matter of life or death.
For this task, ML achieved a milestone in the field of health care. ML techniques are used to interpret and analyze large datasets and predict their outputs. These ML tools were used to identify the symptoms of disease and classify samples into treatment groups. ML helps hospitals maintain administrative processes and treat infectious diseases.
ML techniques have previously been used to treat cancer, pneumonia, diabetes, Parkinson's disease, arthritis, neuromuscular disorders, and many more diseases; they give more than 90% accurate results in prediction and forecasting.
The pandemic disease known as COVID-19 is a deadly virus that has cost the lives of many people all over the world. There is no treatment for this virus. ML techniques have been used to predict whether patients are infected by the virus based on symptoms defined by the WHO and CDC.
ML is also used to diagnose the disease based on X-ray images. For instance, chest images of patients can be used to detect whether a patient is infected with COVID-19. Moreover, social distancing can be monitored by ML; with the help of this approach, we can keep ourselves safe from COVID-19.
According to the results obtained from the Systematic Literature Review (SLR), RQ1 could not be answered thoroughly. In many works, comparisons between various machine learning algorithms were conducted, but no clear conclusion was reached; hence, a comparison model was suggested.
Considering the results from a set of literature, a particular set of algorithms, namely Support Vector Machines (SVM), Artificial Neural Networks (ANNs), and Random Forests (RF), was chosen for an experimental evaluation to select the most suitable algorithm to predict COVID-19.
For the "Corona Virus Infection Probability Using Machine Learning" project, a variety of technologies
are employed to handle data processing, machine learning modeling, visualization, and deployment. Here's
a breakdown of the key technologies and tools used in the project:
Python: Python is the primary programming language for this project due to its extensive support
for data analysis and machine learning through various libraries. It is favored for its readability,
simplicity, and the powerful ecosystem of data science libraries.
Jupyter Notebook: Jupyter Notebook is used as the interactive computational environment where
coding, visualization, and documentation are combined. It allows for an iterative approach to coding
and is particularly useful for data exploration and presentation.
Pandas: Pandas is a Python library used for data manipulation and analysis. In this project, Pandas
is crucial for data cleaning, transformation, and preparation tasks. It provides data structures and
operations for manipulating numerical tables and time series.
NumPy: NumPy is another essential Python library used for numerical computing. It's particularly
useful for performing operations on arrays and matrices, which are frequently used in data
preprocessing and feature engineering in machine learning projects.
Scikit-learn: Scikit-learn is a Python library for implementing machine learning algorithms. It
provides a range of supervised and unsupervised learning algorithms via a consistent interface. This
project uses Scikit-learn for building, training, and evaluating the machine learning models
including logistic regression, decision trees, and random forests.
Matplotlib and Seaborn: Both Matplotlib and Seaborn are visualization libraries in Python.
Matplotlib provides a wide range of plotting functions to create standard statistical plots, while
Seaborn extends Matplotlib with more sophisticated visualization patterns, making it easier to create
complex visualizations from data. These tools are used extensively to visualize data distributions
and the results of analyses.
OpenCV: For projects that might also involve image data as part of the diagnostic process (not the primary focus of this project, but possible in expanded scopes), OpenCV can be used for image processing tasks that help enhance model inputs for better prediction.
Git: Git is used for version control, allowing multiple contributors to work on the code
simultaneously without conflicts, and enabling effective tracking of changes and project evolution.
Docker/Container Technology: For deployment, Docker can be used to containerize the
application, ensuring that the project runs consistently across different computing environments.
This technology simplifies deployment by packaging the application and its environment into a
single container that can be executed anywhere Docker is available.
Cloud Services: Depending on the need for scalability, cloud services (like AWS, Azure, or Google
Cloud) can be utilized to host the data, models, and application. These platforms provide robust,
scalable environments for deploying machine learning models and applications.
Together, these technologies form a comprehensive toolkit that supports the entire lifecycle of the
project—from data handling and model development to visualization and deployment—ensuring that the
project is robust, scalable, and adaptable to the needs of different users and environments.
Figure 2.1
3. PROPOSED METHODOLOGY
Project Description:
Data Collection and Preprocessing: Gather the dataset containing patient records with clinical symptoms
(e.g., cough, fever, cold) and corresponding COVID-19 test results. Perform data preprocessing tasks such
as handling missing values, encoding categorical variables, and scaling numerical features.
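The sketch below, assuming scikit-learn and hypothetical column names, combines the three preprocessing tasks just listed in a single ColumnTransformer:
# Preprocessing sketch: imputation, categorical encoding, and numerical scaling.
# The toy frame and its column names are assumptions for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 60, None, 41],
    "gender": ["F", "M", "F", None],
    "fever": [1, 0, 1, 1],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["age", "fever"]),
                          ("cat", categorical, ["gender"])])
X = prep.fit_transform(df)            # clean numeric matrix ready for training
print(X.shape)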
Exploratory Data Analysis (EDA): Conduct exploratory data analysis to gain insights into the distribution
of symptoms, the prevalence of COVID-19 cases, and potential correlations between symptoms and
infection status. Visualize key relationships using plots, histograms, and other statistical summaries to
understand the underlying patterns in the data.
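A brief EDA sketch follows; the tiny inline frame stands in for the project's dataset, and its column names are placeholders.
# EDA sketch: summary statistics and simple plots of symptom distributions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "fever":  [1, 0, 1, 1, 0, 1, 0, 0],
    "cough":  [1, 1, 0, 1, 0, 1, 1, 0],
    "result": [1, 0, 1, 1, 0, 1, 0, 0],   # 1 = positive COVID-19 test
})

print(df.describe())                              # distributions at a glance
print(df["result"].value_counts(normalize=True))  # prevalence of positives

sns.countplot(data=df, x="fever", hue="result")   # symptom vs. infection status
plt.show()
sns.heatmap(df.corr(), annot=True)                # pairwise correlations
plt.show()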
Feature Engineering: Engineer new features or transform existing ones to enhance the predictive power
of the model. Extract relevant information from clinical symptoms and create additional features that may
aid in predicting COVID-19 infection probabilities.
Model Selection and Training: Choose appropriate machine learning algorithms for the task, considering
factors such as interpretability, scalability, and performance. Split the dataset into training and testing sets
to evaluate model performance effectively. Train multiple models, such as logistic regression, decision trees,
random forests, or gradient boosting machines, using the training data.
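A sketch of this step, with placeholder synthetic data standing in for the preprocessed features and labels:
# Model-selection sketch: train several candidates on the same split and compare.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5))          # placeholder symptom matrix
y = (X.sum(axis=1) > 2).astype(int)            # placeholder infection labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # accuracy on held-out data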
Model Evaluation: Evaluate the performance of each model using appropriate evaluation metrics, such as
accuracy, precision, recall, F1-score, and ROC-AUC. Compare the performance of different models and
select the one that achieves the highest predictive accuracy and generalization to unseen data.
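The metrics named above can be computed as in the following sketch, again on placeholder data:
# Evaluation sketch for one trained candidate, using the metrics named above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5))          # placeholder feature matrix
y = (X.sum(axis=1) > 2).astype(int)            # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]     # probability of the positive class
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))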
Hyperparameter Tuning: Fine-tune the hyperparameters of the selected model using techniques such as
grid search or random search to optimize performance further. Conduct cross-validation to ensure the
robustness of the model and mitigate overfitting.
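A grid-search sketch with 5-fold cross-validation; the random-forest grid is an illustrative assumption, not the report's exact configuration.
# Hyperparameter-tuning sketch: grid search with 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5))          # placeholder feature matrix
y = (X.sum(axis=1) > 2).astype(int)            # placeholder labels

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)                   # cross-validation guards against overfitting

print(search.best_params_)
print(search.best_score_)
best_model = search.best_estimator_            # refit on the full training data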
Model Interpretation: Interpret the trained model to understand the relative importance of features and how they contribute to predicting COVID-19 infection probabilities. Visualize model decision boundaries, feature importances, and other relevant insights to provide actionable information for healthcare professionals.
Deployment and Monitoring: Deploy the trained model into a production environment, such as a web
application or API, to make predictions on new patient data. Implement monitoring mechanisms to track
model performance over time, detect drift, and ensure the continued accuracy and reliability of predictions.
Figure 3.1
A symptom-based predictive model was proposed to predict COVID-19 based on symptoms defined by the
WHO and CDC.
Because there is no complete description of symptoms declared by the WHO, we defined a model, based on some existing symptoms, that predicts the disease according to the accuracy given by the model.
We created a symptom database in which rules were created and used as input. These data were then used as raw data, and feature selection took place as part of preprocessing. The data were divided into training data (80% of the data) and test data (20% of the data), a procedure usually known as the train-test split. This split is generally done in a stratified or random manner so that the population distribution in both groups consists of shuffled data, which leads to minimized bias or skewness in the data. Training data were used to train the ML classifier used in the model, and test data were used to test that classifier in terms of accuracy over a predefined unseen portion of the dataset.
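A minimal sketch of this 80/20 stratified, shuffled split, with placeholder data standing in for the prepared features and labels:
# Train-test split sketch matching the 80/20 stratified protocol described above.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(355, 6))   # placeholder: 355 records, 6 features
y = rng.integers(0, 2, size=355)        # placeholder infection labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,      # 20% held out as unseen test data
    stratify=y,          # keep class proportions the same in both groups
    shuffle=True,        # shuffled data minimizes bias or skewness
    random_state=42)     # for reproducibility

print(len(X_train), len(X_test))        # 284 training records, 71 test records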
In our work, the symptoms and patient-class dataset were defined on the basis of symptoms such as fever, cough, and sneezing, whether the patient had traveled to an infected place, age, and whether the patient had a history of disease that could increase the possibility of being infected by the virus.
This dataset was then further divided into two sets (training set and testing set) using the train-test split method. The system was trained on the training set, and the accuracy of the ML classifier was then evaluated over the testing set. Finally, the model was used to predict the probability of infection from the disease using new patient data, in terms of positive or negative.
Classification predicts discrete responses. Here, the algorithm labels examples by choosing among two or more classes. If the choice is between two classes it is called binary classification, and if it is among more than two classes it is called multi-class classification. Applications of classification include handwriting recognition, medical imaging, etc.
Regression predicts continuous responses. Here, the algorithms return a statistical value. For example, a set of data is collected relating the amount of sleep people get to how happy they are; sleep and happiness are both variables, and the analysis makes predictions from them. Popular regression techniques include (a brief sketch of both follows the list):
• Linear regression.
• Logistic regression.
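The sketch below contrasts the two techniques on invented sleep/happiness numbers: a continuous response for linear regression, a yes/no response for logistic regression.
# Regression vs. classification sketch on toy (invented) data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

sleep_hours = np.array([[4], [5], [6], [7], [8], [9]])

# Linear regression: predict a continuous response (a happiness score).
happiness = np.array([3.0, 4.5, 5.5, 7.0, 8.0, 8.5])
lin = LinearRegression().fit(sleep_hours, happiness)
print(lin.predict([[7.5]]))           # continuous value

# Logistic regression: predict a discrete response (happy: yes/no).
is_happy = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(sleep_hours, is_happy)
print(log.predict([[7.5]]))           # class label, e.g. [1]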
The ultimate goal of SVM modelling is to find the optimal hyperplane that separates the clusters, with the target category on one side of the plane and the other category on the other side. The vectors nearest the hyperplane are the support vectors. A typical example of a support vector machine is depicted in Figure 3.2.
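A minimal SVM sketch using the linear-kernel configuration listed in the algorithm configurations later in this report; the data is a synthetic placeholder.
# SVM sketch: fit a linear separating hyperplane and inspect support vectors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # placeholder feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # linearly separable labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

svm = SVC(kernel='linear', random_state=0)       # finds the separating hyperplane
svm.fit(X_train, y_train)

print(len(svm.support_vectors_))                 # vectors nearest the hyperplane
print(svm.score(X_test, y_test))                 # held-out accuracy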
Figure 3.2
Figure 3.3
ANNs are an attempt, in the simplest way, to imitate the neural system of the human brain. The basic units of an ANN are neurons. A neuron performs a function on an input and produces an output. Neurons combined together are called neural networks. Once the neural network is formed, training on the data is started to minimize the error. In the end, an optimizing algorithm is used to further reduce the errors. The layered architecture of Artificial Neural Networks (ANNs) is represented in Figure 3.3.
Artificial Neural Networks (ANNs) are sophisticated machine learning models inspired by the biological neural networks found in human brains. An ANN consists of multiple layers, including an input layer, several hidden layers, and an output layer. Each layer is made up of nodes, or neurons, connected by edges that represent weighted connections. The input layer receives raw data, which each neuron processes, passing a transformed version onwards. This transformation is determined by a combination of node-specific weights and a bias, usually followed by a non-linear activation function that decides whether and how much signal a neuron will forward to subsequent layers.
The real computational power of ANNs lies in their hidden layers, which can extract progressively more abstract features from the input data. Each neuron in these layers transforms its input signals into outputs via activation functions, which are crucial for handling non-linear data and interactions. Learning occurs when the ANN adjusts the weights of the connections between neurons to minimize the difference between the predicted output and the actual target values from the training data, a process typically achieved through backpropagation and optimization algorithms like gradient descent.
The output layer receives the final transformed signals from the last hidden layer and converts them into a format suitable for addressing the specific problem the network is designed to solve, such as classification labels or continuous numerical values. The sophistication of ANNs allows them to tackle a wide range of tasks, from image and speech recognition to predicting stock prices, making them a versatile tool in both commercial and research settings. Their ability to model complex non-linear relationships and learn features autonomously is a significant advantage over more traditional machine learning techniques.
The random sampling and ensemble strategies utilized in RF enable it to achieve accurate predictions as well as better generalization [40]. A random forest consists of a large number of trees; the higher the number of uncorrelated trees, the higher the accuracy. Random Forest classifiers can also help fill in some missing values. Prediction in Random Forests (RFs) is represented in Figure 3.4.
Figure 3.4
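A random-forest sketch on placeholder data; the number of trees is an illustrative choice, not a tuned value.
# Random-forest sketch: an ensemble of trees fit on random samples of the data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 5))    # placeholder symptom matrix
y = (X.sum(axis=1) > 2).astype(int)      # placeholder infection labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=300,        # more (uncorrelated) trees tends to improve accuracy
    random_state=0)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))          # generalization on held-out data
print(rf.feature_importances_)           # relative importance of each feature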
4. SYSTEM ANALYSIS
Epidemiological Models
Compartmental Models:
SIR Model: The most basic form is the Susceptible, Infected, Recovered (SIR) model, which segments the
population into three compartments. The model uses differential equations to estimate the rate at which
individuals move from being susceptible to infected, and from infected to recovered.
SEIR Model: An extension of the SIR model that includes an 'Exposed' category for those who have been
exposed to the virus but are not yet infectious. This model provides a more detailed framework that is
somewhat more predictive of diseases with an incubation period, like COVID-19.
These models use rates derived from historical data on how diseases spread, including factors like contact rates and recovery times, to predict how an infection might progress within a community. They do not, however, account for individual variability or complex interactions between host factors and the pathogen.
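A minimal SIR sketch, assuming SciPy; the contact rate beta and recovery rate gamma are illustrative values, not fitted parameters.
# SIR-model sketch: the three compartments integrated as differential equations.
import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma, N):
    S, I, R = y
    dS = -beta * S * I / N             # susceptible -> infected
    dI = beta * S * I / N - gamma * I  # infected -> recovered
    dR = gamma * I
    return [dS, dI, dR]

N = 1_000_000                          # population size
y0 = [N - 1, 1, 0]                     # one initial infection
t = np.linspace(0, 160, 161)           # days
S, I, R = odeint(sir, y0, t, args=(0.3, 0.1, N)).T

print(f"peak infections: {I.max():.0f} on day {t[I.argmax()]:.0f}")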
Clinical Heuristics
Risk Scoring:
Clinical Algorithms: These are set algorithms or flowcharts that doctors and healthcare professionals use
to determine the likelihood of a patient having a disease based on symptoms, travel history, contact history,
and other clinically relevant information.
Scoring Systems: Tools like the CURB-65 score for pneumonia, which assesses the severity of pneumonia
and the need for hospitalization based on clinical criteria (Confusion, Urea, Respiratory rate, Blood
pressure, and age 65 or older).
These methods rely heavily on the clinical judgment and experience of healthcare providers. They are less precise in predicting individual outcomes but can be quickly implemented without the need for computational tools or complex data analysis.
Figure 4.1
Inflexibility to New Data Types: Non-ML systems cannot easily incorporate different types of
data, such as real-time mobility or social media data, which limits their ability to use all available
information to improve accuracy.
Scalability Challenges: Scaling traditional methods to larger populations or different regions requires significant effort and is prone to errors. They aren't designed to automatically adjust to more complex or larger datasets.
In summary, while traditional methods provide a foundational understanding of disease spread, their
lack of precision, adaptability, and scalability highlights the need for more advanced ML-driven models,
especially in dealing with complex and rapidly evolving pandemics like COVID-19.
Central to the proposed system is its user-friendly interface that simplifies interactions for healthcare
professionals, allowing them to input data and receive predictions effortlessly. This interface will be
seamlessly integrated into existing healthcare IT infrastructures, such as Electronic Health Records (EHR)
systems, ensuring that it enhances rather than disrupts clinical workflows. Furthermore, the system is built
on a scalable cloud-based architecture, which allows it to handle large volumes of data and serve a broad
user base without performance degradation.
Privacy and security are paramount, with strict adherence to data protection regulations and the
implementation of advanced security measures to safeguard patient information. Additionally, mechanisms
are included to detect and mitigate any bias in training data or predictions, ensuring fairness and accuracy.
Overall, the proposed system offers significant improvements over traditional models by providing
more accurate, adaptable, and scalable predictions. It stands to revolutionize how healthcare providers
diagnose and manage COVID-19, improving resource allocation, patient outcomes, and overall public
health response.
Figure 4.2
Advantages
Increased Accuracy: By utilizing advanced machine learning algorithms, the proposed system can
analyze complex data patterns and provide more accurate predictions of COVID-19 infection
probabilities based on clinical symptoms and other health data.
Real-time Adaptability: The system can continuously update and refine its predictions as new data
becomes available, ensuring that it remains effective as the virus evolves or as new health data
comes to light.
Scalability: Built on a cloud-based architecture, the system can efficiently handle increasing
amounts of data and a growing number of users without a decline in performance, making it suitable
for widespread deployment.
User-Friendly Interface: The system features an intuitive interface that integrates seamlessly with
existing healthcare IT systems, making it easy for healthcare professionals to use without requiring
significant training.
Data-Driven Decisions: By providing reliable infection probability assessments, the system aids
healthcare professionals and policymakers in making informed decisions about testing, treatment,
and resource allocation.
Enhanced Public Health Response: The system's ability to predict potential outbreaks and
infection trends helps public health officials proactively manage and mitigate the spread of COVID-
19.
Compliance and Security: Designed with a strong focus on privacy and security, the system
adheres to relevant data protection regulations, ensuring that patient data is handled securely and
ethically.
Bias Mitigation: The proposed system includes mechanisms to detect and address biases in the
training data and model predictions.
These advantages make the proposed system a valuable tool in the ongoing fight against COVID-19,
offering enhanced capabilities to improve diagnosis, treatment, and management of the disease.
Applications
Clinical Decision Support: The system can be used by healthcare professionals to assist in
diagnosing COVID-19 based on symptoms and health data, aiding in the triage process and helping
to prioritize which patients should receive testing and immediate care.
Hospital Resource Management: It can inform hospital administrators about potential spikes in
COVID-19 cases, helping them allocate necessary resources such as ICU beds, ventilators, and
medical staff more efficiently.
Public Health Surveillance: The system can help public health officials monitor and predict the
spread of the virus, aiding in the implementation of targeted public health interventions and policies
such as lockdowns or social distancing measures.
Policy Planning and Evaluation: By providing data-driven insights into infection trends and
response effectiveness, the system can assist policymakers in designing and adjusting public health
policies and strategies.
Vaccine Distribution: The model can help strategize vaccine rollout by identifying regions or
demographics with higher infection probabilities, ensuring vaccines are distributed to those in
greatest need first.
Travel and Quarantine Measures: It can be used to assess the risk levels of different regions or
countries, aiding government agencies and health authorities in making informed decisions about
travel restrictions or quarantine requirements.
Educational Tool for Public Awareness: The system can be employed to educate the public by
providing clear, data-backed information on infection risks, helping to promote responsible
behavior and adherence to health guidelines.
Research and Development: Researchers can utilize the system to conduct epidemiological studies
and clinical research, enhancing the understanding of COVID-19 and potentially influencing the
development of treatments and preventive measures.
These applications demonstrate the versatility and potential impact of the proposed system across various
aspects of healthcare and public health management, highlighting its value in both current and future
infectious disease scenarios.
5. SYSTEM DESIGN
Figure 5.1
Prediction Module
• Real-Time Prediction: When new data is input into the system (e.g., a new patient's
symptoms and demographic details), the model processes this data in real-time to predict
the probability of COVID-19 infection.
• Output Generation: The prediction, along with confidence levels and other relevant
statistical insights, is generated and made ready for presentation in the user interface.
Overall Workflow
The system functions as a dynamic, interactive platform that not only predicts COVID-19 infection
probabilities but also learns from new data and evolves over time. It supports healthcare providers by offering
actionable insights that are immediately applicable in clinical settings, thereby enhancing the efficiency and
effectiveness of medical care and public health management.
Figure 5.2
Module Description
Dataset Collection Module: This module is the initial point of interaction with data,
responsible for the collection and aggregation of a wide array of data sources relevant to COVID-
19 infection prediction. It gathers data from electronic health records (EHRs), clinical reports,
patient surveys, and public health databases. The data encompasses a broad spectrum of
information including demographic details, clinical symptoms, patient medical history, laboratory
test results, and other relevant health indicators. This module ensures that the data is systematically
collected and standardized to facilitate seamless integration and processing in subsequent
modules.
Splitting of Dataset: After the data is collected and initially processed, it is split into different sets to ensure that the machine learning model is trained and tested effectively. Typically, the data is divided into three main subsets (a split sketch follows the list):
5.3.2.1 Training Dataset: This is the largest portion of the data and is used to train the machine
learning models. The model learns to recognize patterns and understand the relationships
between the features and the target variable, which in this case is the probability of
COVID-19 infection.
5.3.2.2 Validation Dataset: This subset is used to fine-tune model parameters and prevent the
model from overfitting. It helps in optimizing the model's performance by providing a
metric for evaluation while tuning the model's hyperparameters.
5.3.2.3 Testing Dataset: This is used after the model has been trained and validated. It serves as
a final, unbiased evaluation of the model to assess its performance on data it has never
seen before, ensuring the predictions are robust and reliable.
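A sketch of the three-way split described above, on placeholder data; the resulting 64/16/20 proportions are an illustrative choice.
# Three-way split sketch: training, validation, and testing subsets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 6))   # placeholder feature matrix
y = rng.integers(0, 2, size=1000)        # placeholder infection labels

# First hold out 20% as the final, unbiased testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Then carve a validation set (20% of the remainder) out of the training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 640 / 160 / 200 records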
Dataset Preprocessing Module: Before the actual training, the data must be cleaned and prepared to improve the model's accuracy and efficiency. The Dataset Preprocessing Module handles:
5.3.3.1 Cleaning: Removing or correcting any inaccuracies or inconsistencies in the data, such
as missing values or duplicate records.
5.3.3.2 Transformation: Converting data into a format suitable for machine learning. This
includes normalizing or scaling numerical values so they fall within a specific range, and
encoding categorical variables into numerical formats.
5.3.3.3 Feature Engineering: Creating new features from existing data to enhance model
performance. This could involve deriving symptom severity scores from existing
symptom data or aggregating multiple features into a single composite indicator.
Training with Algorithm: This module is where the machine learning algorithms come into play. Several algorithms might be evaluated to find the most effective one for predicting COVID-19 infection probabilities, including:
5.3.4.1 Decision Trees and Random Forests: These are useful for handling nonlinear data with
complex interactions.
5.3.4.2 Logistic Regression: Ideal for binary classification tasks like predicting whether a
patient is likely to be infected or not.
5.3.4.3 Neural Networks: Particularly deep learning models, which are capable of modeling
extremely complex patterns in large datasets.
The chosen algorithm is trained on the training dataset, where it learns to predict the infection probability
based on the input features. The model's performance is continually assessed and refined using the
validation set, and finally, its effectiveness is validated on the testing set.
Together, these modules form a robust system for predicting COVID-19 infection probabilities, utilizing
advanced data handling and machine learning techniques to provide valuable insights that can aid in
managing the pandemic more effectively.
6 SOFTWARE DESCRIPTION
Python: Python supports multiple programming paradigms and has a large standard library which provides tools suited to performing various tasks. It is a simple, less-cluttered language with extensive features and libraries, and its different programming facilities were utilized to perform the experiments in this work. In this report, the following Python libraries were used.
Pandas: It is a Python package that provides expressive data structures designed to work with both relational and labelled data. It is an open-source Python library that allows reading and writing data between in-memory data structures and various formats.
NumPy: It is an open-source Python package for scientific computing. NumPy also adds fast array processing capabilities to Python.
Matplotlib: It is an open-source Python package used for making plots and 2D representations. It integrates with Python to give effective and interactive plots for visualization.
TensorFlow: It is an open-source Python library for mathematical computation, designed by the Google Brain Team for machine intelligence.
Sklearn: It is an open-source Python machine learning library designed to work alongside NumPy. It features various machine learning algorithms for classification, clustering, and regression.
For the development of the COVID-19 infection probability prediction project, the team will utilize the Jupyter software environment. Jupyter is an open-source web application that allows users to create and share documents containing live code, equations, visualizations, and narrative text. It supports various programming languages, including Python, R, and Julia, making it suitable for data analysis, machine learning, and scientific computing tasks.
Figure 6.1
Interactive Computing: Jupyter provides an interactive computing environment where users can write and
execute code in a step-by-step manner. This enables iterative development and experimentation with data
analysis techniques and machine learning models.
Notebook Interface: Jupyter notebooks offer a flexible and intuitive interface for organizing code,
visualizations, and explanatory text in a single document. Users can create, edit, and run code cells
interactively, facilitating collaborative work and reproducible research.
Rich Output Support: Jupyter notebooks support rich output formats, including HTML, LaTeX,
Markdown, images, and interactive widgets. This allows for the creation of dynamic and visually appealing
presentations, reports, and data visualizations within the same document.
Extensibility: Jupyter is highly extensible, with a vibrant ecosystem of third-party extensions and plugins
available. Users can customize their environment with additional features and functionality to suit their
specific needs and preferences.
Integration with Data Science Libraries: Jupyter seamlessly integrates with popular data science libraries
and frameworks such as NumPy, Pandas, Matplotlib, Seaborn, Scikit-learn, and TensorFlow. This enables
users to leverage a rich ecosystem of tools and resources for data analysis, visualization, and machine learning
model development.
Version Control and Sharing: Jupyter notebooks support version control systems such as Git, allowing
users to track changes to their code and collaborate with team members effectively. Notebooks can also be
shared publicly or privately via platforms like GitHub, JupyterHub, and Jupyter Notebook Viewer.
Deployment Options: Jupyter notebooks can be deployed locally on a user's machine or on cloud-based
platforms such as JupyterHub, Google Colab, or Microsoft Azure Notebooks. This provides flexibility in
terms of computing resources and scalability for handling large datasets and complex analyses.
By leveraging the capabilities of the Jupyter software environment, the project team aims to streamline the
development and deployment of predictive models for estimating COVID-19 infection probabilities based
on clinical symptoms. Jupyter's interactive and collaborative features will enable efficient exploration of
data, model training, evaluation, and documentation, ultimately contributing to the project's success in
addressing critical healthcare challenges.
Dataset
Data Collection: Data collection was an essential and protracted process. Regardless of the field of research, accuracy of data collection is essential to maintain cohesion. As the clinical information of patients was not publicly available, it was an inflexible and tedious process to collect the data. Various hospitals and health institutes in Sweden and China were approached to obtain the most accurate data, but due to the situation at hospitals, with a heavy inflow of patients with COVID-19, we could not get access to direct information. An intense search was therefore conducted on various databases to gather open-source clinical information of patients diagnosed with COVID-19.
Figure 6.2
Dataset Used
The dataset used to train the model to predict COVID-19 was gathered from open-source data shared by Yanyan Xu in a figshare repository [51]. The dataset contained information about hospitalized patients with COVID-19, including demographic data, signs and symptoms, previous medical records, and laboratory values extracted from electronic records. To train the model with an equal number of negative samples, another dataset from a Kaggle repository was used [4]. The original dataset contained details of the medications prescribed by doctors to treat the disease; as our model does not require such data, those fields were eliminated. The combined dataset is multi-dimensional. Some fields indicate whether a patient was previously diagnosed with a particular disease, such as renal or digestive diseases, while other fields contain precise clinical values obtained earlier. The dataset contains both textual fields and fields with precise numeric values; textual data was encoded with integer values for the experimental setup.
• Imputation of missing values: Missing values in our data were handled using SimpleImputer from the sklearn Python package, replacing them with the column mean (the 'mean' strategy).
• Encoding categorical data: We used the OneHotEncoder class in Python, which handles categorical data with a one-hot (dummy) encoding scheme. A minimal sketch of both steps follows.
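The following is a small illustrative sketch of both preprocessing steps; the column names and values are invented for the example and are not from the project dataset, and the sparse_output argument assumes scikit-learn 1.2 or later:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Fever': ['Yes', 'No', 'Yes'], 'Age': [45.0, None, 62.0]})
# Replace missing numeric values with the column mean (the 'mean' strategy)
df[['Age']] = SimpleImputer(strategy='mean').fit_transform(df[['Age']])
# One-hot encode the categorical column into a dense 0/1 array
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['Fever']])
print(encoded)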
Implementation
The experiment was conducted in Python IDLE, the default integrated development and learning environment for Python. The experiment was conducted in the phases listed below:
• After data collection, the patient data is divided into record sets containing 100, 150, 200, 250, 300, and 355 records respectively.
• A 5-fold cross-validation technique is used to randomize the test dataset and obtain accurate results. The experiment on each machine learning algorithm is conducted with 5-fold cross-validation on each of the record sets (a minimal sketch follows this list).
• The prediction accuracy of each algorithm on each record set is compared and evaluated to select the most suitable algorithm for this dataset.
• A feature importance experiment is conducted to evaluate the influence of each attribute on the classification task.
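A minimal sketch of the 5-fold cross-validation step, assuming a feature matrix X and label vector y built from the encoded dataset, and using the SVC configuration listed in the next subsection:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

model = SVC(kernel='linear', random_state=0)
# 5-fold cross-validated accuracy on the prepared data
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Mean accuracy:', scores.mean())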
Algorithm Configurations
This section lists the configuration of each algorithm. Changes to an algorithm's configuration can affect the results:
• Support Vector Machine: SVC(kernel='linear', random_state=0)
• Artificial Neural Network layers:
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
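For completeness, here is a sketch of how this network could be assembled and trained; the compile and fit settings below (optimizer, loss, batch size, epochs) are assumptions, as the report specifies only the layers:
import tensorflow as tf

ann = tf.keras.models.Sequential()
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
# Binary classification: sigmoid output paired with binary cross-entropy loss
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# ann.fit(X_train, y_train, batch_size=32, epochs=100)  # training call (assumed settings)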
7 SOURCE CODE
Description:
This project focuses on predicting COVID-19 infection probabilities from clinical symptoms such as cough, fever, and cold using the provided dataset. Leveraging the Jupyter software environment, the project involves exploratory data analysis, feature engineering, model training, and evaluation. The dataset contains patient records with symptoms and COVID-19 test results, enabling the development of predictive models to estimate the likelihood of infection.
Workflow:
• Data Loading: Import the clinical symptoms dataset into the Jupyter environment for analysis
and model development.
• Exploratory Data Analysis (EDA): Explore the dataset to understand its structure, feature
distributions, and correlations. Visualize key statistics and relationships between variables using
libraries like Pandas, Matplotlib, and Seaborn.
• Data Preprocessing: Clean the dataset by handling missing values, outliers, and
inconsistencies. Convert categorical variables into numerical representations using encoding
techniques.
• Feature Engineering: Extract relevant features from the dataset and engineer new features if
necessary to capture meaningful information for model training.
• Model Selection: Choose appropriate machine learning algorithms such as logistic regression,
random forest, or support vector machines based on the nature of the problem and dataset
characteristics.
• Model Training: Split the preprocessed dataset into training and testing sets. Train the selected
machine learning models using the training data and evaluate their performance.
• Model Evaluation: Assess the performance of the trained models using evaluation metrics like
accuracy, precision, recall, and F1-score. Compare the performance of different models to select the
best-performing one.
• Predictive Modeling: Deploy the selected model to predict COVID-19 infection probabilities
for new patient records based on their clinical symptoms.
• Model Interpretation: Interpret the trained model's predictions and understand the importance
of different features in determining infection probabilities.
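One simple way to carry out the model interpretation step is to inspect a fitted random forest's feature importances. This sketch assumes a fitted RandomForestClassifier named random_forest and a feature DataFrame X, as in the source code that follows:
import pandas as pd

# Rank features by the fitted forest's impurity-based importances
importances = pd.Series(random_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))  # top 10 features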
Software Requirements:
• Jupyter Notebook: For data analysis, visualization, and model development in a collaborative
and interactive environment.
• Python Libraries: Pandas for data manipulation, Matplotlib and Seaborn for data visualization,
Scikit-learn for machine learning algorithms, and NumPy for numerical computations.
Hardware Requirements:
• Standard computing hardware with sufficient processing power and memory to handle data processing and model training tasks efficiently.
# importing dependencies
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset (the file name here is illustrative; use the actual dataset path)
data = pd.read_csv('covid_symptoms.csv')
print(data.head())
# Visualize the class balance of the target variable
sns.countplot(x='COVID-19', data=data)
plt.show()
Data Preprocessing:
# Correlation heatmap (meaningful once categorical columns are numerically encoded)
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True))
plt.title('Correlation Heatmap')
plt.show()
# Count missing values in each column
missing_values = data.isnull().sum()
print(missing_values)
# Inspect unique values of the categorical (object-typed) columns
for col in data.columns:
    if data[col].dtype == 'object':
        print(col, data[col].unique())
Feature Engineering:
# One-hot encode the symptom columns and keep the target as a 0/1 label
# (a minimal sketch; the original notebook also derived interaction features)
data_encoded = pd.get_dummies(data.drop('COVID-19', axis=1), drop_first=True)
data_encoded['COVID-19'] = (data['COVID-19'] == 'Yes').astype(int)
print(data_encoded.head())
# Train a random forest and evaluate it (the split and imports, missing from
# the extracted code, are filled in here)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X = data_encoded.drop('COVID-19', axis=1)
y = data_encoded['COVID-19']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
random_forest = RandomForestClassifier(random_state=0)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
accuracy = random_forest.score(X_test, y_test)  # mean accuracy on the test set
print("Accuracy:", accuracy)
probabilities = random_forest.predict_proba(X_test)
infection_probabilities = probabilities[:, 1]  # probability of infection (class 1)
print("Infection Probabilities:")
print(infection_probabilities[:5])
In this code:
• Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
• Predict COVID-19 infection probabilities for new patient records (not implemented in the code, just
mentioned as a step).
• Write the model predictions to a CSV file named predicted_infection.csv
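A minimal sketch of that export step, assuming the infection_probabilities array from the code above:
import pandas as pd

# Write the per-record infection probabilities to the CSV file named above
pd.DataFrame({'infection_probability': infection_probabilities}).to_csv('predicted_infection.csv', index=False)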
Output:
Breathing Problem Fever Dry Cough Sore throat Running Nose Asthma \
0 Yes Yes Yes Yes Yes No
1 Yes Yes Yes Yes No Yes
2 Yes Yes Yes Yes Yes Yes
3 Yes Yes Yes No No Yes
4 Yes Yes Yes Yes Yes No
[5 rows x 21 columns]
Missing values:
Breathing Problem 0
Fever 0
Dry Cough 0
Sore throat 0
Running Nose 0
Asthma 0
Chronic Lung Disease 0
Headache 0
Heart Disease 0
Diabetes 0
Hyper Tension 0
Fatigue 0
Gastrointestinal 0
Abroad travel 0
Contact with COVID Patient 0
Interaction_2
0 1
1 0
2 1
3 0
4 1
[5 rows x 23 columns]
Interaction_3 Interaction_4
0 0 0
1 1 0
2 1 0
3 0 0
4 0 1
[5 rows x 25 columns]
Accuracy: 0.984360625574977
Infection Probabilities:
[1. 0. 1. 1. 1.]
8 TESTING
Test Procedure
Testing is performed to identify errors and is used for quality assurance. Testing is an integral part of the entire development and maintenance process. The goal of testing during the design phase is to verify that the specification has been accurately and completely incorporated into the design, and to ensure the correctness of the design itself; for example, the design must not contain any logic faults. If a design fault is not detected before coding commences, the cost of fixing it later is considerably higher. Design faults can be detected by means of inspections and walkthroughs. Testing is one of the most important steps in the software development phase.
MANUAL TESTING
Manual Testing is a type of software testing in which test cases are executed manually by a tester without using any automated tools. The purpose of manual testing is to identify bugs, issues, and defects in the software application. Manual software testing is the most primitive of all testing types, and it helps to find critical bugs in the software application.
Any new application must be tested manually before its testing can be automated. Manual testing requires more effort but is necessary to check automation feasibility. Manual testing concepts do not require knowledge of any testing tool.
Figure 8.1
In the realm of software development and data science, the journey from concept to deployment is often paved with challenges. Among these challenges, perhaps none is as crucial as testing and evaluation. These processes form the bedrock upon which robust, reliable models are built. In this discussion, we delve into the significance of comprehensive testing and evaluation in the development lifecycle, exploring how they enable us to identify and rectify issues, refine models, and ultimately deliver superior solutions.
The Testing Process
Testing serves as a litmus test for the efficacy and reliability of developed models. It encompasses a spectrum of activities aimed at uncovering defects, vulnerabilities, and inaccuracies that may compromise model performance. From unit tests that scrutinize individual components to integration tests that validate system-wide functionality, each phase plays a pivotal role in fortifying the model against potential pitfalls.
One fundamental aspect of testing is the identification of bugs. These anomalies, ranging from syntax
errors to logic flaws, have the potential to undermine the integrity of the model and impede its functionality.
By subjecting the model to rigorous testing protocols, developers can unearth these bugs and institute
corrective measures, thereby enhancing its robustness and resilience.
Furthermore, testing facilitates the validation of model behavior under diverse scenarios. Through
techniques such as boundary testing, stress testing, and regression testing, developers can simulate real-
world conditions and assess how the model responds. This process not only instills confidence in its
performance but also unveils areas for optimization and refinement.
Evaluation: A Lens for Improvement
While testing illuminates the shortcomings of a model, evaluation provides the lens through which these deficiencies are addressed. Evaluation encompasses a multifaceted analysis of model performance, covering metrics such as accuracy, precision, recall, and F1 score. By benchmarking the model against predetermined criteria, evaluators can gauge its efficacy and identify areas for enhancement.
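As a concrete illustration, these metrics can be computed with scikit-learn; the sketch below assumes true labels y_test and predictions y_pred from an earlier train/test split:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Standard binary-classification metrics on held-out test data
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:   ', recall_score(y_test, y_pred))
print('F1 score: ', f1_score(y_test, y_pred))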
One of the key objectives of evaluation is to align model output with domain-specific objectives.
Whether it be minimizing classification errors in medical diagnostics or optimizing recommendation
algorithms in e-commerce, evaluation serves as a compass guiding model refinement. By soliciting
feedback from domain experts and end-users, developers can tailor the model to meet the unique demands
of its application domain.
Moreover, evaluation facilitates the detection of bias and fairness issues inherent in the model. In an
era marked by increasing scrutiny of algorithmic decision-making, ensuring equity and transparency is
paramount. Through techniques such as fairness testing and bias mitigation, evaluators can scrutinize model
behavior for discriminatory patterns and implement corrective measures to uphold ethical standards.
The Iterative Cycle of Improvement
Testing and evaluation are not discrete phases but iterative processes intertwined with model development. As bugs are identified and rectified during testing, the model undergoes continuous refinement based on evaluation feedback. This iterative cycle fosters a culture of continuous improvement, wherein each iteration brings the model closer to its peak performance.
Furthermore, testing and evaluation serve as a mechanism for fostering collaboration and knowledge
sharing within development teams. By soliciting diverse perspectives and harnessing collective expertise,
developers can leverage the synergies of cross-functional collaboration to tackle complex challenges and
drive innovation.
In the dynamic landscape of model development, the significance of documentation and reporting cannot be overstated. These processes serve as the cornerstone of knowledge dissemination, enabling stakeholders to understand, utilize, and contribute to the project. From documenting code and algorithms to preparing user manuals and project reports, this discourse explores how effective documentation and reporting practices underpin successful model development endeavors.
Documenting Code, Algorithms, and Methodologies
At the heart of every model lie the intricacies of code, algorithms, and methodologies. Documenting these elements not only enhances transparency but also facilitates collaboration and knowledge transfer within development teams. By providing detailed explanations of code structure, variable definitions, and algorithmic logic, developers empower their peers to navigate the intricacies of the model with confidence.
Preparing User Manuals and Technical Documentation
In addition to internal stakeholders, effective communication with end-users is paramount for the successful adoption and utilization of the model. User manuals and technical documentation serve as the bridge between the technical intricacies of the model and the practical needs of its users. Through clear, concise instructions and illustrative examples, these documents empower users to leverage the model to its fullest potential.
User manuals offer step-by-step guidance on installation, configuration, and operation, catering to
users with varying levels of technical proficiency. Meanwhile, technical documentation delves into the
underlying principles and methodologies, catering to developers, researchers, and domain experts seeking
deeper insights into the model's workings.
Creating Comprehensive Project Reports and Presentations
At the culmination of the development lifecycle, the creation of a comprehensive project report and presentation is essential for stakeholders to gain a holistic understanding of the endeavor. These artifacts serve as a testament to the journey undertaken, encapsulating the rationale, methodology, findings, and implications of the project.
Project reports provide a detailed narrative of the development process, encompassing key milestones,
challenges encountered, and solutions devised. They offer a comprehensive overview of the project scope,
objectives, and outcomes, enabling stakeholders to assess its impact and relevance within the broader
context.
Meanwhile, presentations distill the essence of the project into a concise, digestible format, facilitating engagement and dialogue with stakeholders. Through compelling visuals, insightful analyses, and effective storytelling, presentations convey the significance of the project and elicit feedback, fostering a culture of collaboration and alignment.
Deployment and Monitoring System Setup
Deploying a model into the target environment requires meticulous planning and execution to ensure seamless integration and optimal performance. Leveraging best practices in deployment automation and containerization, we orchestrated the deployment process to minimize downtime and mitigate risks. By adhering to standardized deployment protocols and leveraging infrastructure-as-code (IaC) principles, we streamlined the deployment pipeline and enhanced repeatability and scalability.
Simultaneously, we established a robust monitoring system to track model performance and identify
anomalies in real-time. Leveraging a combination of monitoring tools, logging frameworks, and custom
alerts, we instituted proactive measures to preemptively detect and address performance degradation,
ensuring the model's reliability and resilience in production environments.
Issue Resolution and Continuous Improvement
Despite meticulous planning, deployment often unveils unforeseen challenges and issues. Through prompt identification and resolution of these issues, we mitigated risks and minimized disruptions to business operations. Leveraging agile methodologies and collaborative problem-solving approaches, we engaged cross-functional teams to diagnose and address issues expediently, thereby ensuring the model's continued functionality and performance.
As the project neared completion, we conducted a comprehensive project review with the team to
reflect on key achievements, challenges, and lessons learned throughout the development journey. Through
structured retrospectives and open dialogue, we identified successes, areas for improvement, and
actionable insights to inform future endeavors.
Building upon the insights gleaned from the project review, we finalized project documentation to
encapsulate the project's scope, objectives, methodologies, and outcomes. By documenting key learnings,
best practices, and recommendations, we ensured knowledge preservation and facilitated knowledge
transfer for future projects and teams.
8.5 Developed models using Python, validated and fine-tuned them, and tested on a small
dataset
Python stands as a cornerstone in the arsenal of data scientists and machine learning engineers,
offering a rich ecosystem of libraries and frameworks for model development. In this narrative, we embark
on a journey of model development, validation, and fine-tuning using Python, leveraging a small dataset as
our testing ground. Through meticulous validation and iterative fine-tuning, we illuminate the path towards
model excellence and efficiency.
Model Development in Python
Harnessing the power of Python, we embarked on the journey of model development, guided by the principles of simplicity, scalability, and modularity. Leveraging libraries such as NumPy, Pandas, and Scikit-learn, we engineered features, trained models, and evaluated their performance with precision and efficiency.
Python's expressive syntax and extensive library support facilitated seamless data preprocessing, model
selection, and evaluation. From data ingestion to model deployment, Python served as our trusted
companion, enabling us to navigate the complexities of the model development lifecycle with agility and
grace.
Validation and Fine-Tuning
With our models taking shape, the next frontier beckoned: validation and fine-tuning. Armed with Python's arsenal of cross-validation techniques, hyperparameter optimization algorithms, and performance metrics, we embarked on a quest to refine and optimize our models iteratively.
Through techniques such as k-fold cross-validation, grid search, and random search, we systematically explored the hyperparameter space, seeking configurations that maximize model performance. Guided by insights gleaned from validation experiments, we fine-tuned our models with surgical precision, balancing bias and variance to achieve the optimal equilibrium.
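As an illustration of this tuning loop, the following sketch combines 5-fold cross-validation with a grid search; the estimator and parameter grid are illustrative assumptions, and X_train, y_train are assumed from a prior train/test split:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values to explore (illustrative choices)
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)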
Testing on a Small Dataset
In the crucible of testing, models are put to the ultimate trial, where their mettle is tested against real-world data. Despite the modest scale of our dataset, testing served as a litmus test for model robustness, generalization, and reliability.
9 RESULT
The performance metric mentioned earlier is utilized to evaluate the performance of the algorithms that were selected after the literature review. The three algorithms identified as most suitable for the classification task of predicting COVID-19 are:
• Support Vector Machine (SVM)
• Artificial Neural Networks (ANN)
• Random Forest (RF)
Experiment Results
Each of the above-stated algorithms was trained with the collected dataset and the results were interpreted. The performance of each algorithm was evaluated at different training-set sizes: each algorithm was trained with record sets containing 100, 150, 200, 250, 300, and 355 records respectively. This experiment was performed to determine which algorithm would be the most suitable for prediction of COVID-19. Also, as the data is split into smaller sets, we could assess which algorithm would perform better with the different datasets available.
The Support Vector Machine (SVM) algorithm was trained with each record set to identify its accuracy at every stage. At all stages, the data was divided into training and test data using k-fold cross-validation (5 folds). SVM achieves an accuracy of 98.33%. The accuracy achieved by the SVM algorithm for every set of records is represented below.
Figure 9.1
The classification accuracy of the Support Vector Machine (SVM) at each record set can be clearly identified from the chart.
Figure 9.2
The Random Forest (RF) algorithm was trained in a similar way with each record set to identify its accuracy at all stages. At all stages, the data was divided into training and test data using k-fold cross-validation (5 folds). RF achieves an accuracy of 99.44%. The classification accuracy of the Random Forest (RF) algorithm for every set of records is represented below.
Figure 9.3
Figure 9.4
The classification accuracy of Random Forest (RF) at each record set can be identified from the chart in the figure, which represents the change in accuracy as each record set is used as training data.
The Artificial Neural Networks (ANN) algorithm was trained on the data with each record set and tested. On implementation, the ANN algorithm achieves a classification accuracy of 99.25%. The classification accuracy reported with each record set is tabulated below.
Figure 9.5
The accuracy of Artificial Neural Networks (ANN) with each record set is represented in the figure below.
Figure 9.6
10 CONCLUSION
10.1 Conclusion
In this research, a systematic literature review was conducted to identify a suitable algorithm for the prediction of COVID-19 in patients. No conclusive evidence was found to single out one algorithm as the most suitable technique for prediction. Hence, a set of algorithms comprising Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Random Forest (RF) was chosen. The selected algorithms were
trained with the patient clinical information. To evaluate the accuracy of machine learning models, each
algorithm is trained with record sets of varying number of patients. Using accuracy performance metric, the
trained algorithms were assessed. After result analysis, Random Forest (RF) showed better prediction
accuracy in comparison with both Support Vector Machine (SVM) and Artificial Neural Networks (ANNs).
The trained algorithms were also assessed to find the features that affect the prediction of COVID-19 in patients. There is considerable scope for machine learning in healthcare. For future work, it is recommended to explore calibrated and ensemble methods that could resolve difficult problems faster, with better outcomes than the existing algorithms. An AI-based application could also be developed using various sensors and features to help identify and diagnose diseases. As healthcare prediction is an essential field for the future, a prediction system could be developed to detect the possibility of outbreaks of novel diseases that could harm mankind, taking socio-economic and cultural factors into consideration.
Data-Driven Predictive Modeling: The project showcases the potential of leveraging machine learning
techniques to predict COVID-19 infection probabilities based on clinical symptoms. By analyzing patient
records and symptom data, predictive models can be developed to estimate the likelihood of infection,
aiding in early detection and intervention efforts.
Exploratory Data Analysis (EDA): EDA plays a crucial role in understanding the characteristics and
patterns present in the dataset. Designers should conduct thorough EDA to identify correlations between
clinical symptoms and COVID-19 test results, uncover potential outliers or missing values, and inform
feature engineering and model selection processes.
Feature Engineering: Effective feature engineering is essential for building accurate predictive models.
Designers should carefully select and engineer features that capture relevant information from the dataset,
such as symptom severity, duration, and frequency, to enhance the model's predictive performance.
Model Training and Evaluation: Model training involves selecting appropriate algorithms, tuning
hyperparameters, and evaluating model performance using suitable metrics such as accuracy, precision,
recall, and F1-score. Designers should experiment with various machine learning algorithms (e.g., logistic
regression, random forest, support vector machines) and ensemble techniques to identify the most effective
model for predicting COVID-19 infection probabilities.
Validation and Generalization: It is crucial to validate the predictive models using cross-validation
techniques to ensure their robustness and generalization to unseen data. Designers should partition the
dataset into training, validation, and test sets, perform cross-validation, and assess model performance on
unseen data to avoid overfitting and underfitting issues.
Continuous Monitoring and Improvement: Given the dynamic nature of the COVID-19 pandemic,
predictive models should be continuously monitored and updated with new data to maintain their accuracy
and relevance over time. Designers should establish a framework for monitoring model performance,
incorporating feedback from healthcare professionals and stakeholders, and iteratively improving model
algorithms and features.
Ethical Considerations: Designers should prioritize ethical considerations, data privacy, and patient
confidentiality throughout the project lifecycle. Ensuring compliance with relevant regulations and
guidelines (e.g., GDPR, HIPAA) is essential to protect sensitive health information and maintain trust and
transparency in predictive modeling efforts.
Overall, the project highlights the potential of data-driven approaches in predicting COVID-19 infection probabilities and informing public health interventions. Continued research and collaboration among data scientists, healthcare professionals, and policymakers can further advance the development and deployment of predictive models to combat the COVID-19 pandemic effectively.
11 REFERENCES
2. David J. Cennimo discusses Coronavirus Disease 2019 (COVID-19). Available from: https://emedicine.medscape.com/article/2500114-overview. (Accessed 25 April 2020).
3. Wang J., Liu Y., Wei Y., Xiu J., Yu T., Zhang X., Zhang L. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet. 2020;395(10223):507–513. [PMC free article] [PubMed] [Google Scholar]
6. Erica Hersh discusses how long the incubation period for the coronavirus is. Available from: https://www.healthline.com/health/coronavirusincubation-period#incubation-period. (Accessed 25 April 2020).
8. Anulekha Ray discusses coronavirus: India's biggest concerns are COVID-19 patients with no symptoms. Available from: https://www.livemint.com/news/india/coronavirus-india-s-biggest-concernsare-covid-19-patients-with-no-symptoms-11587533159071.html. (Accessed 26 April 2020).
9. Teena Thacker discusses no symptoms in 80% of COVID cases raising concern. Available from: https://economictimes.indiatimes.com/industry/healthcare/biotech/healthcare/no-symptoms-in-80-of-covid-cases-raiseconcerns/articleshow/75260387.cms?from=mdr. (Accessed 26 April 2020).
10. Praveen Duddu discusses COVID-19 coronavirus: top ten most affected countries. Available from: https://www.pharmaceuticaltechnology.com/features/covid-19-coronavirus-top-ten-most-affectedcountries/.
12. V. Wang, Coronavirus Epidemic Keeps Growing, But Spread in China Slows. New York Times. Available from: https://www.nytimes.com/2020/02/18/world/asia/chinacoronaviruscases.html?referringSource=articleShare. (Accessed 26 April 2020).
16. COVID-19 cases in Germany. Available from: https://www.worldometers.info/coronavirus/country/Germany/. (Accessed 26 April 2020).
20. Gavin Edwards discusses Machine Learning: An Introduction. Available from: https://towardsdatascience.com/machine-learning-anintroduction-23b84d51e6d0. (Accessed 27 April 2020).