Balaji
A report submitted in partial fulfillment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
Regd. No.: 111421243008
Under the Supervision
of Mr. Hariharan,
ECRREDE Technologies Pvt. Ltd.,
Chennai-602024.
(Duration: 26 June 2023 to 28 July 2023)
ACKNOWLEDGEMENT
I would also like to thank all the people who worked alongside me at ECRREDE, Chennai. With their
patience and openness they created an enjoyable working environment.
It is indeed with a great sense of pleasure and immense gratitude that I
acknowledge the help of these individuals.
(111421243008)
ABSTRACT
Diabetes mellitus is a prevalent chronic disease affecting millions worldwide, leading to severe
health complications if left unmanaged. Early detection and proactive intervention are crucial in
mitigating its impact. This study explores the application of machine learning algorithms for
predicting the onset or progression of diabetes based on a comprehensive dataset containing
clinical, demographic, and lifestyle factors. The research focuses on the development and
evaluation of predictive models utilizing various machine learning techniques, including but not
limited to logistic regression, decision trees, random forests, support vector machines, and neural
networks. Feature selection and engineering methodologies are employed to identify the most
significant risk factors and optimize model performance. The dataset comprises a diverse range of
attributes such as glucose levels, body mass index (BMI), family history, physical activity, and
other health metrics. Through rigorous preprocessing, cross-validation, and hyperparameter tuning,
the models are trained and validated to achieve robustness and accuracy in predicting diabetes risk.
Performance metrics including accuracy, sensitivity, specificity, and area under the receiver
operating characteristic curve (AUC-ROC) are utilized to assess and compare the models'
predictive capabilities. Additionally, interpretability analyses are conducted to understand the
contributions of different features and their impact on the predictive outcomes. The findings of this
research aim to provide insights into leveraging machine learning techniques as effective tools for
early diabetes risk assessment. The developed models hold promise for assisting healthcare
professionals in identifying high-risk individuals, facilitating personalized preventive strategies,
and ultimately improving the management and prognosis of diabetes.
Organisation Information:
ECRREDE Technologies is a software, hardware services, and product development company that
commenced operations in 2013. We provide web design, mobile applications for Android, IT services,
embedded or microcontroller-based services, and digital marketing. We started with the aim of
providing high-quality development services for building web-based enterprise applications.
ECRREDE is a professionally managed company with years of industry experience in developing and
delivering enterprise-specific software and web development solutions using the latest
technologies. Quality is the buzzword in today's world, and no organization can survive without
it. Along with quality, we at ECRREDE "Think Beyond" to stay one step ahead and focus on the
delivery of solutions. We design processes that focus not only on quality but also on delivery,
which increases the value to our global clients. Apart from training our employees on the latest
technologies, we also empower them to deliver exciting solutions to our clients. At its core,
ECRREDE operates in three specific domains: Software Development, Website Design & Development,
and Geographic Information Services.
We have an expert team that supports us in being a well-reputed and sophisticated service provider
in the IT and automation industry and in enhancing client satisfaction. Under each division we
further provide specific industry solutions in focused domains with cutting-edge technologies. We
emphasize building relationships with our clients by delivering projects on time.
Methodologies:
We follow a structured methodology for our projects, running from solution design through to
the implementation phase. A well-planned project reduces delivery time and avoids additional
ad hoc costs for our clients, so we dedicate the majority of our time to understanding our
clients' business and gathering requirements. This ground-up approach helps us not only deliver
the solution to our clients but also add value to their investments.
Under each division we further provide specific industry solutions in focused domains with
cutting-edge technologies. We emphasize building relationships with our clients by delivering
projects on time and within budget.
INDEX
1.1 Modules
2. Problem statement
3. Dataset description
4. Technology
5. Data preprocessing
5.2 Visualization
6. Model Building
8. Conclusion
9. References
Learning Objectives/Internship Objectives
Internships are generally thought to be reserved for college students looking to gain
experience in a particular field. However, a wide range of people can benefit from
training internships in order to gain real-world experience and develop their skills.
An objective for such a position should emphasize the skills you already possess in the area
and your interest in learning more.
Some internships allow individuals to perform scientific research, while others
are specifically designed to give people first-hand working experience.
Internships are a great way to build your resume and develop skills that can be
highlighted when applying for future jobs. When applying for a training
internship, make sure to highlight any special skills or talents that set you
apart from the other applicants, so that you have a better chance of landing the
position.
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES
DATE      DAY        NAME OF THE TOPIC/MODULE COMPLETED
10/07/23  Monday     Machine learning types
11/07/23  Tuesday    Introduction to machine learning algorithms
3rd WEEK
18/07/23  Tuesday    Problem statement for project
19/07/23  Wednesday  Research/Gathering data
20/07/23  Thursday   Data preprocessing
21/07/23  Friday     Model selection and training
1. INTRODUCTION
Diabetes is a chronic disease that directly affects the pancreas, leaving the body unable to produce
or properly use insulin. Insulin is mainly responsible for maintaining the blood glucose level. Many
factors, such as excessive body weight, physical inactivity, high blood pressure, and abnormal
cholesterol levels, can lead to diabetes. It can cause many complications, and increased urination is
one of its most common symptoms. It can damage the skin, nerves, and eyes, and if not treated early,
it can cause kidney failure and diabetic retinopathy, an eye disease. According to International
Diabetes Federation (IDF) statistics, 537 million people worldwide had diabetes in 2021. In
Bangladesh, approximately 7.10 million people had the disease according to 2019 statistics.
Early and accurate diagnosis of diabetes mellitus, especially during its initial development, is
challenging for medical professionals. Artificial intelligence and machine learning techniques can
provide a reference that helps them gain preliminary knowledge about the disease and reduces their
workload. A significant amount of research has been carried out on predicting diabetes automatically
using machine learning and ensemble techniques. Most of these works employed the open-source Pima
Indian dataset. Some of them are briefly discussed in the following paragraphs.
For instance, Kumar et al. used the random forest algorithm to design a system that can predict
diabetes quickly and accurately. The dataset used in this work was collected from the UCI Machine
Learning Repository. The authors first applied conventional data preprocessing techniques, including
data cleaning, integration, and reduction. The random forest algorithm reached an accuracy of 90%,
which was much higher than that of the other algorithms considered. In a recent paper, Mohan and Jain
used the SVM algorithm to analyze and predict diabetes with the help of the Pima Indian Diabetes
Dataset. This work used four types of kernels: linear, polynomial, RBF, and sigmoid. The accuracies
obtained varied with the kernel, ranging between 0.69 and 0.82, and the SVM with the radial basis
function kernel obtained the highest accuracy of 0.82.
Goyal and his team created a smart home health monitoring scheme to detect diabetes, also employing
the Pima Indian dataset. For predicting blood pressure status they used conditional decision making,
and for predicting diabetes they used SVM, KNN, and decision tree classifiers. Among these models,
SVM worked best, with an accuracy of 75%. Hassan et al. attempted to predict diabetes using different
ensemble-based machine learning algorithms and the Pima Indian dataset. The authors used AUC (area
under the ROC curve) as their performance measure, and the proposed ensemble classifier achieved an
AUC of 0.95. Jackins et al. proposed a multi-disease prediction system, including diabetes, using
machine learning techniques and the Pima Indian dataset. According to the authors, Naive Bayes
performed better than the random forest technique, with an accuracy improvement of 0.43%.
Mounika et al. estimated diabetes probabilities using machine learning techniques. This work
employed the public Pima Indian dataset and multiple machine learning frameworks. Kumari et al.
applied a soft voting classifier-based ensemble approach to diabetes prediction. The proposed soft
voting classifier attained the highest overall accuracy and F1 score, 0.791 and 0.716 respectively.
Prabhu and Selvabharathi used the open-source Pima Indian diabetes dataset for predicting diabetes
with a deep belief network model. The authors constructed the model in three phases: data
preprocessing using min-max normalization, constructing the network model, and fine-tuning on the
test dataset to remove any partiality using NN-FF classification. The authors carried out all
implementation and simulation of the model in MATLAB. They reported an F1 score of 0.808, the best
result among the classification methods they compared.
Predicting diabetes involves using various data points to forecast the likelihood of an individual developing
the condition. This process often involves analyzing historical health data, lifestyle factors, and other
relevant information to create a predictive model.
The aim of these predictive methods is to identify individuals at higher risk of developing diabetes to enable
early intervention and lifestyle modifications. Early identification can lead to proactive measures like
personalized dietary plans, exercise routines, and medication, ultimately helping to prevent or delay the onset
of diabetes and its complications. While machine learning is a powerful tool for solving problems, improving
business operations and automating tasks, it's also a complex and challenging technology, requiring deep
expertise and significant resources. Choosing the right algorithm for a task calls for a strong grasp of
mathematics and statistics. Training machine learning algorithms often involves large amounts of good
quality data to produce accurate results. The results themselves can be difficult to understand -- particularly
the outcomes produced by complex algorithms, such as the deep learning neural networks patterned after the
human brain. And ML models can be costly to run and tune. Still, most organizations either directly or
indirectly through ML-infused products are embracing machine learning. According to the "2023 AI and
Machine Learning Research Report" from Rackspace Technology, 72% of companies surveyed said that AI
and machine learning are part of their IT and business strategies, and 69% described AI/ML as the most
important technology. Companies that have adopted it reported using it to improve existing processes (67%),
predict business performance and industry trends (60%) and reduce risk (53%).
1.1 Module Description:
1. Data Collection:
- Gather a dataset containing relevant features (such as glucose levels, blood pressure, BMI, age, etc.)
and the target variable indicating diabetes presence (1 for diabetic, 0 for non-diabetic).
- You can find datasets on platforms like Kaggle, UCI Machine Learning Repository, or
healthcare repositories.
2. Data Preprocessing:
- Clean the data by handling missing values, outliers, and encoding categorical variables if necessary.
- Split the data into training and testing sets.
3. Feature Selection:
- Identify significant features that influence diabetes prediction.
- Perform feature scaling if required.
4. Model Selection:
- Choose appropriate machine learning algorithms for classification (e.g., Logistic Regression,
Decision Trees, Random Forest, SVM, etc.).
- Train multiple models to see which one performs better with your dataset.
5. Model Training:
- Fit the chosen models on the training data.
6. Model Evaluation:
- Evaluate model performance using metrics like accuracy, precision, recall, F1-score, and ROC
curve analysis.
- Tune hyperparameters to improve model performance through techniques like cross-validation or grid
search.
7. Deployment:
- Once you have a satisfactory model, create an interface (web app, mobile app, or API) for users to
input data and get predictions.
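The steps above can be sketched end to end as follows. This is a minimal illustration rather than the project's actual code: it uses synthetic data from scikit-learn's `make_classification` in place of a real diabetes dataset, and the choice of logistic regression, the `C` grid, and the F1 scoring are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1 stand-in: synthetic binary-classification data (not real patient data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=42)

# Step 2: hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Steps 3-5: scaling and the classifier live in one pipeline, so the scaler
# is only ever fitted on the training portion of each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 6: tune the regularisation strength with 5-fold cross-validated grid search.
search = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

best_C = search.best_params_["clf__C"]
test_f1 = search.score(X_test, y_test)  # uses the same F1 scorer as the search
print("Best C:", best_C)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping preprocessing inside the pipeline means the held-out test set plays no part in scaling or tuning, which gives a more honest estimate of performance.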
2. PROBLEM STATEMENT
Diabetes is a chronic health condition affecting millions worldwide, characterized by abnormal levels of
glucose in the blood. Early detection and intervention play a crucial role in preventing complications such
as cardiovascular diseases, kidney damage, and vision impairment. This project focuses on leveraging
machine learning models to predict the probability of an individual developing diabetes based on factors
such as:
Glucose levels
Blood pressure
Body Mass Index (BMI)
Age
Family history
Physical activity levels, etc.
Early Intervention: Identifying individuals at risk enables early intervention strategies, including lifestyle
modifications and preventive healthcare measures.
Resource Optimization: Healthcare resources can be better allocated by targeting high-risk individuals for
more frequent check-ups and proactive care.
Improved Patient Outcomes: Early detection allows for timely management, potentially reducing the severity
of complications associated with diabetes.
Societal Impact:
Reduced Healthcare Costs: Prevention and early intervention can lead to reduced healthcare expenditures
associated with diabetes-related complications.
Enhanced Quality of Life: Timely identification and management contribute to a better quality of life for
affected individuals.
Public Health Initiatives: Insights from predictive models can inform public health policies and programs
aimed at diabetes prevention and education.
3. DATA DESCRIPTION
When working on a diabetes prediction machine learning project, the dataset used typically contains
various health-related features and a target variable indicating the presence or absence of diabetes.
Dataset Description:
Source of the Dataset:
The dataset might have been obtained from sources like healthcare repositories, research studies, or public
datasets available on platforms like Kaggle, UCI Machine Learning Repository, or healthcare organizations'
databases.
Features Included:
The dataset comprises several features that serve as inputs to the machine learning model for diabetes
prediction. Common features might include:
Glucose Levels: Blood sugar levels measured during fasting or through glucose tolerance tests.
Blood Pressure: Systolic and diastolic blood pressure readings.
BMI (Body Mass Index): A measure indicating body fat based on height and weight.
Age: Age of the individuals in the dataset.
Family History: Presence or absence of diabetes in the family.
Physical Activity: Level of physical activity or exercise habits.
Skin Thickness, Insulin Levels, etc.: Additional health indicators that might contribute to predicting
diabetes risk.
Target Variable:
The target variable in this dataset is typically binary, representing the presence or absence of diabetes. It is
often encoded as 1 for diabetic and 0 for non-diabetic.
4. TECHNOLOGY
PYTHON
Python is a versatile, high-level programming language known for its simplicity and readability. Created
by Guido van Rossum and first released in 1991, Python emphasizes code readability and a clean syntax,
allowing programmers to express concepts in fewer lines of code than in many other languages. It
supports multiple programming paradigms, including procedural, object-oriented, and functional
programming styles. It is popular, easy to learn, and future-ready.
Another advantage of Python is that it makes switching domains easy. Python offers popular frameworks
such as Django and Flask for backend development, Tkinter for GUI development, and Pygame for game
development. For machine learning in Python, scikit-learn (sklearn) is the library to learn first. It is
a modern machine learning library written in Python that makes it easy to try out different ML
algorithms on your own data.
Based on the method of learning, machine learning is mainly divided into four types: supervised,
unsupervised, semi-supervised, and reinforcement learning. Commonly used machine learning algorithms
include:
Linear Regression
Logistic Regression
Linear Discriminant Analysis
Classification and Regression Trees
Naive Bayes
K-Nearest Neighbors (KNN)
Learning Vector Quantization (LVQ)
Support Vector Machines (SVM)
Random Forest
Boosting
AdaBoost
PYTHON LIBRARIES
Pandas: Pandas is an open-source Python library made mainly for working with relational or
labeled data easily and intuitively. It provides various data structures and operations for manipulating
numerical data and time series.
Matplotlib: Matplotlib is a low-level Python library used for data visualization. It is easy to use
and emulates MATLAB-like graphs and visualizations.
Scikit-learn: Offers a wide range of tools for machine learning, including classification, regression,
and clustering. It is beginner-friendly and well documented.
TensorFlow & Keras: TensorFlow is an open-source library developed by Google for machine learning;
Keras is an API that runs on top of TensorFlow, simplifying the process of building neural networks.
PyTorch: Developed by Facebook (now Meta), it is used for applications such as natural language
processing and computer vision. It is known for its flexibility and dynamic computation graphs.
5. DATA PREPROCESSING
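import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
sns.set()
warnings.filterwarnings('ignore')
# Jupyter notebook magic: render plots inline
%matplotlib inline
# Load the dataset and preview the first five rows
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()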
Output:
diabetes_df.columns
Output:
diabetes_df.info()
Output:
diabetes_df.describe()
Output:
diabetes_df.isnull().head(10)
Output:
diabetes_df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
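Although `isnull().sum()` reports no nulls, the Pima Indian diabetes data is commonly noted to use 0 as a placeholder in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI, where a true zero is not physiologically plausible. The report later refers to a `diabetes_df_copy` frame whose construction is not shown. The sketch below is one plausible way to build such a copy, shown on a made-up toy frame. The column subset, the toy values, and the median fill are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with column names matching the report's dataset (values are made up).
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a value of 0 is treated as "not recorded" (assumption).
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Work on a deep copy so the raw data stays untouched.
toy_copy = toy_df.copy(deep=True)
toy_copy[zero_as_missing] = toy_copy[zero_as_missing].replace(0, np.nan)
missing_counts = toy_copy.isnull().sum()
print(missing_counts)

# Fill each gap with the column median (one common choice among several).
toy_copy[zero_as_missing] = toy_copy[zero_as_missing].fillna(toy_copy[zero_as_missing].median())
print(toy_copy)
```

The median is less sensitive to extreme readings than the mean, which is why it is often preferred for skewed columns such as Insulin.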
p = diabetes_df.hist(figsize = (20,20))
Output:
Plotting the data distributions before removing null values:
p = diabetes_df.hist(figsize = (20,20))
Output:
Checking the class balance shows that the number of diabetic patients is roughly half the number of non-diabetic patients:
Output:
0 500
1 268
Name: Outcome, dtype: int64
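The counts above have the form returned by the pandas `value_counts()` method. The call itself is not shown in the report, so the sketch below is an assumption about how such a check can be made. It rebuilds an `Outcome` column from the reported counts and also prints the class proportions.

```python
import pandas as pd

# Outcome column rebuilt from the counts reported above (500 zeros, 268 ones).
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
proportions = outcome.value_counts(normalize=True).round(3)
print(counts)
print(proportions)
```

About 65% of the records belong to class 0, so a model that always predicts "non-diabetic" would already reach roughly 65% accuracy on the full data. Reported accuracies are best read against that baseline.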
A distribution plot (distplot) shows how the values in a column are distributed, while a boxplot makes the
outliers in that column visible.
Output:
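A boxplot conventionally marks as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in this small sketch. The insulin readings are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up insulin readings with two unusually large values (illustrative only).
insulin = pd.Series([94, 168, 88, 543, 846, 175, 230, 83, 96, 235], name="Insulin")

# Quartiles and interquartile range.
q1, q3 = insulin.quantile(0.25), insulin.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the standard boxplot rule.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = insulin[(insulin < lower) | (insulin > upper)]
print(outliers.tolist())
```

Here the rule flags the two largest readings. Whether such points are errors or genuine extreme values is a judgment that depends on the clinical context.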
5.3 Correlation between all the features:
plt.figure(figsize=(12,10))
p = sns.heatmap(diabetes_df.corr(), annot=True,cmap ='RdYlGn')
Output:
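Besides the heatmap, the correlation matrix can be read off directly to rank features by their linear association with the target. The sketch below does this on a made-up toy frame. The values are illustrative and the resulting ranking says nothing about the real dataset.

```python
import pandas as pd

# Toy data: Glucose rises with Outcome, BloodPressure does not (made-up values).
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 148, 183, 197],
    "BloodPressure": [66, 66, 74, 72, 64, 70],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation between every pair of columns.
corr_matrix = toy.corr()

# Correlation of each feature with the target, strongest first.
target_corr = corr_matrix["Outcome"].drop("Outcome").sort_values(ascending=False)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value does not by itself mean a feature is useless to a tree-based model.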
diabetes_df_copy.head()
Output:
After Standard scaling
sc_X = StandardScaler()
# Scale every feature column (all columns except the Outcome target)
features = diabetes_df_copy.drop(["Outcome"], axis=1)
X = pd.DataFrame(sc_X.fit_transform(features), columns=features.columns)
X.head()
Output:
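Standard scaling subtracts each column's mean and divides by its standard deviation, so every scaled column ends up with mean 0 and unit variance. A small check on made-up values, not the report's data, illustrates this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two made-up feature columns on very different scales.
toy_X = pd.DataFrame({
    "Glucose": [85.0, 148.0, 183.0, 89.0],
    "Age": [31.0, 50.0, 32.0, 21.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy_X), columns=toy_X.columns)

# StandardScaler uses the population standard deviation (ddof=0).
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)
print(scaled.round(3))
```

In a train/test workflow the scaler is usually fitted on the training split only and then applied to the test split with `transform`, so that no information from the test data leaks into preprocessing.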
6. MODEL BUILDING
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
Split the data into training and testing sets using the train_test_split function.
Build the model using the Random Forest algorithm.
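The split and training code is not shown in the report, so the following is a sketch under stated assumptions rather than the original implementation. It uses synthetic data with the same shape as the dataset (768 rows, 8 features). The values `test_size=0.33`, `n_estimators=200`, and the random seeds are assumptions. A 33% split of 768 rows gives 254 test rows, which is consistent with the reported accuracy scores being multiples of 1/254.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the report's data (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=7
)

# Hold out 33% of the rows for testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=7
)

# Fit a random forest on the training portion only.
rfc = RandomForestClassifier(n_estimators=200, random_state=7)
rfc.fit(X_train, y_train)

train_acc = accuracy_score(y_train, rfc.predict(X_train))
test_acc = accuracy_score(y_test, rfc.predict(X_test))
print(f"train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Comparing the two numbers is a quick overfitting check: a random forest typically fits its training data almost perfectly, so only the test accuracy says anything about generalization.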
from sklearn import metrics
# Predict on the held-out test set and report the accuracy
predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy_Score = 0.7677165354330708
Decision tree classifier.
predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy Score = 0.7322834645669292
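Accuracy alone can hide how a model treats each class. The abstract also names sensitivity, specificity, and AUC-ROC, which can be computed as in the sketch below. The labels and scores are hand-made for ten hypothetical patients and are not results from the report.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hand-made labels, predictions, and scores for ten patients (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.7, 0.45, 0.35]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # share of diabetic patients correctly identified
specificity = tn / (tn + fp)  # share of non-diabetic patients correctly identified
auc = roc_auc_score(y_true, y_score)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} auc={auc:.3f}")
```

In this toy case the accuracy is 0.70 while the sensitivity is only 0.50, meaning half of the diabetic patients are missed. For a screening task that figure matters more than overall accuracy.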
XGBoost classifier.
Building model using XGBoost
Output:
Output:
We take sample rows from both the head and the tail of the data to test whether the model is good enough to give
the correct output.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
Passing a data point to the model returns either 0 or 1, i.e. the person is predicted to be non-diabetic (0) or diabetic (1).
Output:
array([0], dtype=int64)
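The prediction call itself is not shown in the report. As a minimal sketch, a single patient can be passed to a fitted scikit-learn classifier as a one-row table with the same columns used for training. The three-feature toy model, its training values, and the `new_patient` row are all hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Glucose", "BMI", "Age"]

# Tiny made-up training set, only to obtain a fitted model for the demo.
train_X = pd.DataFrame(
    [[85, 26.6, 31], [89, 28.1, 21], [116, 25.6, 30],
     [148, 33.6, 50], [183, 23.3, 32], [197, 30.5, 53]],
    columns=feature_names,
)
train_y = [0, 0, 0, 1, 1, 1]
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(train_X, train_y)

# One new patient, passed as a single-row DataFrame with the same columns.
new_patient = pd.DataFrame([[166, 25.8, 51]], columns=feature_names)

predicted_class = demo_model.predict(new_patient)        # array holding one label, 0 or 1
predicted_proba = demo_model.predict_proba(new_patient)  # one row: [P(class 0), P(class 1)]
print(predicted_class, predicted_proba)
```

`predict_proba` is often more useful than the hard label in a clinical setting, because it lets the decision threshold be chosen to favor sensitivity over specificity.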
Random forest is therefore the best of the models compared here, with an accuracy score of about 0.77 against 0.73 for the decision tree.
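A gap of a few percentage points on a single train/test split corresponds to only a handful of test rows, so a comparison based on cross-validation is usually more reliable. The sketch below compares three tree-based models with stratified 5-fold cross-validation on synthetic data. scikit-learn's `GradientBoostingClassifier` is used here in place of XGBoost so that the example depends on scikit-learn alone. The data, seeds, and hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (768 rows, 8 features).
X_cv, y_cv = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=1
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

# The same shuffled, stratified folds are reused for every model.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=folds, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores of two models differ by less than their fold-to-fold spread, the data does not clearly favor either one.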
CONCLUSION
Using these patient records, we built machine learning models, of which random forest performed
best, to predict whether or not the patients in the dataset have diabetes. Along the way, we were
able to draw some insights from the data through data analysis and visualization. The
application of machine learning techniques for diabetes prediction presents a promising avenue for early
identification and proactive management of this prevalent health condition. Through the analysis of
health-related parameters such as glucose levels, blood pressure, BMI, and other factors, this project
aimed to predict the likelihood of an individual developing diabetes.
Key Findings:
Model Performance: Several machine learning algorithms were employed and evaluated, with certain
models showcasing higher predictive accuracy and performance metrics.
Feature Importance: Certain features, such as glucose levels and BMI, emerged as significant
indicators in predicting diabetes onset, contributing significantly to the model's predictive power.
Ethical Considerations: Ensuring data privacy, fairness, and transparency in model deployment
remains a critical consideration in utilizing predictive models in healthcare.
Implications:
Early Intervention: The developed model holds potential for early identification of individuals at risk,
facilitating timely intervention and preventive measures.
Healthcare Resource Allocation: Targeting high-risk populations identified by the model can
optimize healthcare resources for better management and improved outcomes.
Public Health Initiatives: Insights derived from this predictive model can inform public health
policies and educational campaigns aimed at diabetes prevention.
Future Directions:
BIBLIOGRAPHY
The following books were referred to during the analysis and execution phases of the project.
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
"Fundamentals of Machine Learning for Predictive Data Analytics" by John D. Kelleher, Brian Mac Namee,
and Aoife D'Arcy
Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst
Appl. 2017;9(01):1
WEBLINKS
1. https://www.codewithharry.com/blogpost/complete-ml-roadmap-for-beginners/
- covering all the most important machine learning concepts. This tutorial is primarily for new users.
3. https://www.simplilearn.com/tutorials/machine-learning-tutorial- Machine
learning course/