Balaji

INTERNSHIP REPORT
A report submitted in partial fulfillment of the requirements for the Award of Degree of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE & DATA SCIENCE
By
CHEBOLU BALAJI KUMAR
Regd. No.: 111421243008
Under the Supervision
of Mr. Hariharan,
ECRREDE Technologies Pvt. Ltd.,
Chennai-602024.
(Duration: 26th June, 2023 to 28th July, 2023)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

PRATHYUSHA ENGINEERING COLLEGE
(An Autonomous Institution)
Approved by AICTE, Permanently affiliated to Anna University, Tiruvallur
POONAMALLEE, TIRUVALLUR,
CHENNAI-602025
ACKNOWLEDGEMENT
First, I would like to thank Mr. Balaji, Head of ECRREDE, Chennai, for giving me the opportunity
to do an internship within the organization.
I would also like to thank all the people who worked along with me at ECRREDE, Chennai; with their
patience and openness they created an enjoyable working environment.
It is indeed with a great sense of pleasure and immense gratitude that I acknowledge the help of
these individuals.
I am highly indebted to Director Dr. P.M. Beulah Devamalar and Principal Dr. Ramesh Babu for the
facilities provided to accomplish this internship.
I would like to thank my Head of the Department Dr. R. Kannamma, Dr. S. Madhusudhanan,
Ms. R. Anitha and Ms. C. Kamatchi for their unwavering support and guidance in securing and
completing the internship at the aforementioned organization.
I am extremely grateful to my department staff members and friends who helped me in the
successful completion of this internship.
CHEBOLU BALAJI KUMAR
(111421243008)
ABSTRACT
Diabetes mellitus is a prevalent chronic disease affecting millions worldwide, leading to severe
health complications if left unmanaged. Early detection and proactive intervention are crucial in
mitigating its impact. This study explores the application of machine learning algorithms for
predicting the onset or progression of diabetes based on a comprehensive dataset containing
clinical, demographic, and lifestyle factors. The research focuses on the development and
evaluation of predictive models utilizing various machine learning techniques, including but not
limited to logistic regression, decision trees, random forests, support vector machines, and neural
networks. Feature selection and engineering methodologies are employed to identify the most
significant risk factors and optimize model performance. The dataset comprises a diverse range of
attributes such as glucose levels, body mass index (BMI), family history, physical activity, and
other health metrics. Through rigorous preprocessing, cross-validation, and hyperparameter tuning,
the models are trained and validated to achieve robustness and accuracy in predicting diabetes risk.
Performance metrics including accuracy, sensitivity, specificity, and area under the receiver
operating characteristic curve (AUC-ROC) are utilized to assess and compare the models'
predictive capabilities. Additionally, interpretability analyses are conducted to understand the
contributions of different features and their impact on the predictive outcomes. The findings of this
research aim to provide insights into leveraging machine learning techniques as effective tools for
early diabetes risk assessment. The developed models hold promise for assisting healthcare
professionals in identifying high-risk individuals, facilitating personalized preventive strategies,
and ultimately improving the management and prognosis of diabetes.
Organisation Information:
ECRREDE Technologies is a software, hardware services, and product development company that
commenced operations in 2013. We provide extraordinary web designs and mobile applications for
Android, along with IT services, embedded or microcontroller-based services, and digital marketing.
We started with the aim of providing high-quality development services across the web to build
enterprise-grade applications. ECRREDE is a professionally managed company with years of
industry experience in developing and delivering enterprise-specific software and web
development solutions using the latest technologies. Quality is the buzzword in today's world,
without which no organization can survive. Along with quality, we at ECRREDE "Think Beyond" to
take one step ahead and focus on delivery of the solutions. We design processes that focus not
only on quality but also on delivery, which increases the value to our global clients. Apart from
training our employees on the latest technologies, we also empower them to deliver exciting
solutions to our clients. At the core, ECRREDE operates in three specific domains, namely Software
Development, Website Design & Development, and Geographic Information Services.
Programs and opportunities:
Our ground-up approach helps us deliver not only the solution to our clients but also added
value. At the core, ECRREDE operates in specific domains, namely software development, web
design, and mobile applications for Android, along with IT services, embedded or
microcontroller-based services, and digital marketing. We started with the aim of providing
high-quality development services across the web to build enterprise-grade applications. We have
an expert team that supports us in being a well-reputed and sophisticated service provider in the
IT and automation industry and in enhancing client satisfaction. Under each division we further
provide specific industry solutions in focused domains with cutting-edge technologies. We
emphasize building relationships with our clients by delivering projects on time.
Methodologies:
We follow a structured methodology for our projects, which runs from designing the solution to
the implementation phase. A well-planned project reduces the delivery time and any additional
ad hoc costs to our clients; hence we dedicate the majority of our time to understanding our
clients' business and gathering requirements. This ground-up approach helps us deliver not only
the solution to our clients but also add value to their investments.
Key parts of the report:
Under each division we further provide specific industry solutions on focused domains with
cutting edge technologies.
Benefits of ECRREDE Technologies through our report:
Under each division we further provide specific industry solutions in focused domains with
cutting-edge technologies. We emphasize building relationships with our clients by delivering
projects on time and within budget.
INDEX
S.no CONTENTS

1. Introduction
1.1 Modules
2. Problem statement
3. Dataset description
4. Technology
5. Data preprocessing
5.1 Importing libraries
5.2 Visualization
5.3 Correlation between all features
5.4 Data scaling
6. Model Building
7. Results and Evaluation
8. Conclusion
9. References
Learning Objectives/Internship Objectives
- Internships are generally thought to be reserved for college students looking to gain
experience in a particular field. However, a wide array of people can benefit from
training internships in order to receive real-world experience and develop their skills.
- An objective for this position should emphasize the skills you already possess in the area
and your interest in learning more.
- Internships are utilized in a number of different career fields, including architecture,
engineering, healthcare, economics, advertising and many more.
- Some internships allow individuals to perform scientific research, while others
are specifically designed to give people first-hand work experience.
- Utilizing internships is a great way to build your resume and develop skills that can be
emphasized in your resume for future jobs. When you are applying for a training
internship, make sure to highlight any special skills or talents that can make you stand
apart from the rest of the applicants so that you have an improved chance of landing the
position.
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES
WEEK      DATE      DAY        NAME OF THE TOPIC/MODULE COMPLETED
1st WEEK  26/06/23  Monday     Introduction to the Python language
          27/06/23  Tuesday    Learning standard deviation
          28/06/23  Wednesday  Introduction to linear algebra
          29/06/23  Thursday   Basics of linear algebra
          30/06/23  Friday     Introduction to statistics
          01/07/23  Saturday   Basics of statistics
2nd WEEK  03/07/23  Monday     Python libraries & frameworks
          04/07/23  Tuesday    Introduction to machine learning
          05/07/23  Wednesday  How machine learning works
          06/07/23  Thursday   Understanding machine learning
          07/07/23  Friday     Machine learning applications in real time
          08/07/23  Saturday   Machine learning with Python
3rd WEEK  10/07/23  Monday     Machine learning types
          11/07/23  Tuesday    Introduction to machine learning algorithms
          12/07/23  Wednesday  Understanding machine learning algorithms
          13/07/23  Thursday   Supervised learning & how its algorithms work
          14/07/23  Friday     Unsupervised learning & how its algorithms work
4th WEEK  17/07/23  Monday     Reinforcement learning & how its algorithms work
          18/07/23  Tuesday    Problem statement for project
          19/07/23  Wednesday  Research/gathering data
          20/07/23  Thursday   Data preprocessing
          21/07/23  Friday     Model selection and training
5th WEEK  24/07/23  Monday     Testing and evaluation
          25/07/23  Tuesday    Model deployment
          26/07/23  Wednesday  Project submission
1. INTRODUCTION
Diabetes is a chronic disease that directly affects the pancreas, leaving the body unable to produce
or properly use insulin. Insulin is mainly responsible for maintaining the blood glucose level. Many
factors, such as excessive body weight, physical inactivity, high blood pressure, and abnormal
cholesterol levels, can cause a person to be affected by diabetes. It can cause many complications,
of which an increase in urination is one of the most common. It can damage the skin, nerves, and
eyes, and if not treated early, diabetes can cause kidney failure and diabetic retinopathy, an
ocular disease. According to IDF (International Diabetes Federation) statistics, 537 million people
around the world had diabetes in 2021. In Bangladesh, approximately 7.10 million people suffered
from this disease, according to 2019 statistics. Early and accurate diagnosis of diabetes mellitus,
especially during its initial development, is challenging for medical professionals. Artificial
intelligence and machine learning techniques can provide a reference that helps them gain
preliminary knowledge about this disease and reduces their workload accordingly.

A significant amount of research has been performed on predicting diabetes automatically using
machine learning and ensemble techniques. Most of these works employed the open-source Pima Indian
dataset. Some of these articles are briefly discussed in the following paragraphs. For instance,
Kumar et al. used the random forest algorithm to design a system that can predict diabetes quickly
and accurately. The dataset used in this work was collected from the UCI learning repository. First,
the authors used conventional data preprocessing techniques, including data cleaning, integration,
and reduction. The accuracy level was 90% using the random forest algorithm, which is much higher
than that of the other algorithms. In a recent paper, Mohan and Jain used the SVM algorithm to
analyze and predict diabetes with the help of the Pima Indian Diabetes Dataset. This work used four
types of kernels (linear, polynomial, RBF, and sigmoid) to predict diabetes in the machine learning
platform. The authors obtained different accuracies with different kernels, ranging between 0.69
and 0.82. The SVM technique with the radial basis kernel function obtained the highest accuracy of
0.82.

Goyal and his team created a smart home health monitoring scheme to detect diabetes. The authors
also employed the Pima Indian dataset for their research. For predicting blood pressure status, they
used conditional decision making, and for predicting diabetes, they used SVM, KNN, and decision
tree. Among these models, SVM worked best, with 75% accuracy, which is better than the other
classifier algorithms. Hassan et al. attempted to predict diabetes using different ensemble
method-based machine learning algorithms and the Pima Indian dataset. The authors considered AUC
(area under the ROC curve) as their accuracy measure. Finally, the proposed ensemble classifier
achieved an AUC value of 0.95. Jackins et al. proposed a multi-disease prediction system, including
diabetes, using machine learning techniques and the Pima Indian dataset. According to the authors,
Naive Bayes performed better than the random forest technique, with an accuracy increment of 0.43%.

Mounika et al. estimated diabetes probabilities using machine learning techniques. This work
employed the public Pima Indian dataset and multiple machine learning frameworks. Kumari et al.
applied a soft voting classifier-based ensemble approach to diabetes prediction. The proposed soft
voting classifier attained the overall highest accuracy and F1 score of 0.791 and 0.716,
respectively. Prabhu and Selvabharathi used the open-source Pima Indian diabetes dataset for
predicting diabetes using the deep belief network model. The authors constructed the model in three
phases, that is, data preprocessing using min-max normalization, constructing the network model, and
fine-tuning the test dataset to remove any partiality using NN-FF classification. The authors
carried out all the implementation and simulation of the model in MATLAB. They reported an F1 score
of 0.808, the best performance metric among the classification methods compared.
Predicting diabetes involves using various data points to forecast the likelihood of an individual developing
the condition. This process often involves analyzing historical health data, lifestyle factors, and other
relevant information to create a predictive model.
The aim of these predictive methods is to identify individuals at higher risk of developing diabetes to enable
early intervention and lifestyle modifications. Early identification can lead to proactive measures like
personalized dietary plans, exercise routines, and medication, ultimately helping to prevent or delay the onset
of diabetes and its complications. While machine learning is a powerful tool for solving problems, improving
business operations and automating tasks, it's also a complex and challenging technology, requiring deep
expertise and significant resources. Choosing the right algorithm for a task calls for a strong grasp of
mathematics and statistics. Training machine learning algorithms often involves large amounts of good
quality data to produce accurate results. The results themselves can be difficult to understand -- particularly
the outcomes produced by complex algorithms, such as the deep learning neural networks patterned after the
human brain. And ML models can be costly to run and tune. Still, most organizations either directly or
indirectly through ML-infused products are embracing machine learning. According to the "2023 AI and
Machine Learning Research Report" from Rackspace Technology, 72% of companies surveyed said that AI
and machine learning are part of their IT and business strategies, and 69% described AI/ML as the most
important technology. Companies that have adopted it reported using it to improve existing processes (67%),
predict business performance and industry trends (60%) and reduce risk (53%).
1.1 Module Description:
1. Data Collection:
- Gather a dataset containing relevant features (such as glucose levels, blood pressure, BMI, age, etc.)
and the target variable indicating diabetes presence (1 for diabetic, 0 for non-diabetic).
- You can find datasets on platforms like Kaggle, UCI Machine Learning Repository, or
healthcare repositories.
2. Data Preprocessing:
- Clean the data by handling missing values, outliers, and encoding categorical variables if necessary.
- Split the data into training and testing sets.
3. Feature Selection:
- Identify significant features that influence diabetes prediction.
- Perform feature scaling if required.
4. Model Selection:
- Choose appropriate machine learning algorithms for classification (e.g., Logistic Regression,
Decision Trees, Random Forest, SVM, etc.).
- Train multiple models to see which one performs better with your dataset.
5. Model Training:
- Fit the chosen models on the training data.
6. Model Evaluation:
- Evaluate model performance using metrics like accuracy, precision, recall, F1-score, and ROC
curve analysis.
- Tune hyperparameters to improve model performance through techniques like cross-validation or grid
search.
7. Deployment:
- Once you have a satisfactory model, create an interface (web app, mobile app, or API) for users to
input data and get predictions.
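The module steps above can be sketched end to end in a few lines. The example below is only an illustration under stated assumptions: it uses synthetic data from scikit-learn's `make_classification` in place of a real diabetes dataset, logistic regression as one of the candidate classifiers, and an arbitrary grid of `C` values for the hyperparameter search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1 (stand-in): synthetic data with 8 numeric features and a binary target.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Step 2: hold out a test set, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 3-5: feature scaling and the classifier chained in one pipeline.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 6: grid search with 5-fold cross-validation over the regularization strength C.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_C = search.best_params_["clf__C"]
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("best C:", best_C)
print("test accuracy:", round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means it is refitted on each cross-validation training fold, so the validation folds do not influence the scaling statistics.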
2. PROBLEM STATEMENT
Diabetes is a chronic health condition affecting millions worldwide, characterized by abnormal levels of
glucose in the blood. Early detection and intervention play a crucial role in preventing complications such
as cardiovascular diseases, kidney damage, and vision impairment. This project focuses on leveraging
machine learning models to predict the probability of an individual developing diabetes based on factors
such as:
Glucose levels
Blood pressure
Body Mass Index (BMI)
Age
Family history
Physical activity levels, etc.
Significance and Potential Impact:

Healthcare Perspective:
Early Intervention: Identifying individuals at risk enables early intervention strategies, including lifestyle
modifications and preventive healthcare measures.
Resource Optimization: Healthcare resources can be better allocated by targeting high-risk individuals for
more frequent check-ups and proactive care.
Improved Patient Outcomes: Early detection allows for timely management, potentially reducing the severity
of complications associated with diabetes.
Societal Impact:
Reduced Healthcare Costs: Prevention and early intervention can lead to reduced healthcare expenditures
associated with diabetes-related complications.
Enhanced Quality of Life: Timely identification and management contribute to a better quality of life for
affected individuals.
Public Health Initiatives: Insights from predictive models can inform public health policies and programs
aimed at diabetes prevention and education.
3. DATA DESCRIPTION
When working on a diabetes prediction machine learning project, the dataset used typically contains
various health-related features and a target variable indicating the presence or absence of diabetes.
Dataset Description:
Source of the Dataset:
The dataset might have been obtained from sources like healthcare repositories, research studies, or public
datasets available on platforms like Kaggle, UCI Machine Learning Repository, or healthcare organizations'
databases.
Features Included:
The dataset comprises several features that serve as inputs to the machine learning model for diabetes
prediction. Common features might include:
Glucose Levels: Blood sugar levels measured during fasting or through glucose tolerance tests.
Blood Pressure: Systolic and diastolic blood pressure readings.
BMI (Body Mass Index): A measure indicating body fat based on height and weight.
Age: Age of the individuals in the dataset.
Family History: Presence or absence of diabetes in the family.
Physical Activity: Level of physical activity or exercise habits.
Skin Thickness, Insulin Levels, etc.: Additional health indicators that might contribute to predicting
diabetes risk.
Target Variable:
The target variable in this dataset is typically binary, representing the presence or absence of diabetes. It is
often encoded as:
1: Indicates the presence of diabetes (diabetic).

0: Indicates the absence of diabetes (non-diabetic)
4. TECHNOLOGY
PYTHON
Python is a versatile, high-level programming language known for its simplicity and readability. Created
by Guido van Rossum and first released in 1991, Python emphasizes code readability and a clean syntax,
allowing programmers to express concepts in fewer lines of code compared to other languages. It
supports multiple programming paradigms, including procedural, object-oriented, and functional
programming styles. It is popular, easy to learn, and future-ready.
Another advantage of Python is that it makes switching domains easy. Python offers popular
frameworks such as Django and Flask for backend development, Tkinter for GUI development, and
Pygame for game development, which can open more doors of opportunity in the future. For machine
learning in Python, scikit-learn (sklearn) is essential: it is a modern machine-learning library
written in Python that makes it easy to try out different ML algorithms on your own data.
MACHINE LEARNING MODELS & ALGORITHMS
Based on the methods and way of learning, machine learning is mainly divided into four types,
which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
These ML algorithms help to solve different business problems such as regression,
classification, forecasting, clustering, and association. Commonly used algorithms include:
- Linear Regression
- Logistic Regression
- Linear Discriminant Analysis
- Classification and Regression Trees
- Naive Bayes
- K-Nearest Neighbors (KNN)
- Learning Vector Quantization (LVQ)
- Support Vector Machines (SVM)
- Random Forest
- Boosting
- AdaBoost
PYTHON LIBRARIES
Libraries for Machine Learning in Python:

NumPy: The NumPy library is an important foundational tool for studying machine learning. Many of its
functions are very useful for performing mathematical or scientific calculations. Since mathematics is
the foundation of machine learning, most of the mathematical tasks involved can be performed using
NumPy.
Pandas: Pandas is an open-source Python library made mainly for working with relational or
labeled data both easily and intuitively. It provides various data structures and operations for manipulating
numerical data and time series.
Matplotlib: Matplotlib is a low-level Python library used for data visualization. It is easy to use
and emulates MATLAB-like graphs and visualizations.
Scikit-learn: Offers a wide range of tools for machine learning including classification, regression,
clustering, etc. It's beginner-friendly and well-documented.
TensorFlow & Keras: TensorFlow is an open-source library developed by Google for machine learning;
Keras is an API that runs on top of TensorFlow, simplifying the process of building neural networks.
PyTorch: Developed by Facebook, it's used for applications such as natural language processing and
computer vision. It's known for its flexibility and dynamic computation graphs.
5. DATA PREPROCESSING
5.1 Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from mlxtend.plotting import plot_decision_regions
import missingno as msno
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Here we read the dataset, which is in CSV format:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()
Output:
diabetes_df.columns
Output:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
       'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
diabetes_df.info()
Output:
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
diabetes_df.describe()
Output:
diabetes_df.isnull().head(10)
Output:
diabetes_df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
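Although no nulls are reported, a value of 0 in a column such as Glucose or BMI is physiologically implausible and may stand for a missing measurement. The sketch below shows one way such zeros could be converted to NaN and then imputed. It rests on assumptions that the report does not state: that zeros in the listed columns are placeholders, and that the column median is an acceptable fill value. The small `sample_df` is made up for illustration and is not taken from diabetes.csv.

```python
import numpy as np
import pandas as pd

# Small made-up sample; the values are illustrative only.
sample_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a zero in these columns means "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

cleaned_df = sample_df.copy()
cleaned_df[zero_as_missing] = cleaned_df[zero_as_missing].replace(0, np.nan)

# The hidden gaps now show up in the null count.
missing_counts = cleaned_df.isnull().sum()
print(missing_counts)

# Fill each gap with the median of its own column.
medians = cleaned_df[zero_as_missing].median()
cleaned_df[zero_as_missing] = cleaned_df[zero_as_missing].fillna(medians)
print(cleaned_df)
```

The Outcome column is left out of the list on purpose, since 0 is a legitimate class label there.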
5.2 Data Visualization

Plotting the data distribution plots before removing null values
p = diabetes_df.hist(figsize = (20,20))
Output:
Check the class balance: the number of diabetic patients is roughly half the number of non-diabetic patients:
color_wheel = {1: "#0392cf", 2: "#7bc043"}
colors = diabetes_df["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(diabetes_df.Outcome.value_counts())
p = diabetes_df.Outcome.value_counts().plot(kind="bar")
Output:
0 500
1 268
Name: Outcome, dtype: int64
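The class counts above also give a useful reference point for judging any classifier. The short sketch below, which uses only the two counts printed in the output, computes the class proportions and the accuracy that a trivial model would reach by always predicting the majority class. The variable names are illustrative.

```python
import pandas as pd

# Class counts as reported in the output above.
counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy of a model that always predicts the majority class (non-diabetic).
majority_baseline = proportions.max()

print(proportions.round(3))
print("majority-class baseline accuracy:", round(majority_baseline, 3))
```

Any model's test accuracy is more informative when read against this baseline than on its own.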
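A distribution plot shows how the values of a column are distributed, while a box plot helps
reveal the outliers in that column.
# Distribution plot and box plot of the Insulin column
# (sns.distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is its replacement)
plt.subplot(121)
sns.distplot(diabetes_df['Insulin'])
plt.subplot(122)
diabetes_df['Insulin'].plot.box(figsize=(16,5))
plt.show()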
Output:
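A box plot marks as outliers the points that lie beyond its whiskers, which by default in Matplotlib extend 1.5 times the interquartile range (IQR) past the quartiles. The same rule can be applied numerically, as in the sketch below. The `values` array is made up for illustration and is not the actual Insulin column.

```python
import numpy as np

# Illustrative measurements with one extreme value.
values = np.array([0, 15, 30, 45, 60, 75, 90, 105, 120, 846], dtype=float)

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print("bounds:", lower_bound, upper_bound)
print("outliers:", outliers)
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on whether it is a measurement error or a genuine extreme reading.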
5.3 Correlation between all the features:
plt.figure(figsize=(12,10))
p = sns.heatmap(diabetes_df.corr(), annot=True,cmap ='RdYlGn')
Output:
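Besides the heatmap, the correlation matrix can be read numerically, for example to rank features by the strength of their linear relationship with the target. The sketch below does this on a synthetic frame; `toy_df` and its columns are hypothetical, and the Outcome column is generated so that it depends on Glucose but not on Noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: Outcome is constructed to depend on Glucose only.
glucose = rng.normal(120, 30, size=200)
noise = rng.normal(0, 1, size=200)
outcome = (glucose + rng.normal(0, 30, size=200) > 130).astype(int)
toy_df = pd.DataFrame({"Glucose": glucose, "Noise": noise, "Outcome": outcome})

# Pearson correlation between every pair of columns (the pandas default).
corr = toy_df.corr()

# Rank features by absolute correlation with the target.
ranked = corr["Outcome"].drop("Outcome").abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation captures only linear association, so a low value does not by itself show that a feature is useless to a non-linear model such as a random forest.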
5.4 Scaling the Data:
Before scaling the data:
# diabetes_df_copy is assumed to be a cleaned working copy of diabetes_df; the step that creates it is not shown here
diabetes_df_copy.head()
Output:
After standard scaling:
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(["Outcome"], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                          'DiabetesPedigreeFunction', 'Age'])
X.head()
Output:
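Standard scaling replaces each value x in a column with z = (x - mean) / std, where the mean and standard deviation are computed per column, so every scaled column has mean 0 and standard deviation 1. The sketch below checks this against a manual computation on a small made-up array; `X_small` is illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative feature columns on very different scales.
X_small = np.array([
    [85.0, 26.6],
    [148.0, 33.6],
    [183.0, 23.3],
    [89.0, 28.1],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_small)

# Manual z-score per column; np.std uses the population formula (ddof=0), as StandardScaler does.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(X_scaled.round(3))
print("matches manual z-score:", np.allclose(X_scaled, manual))
```

In a train/test workflow the scaler is usually fitted on the training rows only and then applied to the test rows with `transform`, so that test data does not leak into the scaling statistics.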
6. MODEL BUILDING
Splitting the dataset
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
Split the data into training and testing sets using the train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
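With 768 rows, test_size=0.33 places 254 rows in the test set and 514 in the training set. The sketch below reproduces these sizes on placeholder arrays of the same shape and class counts. It also passes `stratify`, which the split above does not use; that is an optional addition shown here because it keeps the diabetic/non-diabetic ratio nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape and class counts as the dataset (768 rows, 500/268).
X_demo = np.arange(768 * 8).reshape(768, 8)
y_demo = np.array([0] * 500 + [1] * 268)

# Same test_size and random_state as above, plus stratification on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7, stratify=y_demo
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("positive share (train, test):", round(y_tr.mean(), 3), round(y_te.mean(), 3))
```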
Random Forest algorithm
Building the model using RandomForest:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
Check the accuracy of the model on the training dataset:
rfc_train = rfc.predict(X_train)
from sklearn import metrics
print("Accuracy_Score =", format(metrics.accuracy_score(y_train, rfc_train)))
Output:
Accuracy_Score = 1.0
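A training accuracy of 1.0 mostly shows that the forest has memorized the training rows; it says little about performance on unseen data. Cross-validation gives a less optimistic estimate by repeatedly training on part of the data and scoring on the rest. The sketch below illustrates the gap on synthetic data; the dataset, the number of trees, and the 5-fold setting are assumptions chosen for illustration, not the report's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data with some label noise (flip_y).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on the same rows the model was fitted on.
train_accuracy = forest.fit(X_syn, y_syn).score(X_syn, y_syn)

# 5-fold cross-validated accuracy: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(forest, X_syn, y_syn, cv=5, scoring="accuracy")

print("training accuracy:", round(train_accuracy, 3))
print("mean CV accuracy:", round(cv_scores.mean(), 3))
```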
TEST And EVALUATION:
Getting the accuracy score for Random Forest.
predictions = rfc.predict(X_test)
print("Accuracy_Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy_Score = 0.7677165354330708
Decision Tree algorithm
Building the model using DecisionTree:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
TEST AND EVALUATION:
Accuracy score for Decision Tree:
predictions = dtree.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Accuracy Score = 0.7322834645669292
XGBoost classifier
Building the model using XGBoost:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)
Output:
Accuracy score for the XGBoost classifier:
from sklearn import metrics
xgb_pred = xgb_model.predict(X_test)
print("Accuracy Score =", format(metrics.accuracy_score(y_test, xgb_pred)))
Output:
Accuracy Score = 0.7401574803149606
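Accuracy alone does not show how the errors are split between the two classes, which matters when one class is about twice as common as the other. A confusion matrix together with precision, recall (sensitivity), and F1 score gives that breakdown. The sketch below uses small hand-written label lists, not the models' actual predictions, so that every number can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written example: 6 non-diabetic (0) and 4 diabetic (1) cases.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all cases
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN), also called sensitivity
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", round(precision, 3))
print("recall:", recall, "F1:", round(f1, 3))
```

In this example the accuracy is 0.7 even though half of the diabetic cases are missed, which is the kind of pattern accuracy on its own would hide.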
7. RESULT AND EVALUATION
A random set of features from both the head and the tail of the data is used to test whether our model is
good enough to give the right prediction.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
Feeding a data point into the model returns either 1 or 0, i.e. whether the person is predicted to be diabetic or non-diabetic.
rfc.predict([[10,101,76,48,180,32.9,0.171,63]]) # patient at index 763
Output:
array([0], dtype=int64)
Therefore, Random Forest is the best of the three models for this prediction, with a test accuracy_score of
about 0.77, compared with about 0.73 for the Decision Tree and 0.74 for XGBoost.
CONCLUSION
Using these patient records, we were able to build machine learning models (of which random forest
performed best, with a test accuracy of about 0.77) to predict whether or not the patients in the
dataset have diabetes. Along with that, we were able to draw some insights from the data through data
analysis and visualization. The application of machine learning techniques for diabetes prediction
presents a promising avenue for early identification and proactive management of this prevalent health
condition. Through the analysis of health-related parameters such as glucose levels, blood pressure,
BMI, and other factors, this project aimed to predict the likelihood of an individual developing diabetes.
Key Findings:
Model Performance: Several machine learning algorithms were employed and evaluated, with certain
models showcasing higher predictive accuracy and performance metrics.
Feature Importance: Certain features, such as glucose levels and BMI, emerged as significant
indicators in predicting diabetes onset, contributing significantly to the model's predictive power.
Ethical Considerations: Ensuring data privacy, fairness, and transparency in model deployment
remains a critical consideration in utilizing predictive models in healthcare.
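One common way to obtain a feature ranking of the kind mentioned under Feature Importance is the `feature_importances_` attribute of a fitted random forest. The sketch below illustrates this on synthetic data; it is not the analysis behind the finding above. The `toy` frame is hypothetical, and its labels are generated so that they depend on Glucose and BMI but not on the Noise column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_rows = 500

# Synthetic features; only Glucose and BMI influence the label.
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n_rows),
    "BMI": rng.normal(32, 7, n_rows),
    "Noise": rng.normal(0, 1, n_rows),
})
risk = 0.03 * (toy["Glucose"] - 120) + 0.08 * (toy["BMI"] - 32) + rng.normal(0, 0.5, n_rows)
labels = (risk > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(toy, labels)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can be biased toward features with many distinct values, so they are often cross-checked with other methods such as permutation importance.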
Implications:
Early Intervention: The developed model holds potential for early identification of individuals at risk,
facilitating timely intervention and preventive measures.
Healthcare Resource Allocation: Targeting high-risk populations identified by the model can
optimize healthcare resources for better management and improved outcomes.
Public Health Initiatives: Insights derived from this predictive model can inform public health
policies and educational campaigns aimed at diabetes prevention.
Future Directions:
Refinement of Models: Continual improvement and refinement of the predictive models by
incorporating more diverse datasets and exploring advanced algorithms.
Integration with Clinical Practice: Collaboration with healthcare providers to integrate these
predictive models into clinical practice, aiding in patient risk assessment and personalized care.
Longitudinal Studies: Conducting longitudinal studies to track individuals over time can enhance
the accuracy and robustness of predictive models.
BIBLIOGRAPHY
23
The following books and articles were referred to during the analysis and execution phases of the project.
"The Hundred-Page Machine Learning Book" by Andriy Burkov
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
"Fundamentals of Machine Learning for Predictive Data Analytics" by John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy
Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1
WEBLINKS
1. https://www.codewithharry.com/blogpost/complete-ml-roadmap-for-beginners/ - covers the most important machine learning concepts; this tutorial is primarily for new users.
2. https://www.geeksforgeeks.org/machine-learning-projects/ - where I referred to and studied sample machine learning projects.
3. https://www.simplilearn.com/tutorials/machine-learning-tutorial - machine learning course.
4. https://www.geeksforgeeks.org/python-web-scraping-tutorial/ - where I referred to understand the concept of web scraping.