
Prediction of Breast Cancer Using Machine Learning Algorithms - 2nd Review

Prediction of Breast Cancer Using Machine Learning Algorithm

Uploaded by

sathish

PREDICTION OF BREAST CANCER USING MACHINE LEARNING ALGORITHMS

ABSTRACT

Cancer as a whole has become a new normal in the world of disease, especially in the present
generation. Many factors contribute to the risk, such as dietary conditions. Lifestyle also plays a
major role, because many people neglect healthy habits in what they do and eat. Almost no one is
surveying these factors, and this has led to a rapid growth in cases over the past 20 years or more.
Across the varied population of the Americas, and more widely, this has become an unavoidable
circumstance. In particular, women aged 40 and above are more prone to two conditions: urinary tract
infections (UTIs) on the one hand and breast cancer on the other.
Breast cancer has become a frequently researched and feared topic, not only among physicians and
researchers but also among the young. To this day there is no definitive cure for this deadliest of
diseases. As the old saying goes, prevention is better than cure, and this still holds today.
There are many tests and therapies to treat almost every type of cancer, but a cure remains the
biggest open question, as it has been since the beginning. Awareness is being raised by printing
warning signs on the front of cigarette packets, chewing gum and so on, but such measures must be
made mandatory to have a widespread impact.
In this paper, the detection of breast cancer is explained in simple terms, so that a person can get
a preliminary view of the prediction of the disease before seeking professional care. The need to
detect this disease early is a growing concern among people of every nation.
This breast cancer prediction system is mainly aimed at predicting whether, and how aggressively, the
cancer has spread. Given the input, the code indicates whether the patient has cancer or not, and
reports the accuracy of the prediction.

Keywords: Random Forest Classifier, K-Nearest Neighbor (KNN), XGBoost, Regression,
Classification, Mining, Training, Testing

METHODOLOGY

EXISTING SYSTEM

The existing model for customer segmentation is based on the K-means clustering algorithm,
which is a centroid-based clustering method. A suitable value of K, which represents the
number of predefined clusters, is selected for the given dataset. Raw, unlabelled data is
taken as input and divided into clusters repeatedly until the best clusters are found. The
centroid-based algorithm used in this model is efficient but sensitive to initial conditions
and outliers.

PROPOSED SYSTEM

The main aim of this project is to achieve the maximum accuracy, valued at above 95%,
without parameter tuning and without overfitting. To do so, every

OBJECTIVE OF PROJECT

Customer segmentation is the practice of dividing a company's customers into groups that
reflect similarities among the customers in each group. The main objective of segmenting
customers is to decide how to relate to the customers in each segment so as to maximize the
value of each customer to the business.

The emergence of many competitors and entrepreneurs has caused a lot of tension among
competing businesses to find new buyers and keep the old ones. As a result, exceptional
customer service becomes necessary regardless of the size of the business. Furthermore, the
ability of a business to understand the needs of each of its customers helps it provide
targeted customer services and develop customized customer service plans. This
understanding is possible through structured customer service.

SYSTEM ARCHITECTURE

FIG 1: WORKING ARCHITECTURE OF THE MODULE

Data collection

The data used in this project is a set of product reviews collected from credit card
transaction records. This step is concerned with selecting the subset of all available data
that will be used. ML problems start with data, preferably lots of data (examples or
observations) for which the target answer is already known. Data for which the target answer
is already known is called labelled data.

Data pre-processing

Pre-processing consists of three important and common steps:

• Formatting: This is the process of putting the data into a form that is suitable to work
with. The data files should be formatted according to need; the most recommended format
is .csv.
• Cleaning: Data cleaning is a very important procedure in data science, as it constitutes
the major part of the work. It includes removing missing data, resolving inconsistent
category names, and so on. For most data scientists, data cleaning constitutes about 80% of
the work.
• Sampling: This is the technique of analysing subsets of a large dataset, which can provide
better results and help in understanding the behaviour and pattern of the data in an
integrated way.
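
The three pre-processing steps can be sketched as follows. This is a minimal illustration using only the Python standard library; the inline CSV text, its column names and the helper functions load_rows, clean_rows and sample_rows are hypothetical and are not taken from the project's data or code.

```python
import csv
import io
import random

# Hypothetical raw data; the column names and values are made up for illustration.
RAW_CSV = """id,radius,texture,label
1,14.2,20.1,M
2,,18.4,B
3,11.9,17.3,B
4,20.5,,M
5,12.8,15.0,B
6,13.1,19.7,B
"""


def load_rows(text):
    """Formatting: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Cleaning: drop every row that has an empty field."""
    return [row for row in rows if all(value.strip() for value in row.values())]


def sample_rows(rows, k, seed=0):
    """Sampling: draw k rows without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


rows = load_rows(RAW_CSV)
cleaned = clean_rows(rows)
subset = sample_rows(cleaned, k=2)
print(len(rows), len(cleaned), len(subset))
```

Dropping incomplete rows is only one possible cleaning policy; filling in missing values is a common alternative when data is scarce.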

Data visualization

Data visualization is the method of representing data in a graphical and pictorial way.
Data scientists tell a story with the results they derive from analysing and visualizing the
data. One widely used tool is Tableau, which has many features for exploring data and
producing useful results.

Feature extraction

Feature extraction is the process of studying the behaviour and pattern of the analysed data
and drawing out the features for further training and testing. Finally, the models are
trained using the classifier algorithm. I used the classify module of the Natural Language
Toolkit (NLTK) library in Python. Part of the labelled dataset gathered was used for
training; the rest of the labelled data is used to evaluate the models. Machine learning
algorithms were used to classify the pre-processed data. The chosen classifier was Random
Forest. Such algorithms are very popular in text classification tasks.

Evaluation model

Evaluation is an essential part of the model development process. It helps to find the
model that best represents our data and to estimate how well the selected model will work in
the future. Evaluating model performance on the data used for training is not acceptable in
data science, because it can easily produce over-optimistic and overfitted models. To avoid
overfitting, evaluation methods such as hold-out and cross-validation are used to test model
performance. The results are presented in visual form, as graphs of the classified data.
Accuracy is defined as the proportion of correct predictions on the test data. It is
calculated by dividing the number of correct predictions by the total number of
predictions.
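
As a rough illustration of the accuracy calculation and of how cross-validation divides the data, the sketch below uses plain Python. The function names accuracy and k_fold_indices and the example labels are hypothetical; in practice a library routine would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Made-up labels: four of the five predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))

# Each sample lands in the test fold exactly once.
for train, test in k_fold_indices(6, 3):
    print(train, test)
```

Hold-out evaluation corresponds to using a single train/test pair; cross-validation repeats the evaluation once per fold and averages the scores.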

ALGORITHMS USED

Logistic Regression
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False.
However, instead of giving the exact values 0 and 1, it gives probabilistic values that lie
between 0 and 1.
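
The probabilistic output comes from the logistic (sigmoid) function applied to a weighted sum of the inputs. The sketch below shows this for a single sample; the weights, bias, threshold and function names are hypothetical values chosen only for illustration, not a trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 label using a threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical weights and bias, chosen only for illustration.
weights, bias = [0.8, -0.4], -0.1
sample = [2.0, 1.0]
probability = predict_proba(sample, weights, bias)
label = predict_label(sample, weights, bias)
print(probability, label)
```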

3.6.1 Decision Tree

A decision tree is one of the simplest yet most effective visual tools for classification
and prediction in decision making. It takes a root problem or situation and explores all the
possible scenarios related to it on the basis of numerous decisions. Since decision trees are
highly resourceful, they play a crucial role in different sectors. From programming to
business analysis, decision tree examples are everywhere.

3.6.2 Random Forest

A random forest (RF) is an ensemble classifier consisting of many decision trees (DTs),
similar to the way a forest is a collection of many trees. DTs that are grown very deep often
overfit the training data, resulting in a high variation in the classification outcome for a
small change in the input data. They are very sensitive to their training data, which makes
them error-prone on the test dataset. The different DTs of an RF are trained using different
parts of the training dataset. To classify a new sample, the input vector of that sample is
passed down each DT of the forest. Each DT then considers a different part of that input
vector and gives a classification outcome. The forest then chooses the classification with
the most 'votes' (for a discrete outcome) or the average of all trees in the forest (for a
numeric outcome). Since the RF algorithm considers the outcomes from many different DTs, it
can reduce the variance that results from relying on a single DT for the same dataset.
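
The way a forest combines the outputs of its trees can be sketched in a few lines. The per-tree outputs below are made up, and the helper names majority_vote and average_prediction are hypothetical; this shows only the aggregation step, not how the trees themselves are trained.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    # On a tie, Counter.most_common keeps the label that was seen first.
    return counts.most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Return the mean of the trees' numeric outputs."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for one sample.
class_votes = ["malignant", "benign", "malignant", "malignant", "benign"]
print(majority_vote(class_votes))

numeric_outputs = [0.2, 0.4, 0.3, 0.5, 0.1]
print(average_prediction(numeric_outputs))
```

Because a single tree's error is outvoted or averaged away by the others, the combined prediction varies less than that of any individual tree.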

Steps for Implementation:

• Initialise the classifier to be used.

• Train the classifier: All classifiers in scikit-learn use a fit(X, y) method to fit
(train) the model on the given training data X and training labels y.

• Predict the target: Given an unlabelled observation X, predict(X) returns the
predicted label y.

• Evaluate the classifier model.
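
The four steps above can be sketched with scikit-learn as follows. As an assumption, the example uses the breast cancer dataset bundled with scikit-learn as a stand-in, which may differ from the project's own data file, and the classifier settings are illustrative rather than the ones used in the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Step 1: initialise the classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Step 2: train it on the training data and labels.
clf.fit(X_train, y_train)

# Step 3: predict labels for observations the model has not seen.
y_pred = clf.predict(X_test)

# Step 4: evaluate the predictions against the held-out labels.
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Any other scikit-learn classifier can be substituted in step 1 without changing the remaining steps, since they all share the fit and predict methods.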

MODULES

The project contains three parts:

• Dataset collection: We collected the dataset from Kaggle notebooks. The dataset
contains the symptoms and the corresponding disease. It contains 303 rows.

• Train and test the model: We used three classification algorithms, namely Decision
Tree, Logistic Regression, and Random Forest, to train on the dataset. After training,
we tested the models and identified the one that predicts the disease with maximum
accuracy.

• Hyperparameter tuning: Hyperparameters cannot be learned directly from the regular
training process. They are usually fixed before the actual training process begins.
These parameters express important properties of the model, such as its complexity or
how fast it should learn.
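
One common way to choose hyperparameters is a cross-validated grid search. The sketch below is an illustration under stated assumptions: it uses scikit-learn's bundled breast cancer dataset as a stand-in for the project's data, and the parameter grid is arbitrary rather than taken from the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption).
X, y = load_breast_cancer(return_X_y=True)

# Candidate hyperparameter values; this grid is illustrative only.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

# Try every combination and score each one with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Here max_depth controls the complexity of the tree, which is the kind of property that is fixed before training rather than learned from the data.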

The steps to carry out this project (using Jupyter Notebook) are as follows:

A) Collect the dataset.

B) Import the necessary libraries.

C) Visualize the dataset.

D) Train the models on the dataset using LR, KNN, RF and SVM.

E) Test the models and find the accuracy scores.

F) Based on the scores, decide which algorithm is best for prediction.

G) Build a deployment model using Azure, AWS or Heroku.

H) Enter the values and predict the accuracy.

RESULTS AND DISCUSSION


4.1. PERFORMANCE ANALYSIS

In terms of performance, each statement of the code runs in roughly one second. Duplicate
and near-duplicate records can also be removed efficiently. The performance of a predictive
model is calculated and compared by choosing the right metrics. It is therefore crucial to
choose the right metrics for a particular predictive model in order to get an accurate
assessment. Proper evaluation is important because various kinds of datasets are going to be
used with the same predictive model.

Algorithm             Precision   Recall   F-measure   Accuracy

KNN                   0.845       0.823    0.835       89.62%

Logistic Regression   0.857       0.882    0.869       92.25%

Random Forest         0.867       0.882    0.909       86.16%

SVM                   0.837       0.911    0.873       88.25%

Fig 3: Accuracies Obtained
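
For reference, precision, recall and F-measure are derived from the counts of true positives, false positives and false negatives. The sketch below shows the standard formulas; the counts are hypothetical and are not the ones behind the figures reported above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 40, 10, 5
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```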

CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENTS

Nothing should go unnoticed. Symptoms should be checked before the disease arrives
unnoticed. Prevention is, was and always will be better than cure.
Cancer is among the most brutal things a person can experience in their lifetime, but if it
is found early, it can be handled and the person can reach remission.
In this paper, we have examined the outcome of almost every machine learning algorithm
considered and concluded that, whatever the algorithm, pre-processing, training and testing
are clearly needed to achieve the maximum accuracy, not just in this breast cancer module
but in every module.
Using this code, one can easily estimate whether a person may have breast cancer and can
then consult a hospital about further action. The results show that graphical representation
and attribute filtering at successive levels increased the accuracy by almost 6% in our case.
The highest accuracy obtained here was almost 97%, achieved using the Random Forest
algorithm. With proper cleaning, almost every algorithm can reach at least 90 percent, and
among them Random Forest stands out.
Physical diagnosis has become a very profitable business nowadays. Even the slightest help
from a machine can save a great deal of money for someone in any corner of the world. In
this way machine learning has provided a significant breakthrough, not only in the medical
field but in every other field too. Random Forest not only gives the best result here but
also stays stable throughout, which makes it suitable for use in other projects as well.


APPENDICES

SOURCE CODE

# Import the libraries used for data handling and plotting
import numpy
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the dataset and inspect it
df = pd.read_csv("data.csv")
df.head()
df.info()
df.isna().sum()
df.shape

# Drop columns that contain missing values
df = df.dropna(axis=1)
df.shape
df.describe()

# Count and plot the diagnosis classes
df['diagnosis'].value_counts()
sns.countplot(x=df['diagnosis'])

# Encode the diagnosis labels as integers; replacing the whole column
# (assumed to be column 1) gives it an integer dtype
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'].values)

# Correlation between columns, with a heat map for the first few
df.iloc[:, 1:32].corr()
plt.figure(figsize=(10, 10))
sns.heatmap(df.iloc[:, 1:10].corr(), annot=True, fmt=".0%")

# Split into features (X) and target (Y), column ranges as in the original
X = df.iloc[:, 2:31].values
Y = df.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)

# Fit the scaler on the training data only, then apply it to the test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def models(X_train, Y_train):
    # Train three classifiers and print their accuracy on the training data
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)

    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(random_state=0, criterion='entropy')
    tree.fit(X_train, Y_train)

    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(random_state=0, criterion="entropy", n_estimators=10)
    forest.fit(X_train, Y_train)

    print('[0]logistic regression accuracy:', log.score(X_train, Y_train))
    print('[1]Decision tree accuracy:', tree.score(X_train, Y_train))
    print('[2]Random forest accuracy:', forest.score(X_train, Y_train))
    return log, tree, forest

model = models(X_train, Y_train)

# Evaluate each model on the held-out test data
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
for i in range(len(model)):
    print("Model", i)
    print(classification_report(Y_test, model[i].predict(X_test)))
    print('Accuracy : ', accuracy_score(Y_test, model[i].predict(X_test)))

SCREENSHOTS

B-1: DATASET

B-2: COUNTPLOT

B-3: PAIRPLOT

B-4: HEAT MAP

B-5: REPORT GENERATION

B-6: DETAILED COMPARISON OF THE ACCURACIES

B-7: CONSTRUCTING THE WEB APPLICATION (UI)
