Prediction of Breast Cancer Using Machine Learning Algorithms - 2nd Review
ABSTRACT
Cancer as a whole has become a new normal among diseases, especially in the current
generation. Many factors contribute to the risk, such as dietary conditions. Lifestyle also plays a
major role, because many people neglect to act, eat or live in a healthy way. Almost no one is
surveying these factors, and this has led to a rapid growth in the number of cases over the past 20 years or more. In
the varied population of the Americas, and more widely, this has become an inevitable circumstance. In
this context, women aged 40 and above are more prone to two inexorable conditions: urinary tract
infections (UTIs) on the one hand and breast cancer on the other.
This has become a frequently researched and frightening topic not only among physicians and researchers but
among the young population too. To this day there is no definitive cure for this deadliest of
diseases. As the old saying goes, prevention is better than cure, and this still holds today.
There are many kinds of tests and therapies to treat almost every type of cancer, but a cure remains
the biggest open question, as it has been from the start. Awareness is being created by
printing warning signs on the front of cigarette packets, chewing gums, etc., but such measures must be made
mandatory to create a widespread impact.
In this paper, the detection of breast cancer is explained in simple terms, so that a person can get a
first view of the prediction of the disease before seeking professional help. The need to detect this disease early
is, of course, a growing concern among people of every nation.
This breast cancer prediction system is mainly aimed at predicting whether, and how severely, the
cancer has spread. From the given input, the code predicts whether or not the patient has cancer
and reports the accuracy of the prediction.
EXISTING SYSTEM
The existing model for customer segmentation is based on the K-means
clustering algorithm, which is a centroid-based clustering method. A suitable value of K,
the predefined number of clusters, is selected for the given dataset. Raw,
unlabelled data is taken as input and divided into clusters iteratively until the best clusters
are found. The centroid-based algorithm used in this model is efficient but sensitive to initial
conditions and outliers.
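The sensitivity to outliers described above can be illustrated with a minimal sketch of the K-means loop, simplified here to one-dimensional points. The function name k_means_1d, the sample points and the starting centroids are illustrative assumptions and are not taken from the existing system.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain K-means on 1-D points: assign each point, then update the centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data: two tight groups, then the same data plus one outlier
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
clean_centroids, clean_clusters = k_means_1d(points, centroids=[0.0, 6.0])
noisy_centroids, noisy_clusters = k_means_1d(points + [50.0], centroids=[0.0, 6.0])

print(clean_clusters)
print(noisy_clusters)
```

In this made-up example, adding the single outlier 50.0 pulls the second centroid away from the group around 5.0, so the two original groups end up merged in one cluster while the outlier occupies the other.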
PROPOSED SYSTEM
The main proposal of this project is to obtain the maximum accuracy, valued at
above 95%, without parameter tuning and without overfitting. To do so, every
OBJECTIVE OF PROJECT
Customer segmentation is the practice of dividing a company's customers into groups that
reflect similarities among the customers in each group. The main objective of segmenting
customers is to decide how to relate to the customers in each segment so as to maximize the value of
each customer to the business.
The emergence of many competitors and entrepreneurs has created strong pressure among
competing businesses to find new buyers and keep existing ones. As a result,
exceptional customer service becomes necessary regardless of the
size of the business. Furthermore, the ability of a business to understand the needs of each
of its customers allows it to provide targeted customer
services and to develop customized customer service plans. This understanding is possible
through structured customer service.
SYSTEM ARCHITECTURE
Data collection
Data used in this project is a set of product reviews collected from credit card transaction
records. This step is concerned with selecting the subset of all available data that will be
worked with. ML problems start with data, preferably lots of data (examples or
observations), for which the target answer is already known. Data for which the target
answer is already known is called labelled data.
Data pre-processing
behaviour and pattern of data in an integrated way
Data visualization
Data visualization is the method of representing data in a graphical and pictorial way.
Data scientists tell a story through the results they derive from analysing and visualizing the data.
A widely used tool is Tableau, which has many features for exploring data and producing
useful results.
Feature extraction
Feature extraction is the process of studying the behaviour and patterns of the analysed data
and drawing out the features for further training and testing. Finally, my models are trained using
the classifier algorithm. I used the classify module of the Natural Language Toolkit (NLTK) library in
Python. I used the labelled dataset gathered for training, and the rest of my labelled data is used to
evaluate the models. Machine learning algorithms were used to classify the pre-processed
data. The chosen classifier was Random Forest, an algorithm that is very popular in text
classification tasks.
Evaluation model
Evaluation is an essential part of the model development process. It helps to find the
model that best represents our data and to estimate how well the selected model will work in the future.
Evaluating model performance on the data used for training is not acceptable in data
science, because it can easily produce overoptimistic and overfitted models. To
avoid overfitting, evaluation methods such as hold-out and cross-validation are used to test
model performance. The result is presented in visual form, as graphs of the
classified data.
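The two evaluation methods named above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: it uses the breast cancer dataset bundled with scikit-learn as a stand-in for the project's own data file, a Random Forest with 10 trees, a 20% hold-out set and 5 folds, none of which is prescribed by this section.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data
X, y = load_breast_cancer(return_X_y=True)

# Hold-out: keep 20% of the rows aside and never train on them
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
holdout_accuracy = clf.score(X_test, y_test)

# Cross-validation: 5 folds, each fold serves once as the test set
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0), X, y, cv=5
)

print("Hold-out accuracy:", holdout_accuracy)
print("Cross-validation accuracies:", cv_scores)
print("Mean cross-validation accuracy:", cv_scores.mean())
```

Cross-validation gives one score per fold, so the spread of the scores also indicates how stable the model is across different splits of the data.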
Accuracy is defined as the proportion of correct predictions on the test data. It is
calculated by dividing the number of correct
predictions by the total number of predictions.
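As a small worked example of this definition, the sketch below counts the correct predictions and divides by the total. The function name accuracy and the label lists are hypothetical and serve only to illustrate the arithmetic.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions match the true labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 / 8 = 0.75
```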
ALGORITHMS USED
Logistic Regression
Logistic regression predicts the output of a categorical dependent variable, so the
outcome must be a categorical or discrete value such as Yes or No, 0 or 1, True or
False. However, instead of giving the exact value 0 or 1, it gives a probability
that lies between 0 and 1.
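The way a probability between 0 and 1 is produced can be sketched with the logistic (sigmoid) function applied to a weighted sum of the inputs. The weights, bias, feature values and the 0.5 threshold below are made-up values for illustration, not parameters fitted in this project.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a discrete 0/1 outcome."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical model parameters and one hypothetical sample
weights = [0.8, -0.4]
bias = -0.1
sample = [1.5, 0.5]

probability = predict_probability(sample, weights, bias)
label = predict_label(sample, weights, bias)
print(probability, label)
```

The threshold of 0.5 is a common default; a different threshold trades off false positives against false negatives.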
Decision Tree
A decision tree is one of the simplest yet most effective visual tools for classification and
prediction in decision making. It starts from a root problem or situation and explores the
possible scenarios related to it on the basis of a sequence of decisions. Because decision trees are
so versatile, they play a crucial role in many sectors, from programming to
business analysis.
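The idea of starting at a root and following decisions down to an outcome can be sketched with a hand-written tree stored as nested dictionaries. The feature names, thresholds and class labels are hypothetical placeholders; in practice a library such as scikit-learn learns them from the training data.

```python
# A tiny hand-built tree (hypothetical features and thresholds).
# Internal nodes test one feature; leaf nodes carry a class label.
tree = {
    "feature": "feature_a",
    "threshold": 0.5,
    "left": {"label": "class_0"},
    "right": {
        "feature": "feature_b",
        "threshold": 2.0,
        "left": {"label": "class_0"},
        "right": {"label": "class_1"},
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, going left when the value is <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


print(classify(tree, {"feature_a": 0.2, "feature_b": 3.0}))  # class_0
print(classify(tree, {"feature_a": 0.9, "feature_b": 3.0}))  # class_1
```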
Random Forest
A random forest (RF) is an ensemble classifier consisting of many decision trees (DTs), similar to the
way a forest is a collection of many trees. DTs that are grown very deep often
overfit the training data, resulting in a high variation in the classification outcome for a small
change in the input data. They are very sensitive to their training data, which makes them
error-prone on the test dataset. The different DTs of an RF are trained using different parts
of the training dataset. To classify a new sample, the input vector of that sample is
passed down each DT of the forest. Each DT then considers a different part of that input
vector and gives a classification outcome. The forest then chooses the classification having
the most 'votes' (for a discrete classification outcome) or the average of all trees in the forest
(for a numeric outcome). Since the RF algorithm considers the outcomes from
many different DTs, it can reduce the variance that results from relying on a single DT
for the same dataset.
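The final combination step described above can be sketched as follows, assuming each tree has already produced its own outcome. The function names and the per-tree outputs are hypothetical; the sketch covers only the voting and averaging, not the training of the trees.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Discrete outcome: the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_outputs):
    """Numeric outcome: the mean of the outputs of all trees."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for one sample
votes = ["malignant", "benign", "malignant", "malignant", "benign"]
forest_label = majority_vote(votes)

numeric_outputs = [0.9, 0.4, 0.8, 0.7, 0.2]
forest_value = average_prediction(numeric_outputs)

print(forest_label)
print(forest_value)
```

Because the vote pools several trees, a single tree that overfits its part of the data is outvoted as long as most of the other trees disagree with it.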
Steps for Implementation:
Train the classifier: All classifiers in scikit-learn use a fit(X, y) method to fit (train) the
model on the given training data X and training labels y.
Predict the target: Given an unlabelled observation X, predict(X) returns the
predicted label y.
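The two steps above can be sketched on a toy dataset. The four training rows and the two new observations are made up for illustration, and a decision tree is used only as one example of a scikit-learn classifier.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative only): one feature, two classes
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

# Step 1: train the classifier with fit(X, y)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Step 2: predict labels for unlabelled observations with predict(X)
X_new = [[0.2], [2.8]]
predicted = clf.predict(X_new)
print(predicted)
```

The same fit and predict calls apply to the other scikit-learn classifiers, so the model can be swapped without changing the surrounding code.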
MODULES
Dataset Collection- We collected the dataset from Kaggle notebooks. The dataset
contains the symptoms and the corresponding disease. It contains 303 rows.
Train and test the model- We used three classification algorithms, namely
Decision Tree, Logistic Regression and Random Forest, to train on the dataset. After
training, we tested the models and identified the one that predicts the disease with maximum
accuracy.
Coming to performance, the code runs at a rate of about 1 second per statement.
Duplicated and near-duplicate records can also be removed efficiently. The
performance of a predictive model is calculated and compared by choosing the right metrics,
so it is crucial to choose the right metrics for a particular predictive model in order to
get an accurate outcome. It is also important to evaluate predictive models properly, because
various kinds of datasets are going to be used with the same predictive model.
Algorithm        Precision   Recall   F-measure   Accuracy
Random Forest    0.867       0.882    0.909       86.16%
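For reference, the metrics in the table header can be computed from the counts of true positives, false positives and false negatives, with the F-measure taken here as the harmonic mean of precision and recall (the F1 score). The function name and the counts below are hypothetical and are not the counts behind the reported results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 45 true positives, 5 false positives, 15 false negatives
precision, recall, f1 = precision_recall_f1(tp=45, fp=5, fn=15)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```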
CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENTS
Nothing should go unnoticed. Symptoms should be checked before the disease arrives
unnoticed. Prevention is, was and always will be better than cure.
Cancer is among the most brutal things a person can experience in a lifetime, but if it is found
early it can be handled and the person can see it through to remission.
In this paper, we have examined the outcomes of several machine learning
algorithms and came to the conclusion that, whatever the algorithm, pre-processing,
training and testing are clearly needed to achieve the maximum accuracy, not just in this
breast cancer module but in every module.
Using this code, one can easily estimate whether a person may have breast
cancer and then consult a hospital about further action. The
results show that graphical representation and attribute filtering in
successive levels increased the accuracy by almost 6% in our case.
The highest accuracy obtained here was almost 97%, achieved using the
Random Forest algorithm. With proper data cleaning, almost every algorithm can
reach at least 90 percent, and among these Random Forest stands out.
Physical diagnosis has become a lucrative business nowadays. Even the slightest help
from a machine can save a great deal of money for someone in any corner of the
world. In this way, machine learning has provided a significant breakthrough not only in the
medical field but in every other field too. Random Forest not only gives the best result here but
also stays stable throughout the code, which makes it a relevant choice
for other applications too.
APPENDICES
SOURCE CODE
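import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load and inspect the dataset
df = pd.read_csv("data.csv")
df.head()
df.info()
df.isna().sum()
df.shape
# Drop columns that contain missing values
df = df.dropna(axis=1)
df.shape
df.describe()
# Class balance of the target column
df['diagnosis'].value_counts()
sns.countplot(x=df['diagnosis'])
# Encode the diagnosis labels as integers (diagnosis is assumed to be column 1)
labelencoder_Y = LabelEncoder()
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])
# Correlations between columns
df.iloc[:, 1:32].corr()
plt.figure(figsize=(10, 10))
sns.heatmap(df.iloc[:, 1:10].corr(), annot=True, fmt=".0%")
# Features and target
X = df.iloc[:, 2:31].values
Y = df.iloc[:, 1].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=0)
# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

def models(X_train, Y_train):
    # Logistic regression (these two lines are missing from the listing and
    # are reconstructed here with default settings, which is an assumption)
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    # Decision tree
    tree = DecisionTreeClassifier(random_state=0, criterion='entropy')
    tree.fit(X_train, Y_train)
    # Random forest
    forest = RandomForestClassifier(random_state=0, criterion="entropy", n_estimators=10)
    forest.fit(X_train, Y_train)
    return log, tree, forest

model = models(X_train, Y_train)
# Report the metrics of each model on the test set
for i in range(len(model)):
    print("Model", i)
    print(classification_report(Y_test, model[i].predict(X_test)))
    print('Accuracy : ', accuracy_score(Y_test, model[i].predict(X_test)))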
SCREENSHOTS
B-1: DATASET
B-2: COUNTPLOT
B-3: PAIRPLOT
B-5: REPORT GENERATION
B-7: CONSTRUCTING THE WEB APPLICATION (UI)