Machine Learning-Based Breast Cancer Detection
UNIVERSITY OF BUEA
STUDY
By
Supervisor Co-Supervisor
April 2023
DEDICATION
CERTIFICATION
This dissertation, submitted to the Faculty of Engineering and Technology of the University of Buea in partial fulfillment of the requirements for the degree in Networks, has been read, examined and approved by the examination panel composed of:
This dissertation has been accepted by the Faculty of Engineering and Technology.
ACKNOWLEDGEMENTS
This dissertation could not have been possible without the help of so many
people, to whom I would like to express my gratitude for their help and support.
I thank the University of Buea, the Faculty of Engineering and Technology and all
our Professors, Doctors and Lecturers who gave me insight throughout the project. I
particularly thank my supervisors, Dr. FOZIN Theophile and Dr. NGWASHI Divine.
To Prof. TANYI Emmanuel, Prof. TSAFACK Pierre, Dr. SITAMZE Bertrand, Dr.
NKEMENI Valery, Dr. FENDJI Danielle, Dr. TENE, and so many more, I say thanks
for their constant support and remarks, which helped me through the dissertation.
I also thank Dr. MANGA at the Radiology Department of the Douala General
Hospital. I also thank my fellow mates who chipped in with advice from time to time, as
we all moved on together: Forka Da-Silva, Samba Shanice, Ngoran Bedes, Nchunga
Florine, and many others.
I thank the TATAP family, my dad TATAP Jean, my mum TATAP Marie, my
brothers and sisters – Joel, Peter, Prince and Olive, and also the SIAKET family, who
helped me with all that was necessary for the success of this project. Their constant love
was a strong driving force that helped me remember my goal and complete this project.
Finally, and most importantly, I thank God Almighty for his mercies throughout my
dissertation.
ABSTRACT
The detection of breast cancer is a crucial task that even the most seasoned doctors
cannot perform with one hundred percent accuracy, and late detection of the tumor
causes a significant number of deaths each year. Fortunately, the introduction of
artificial intelligence has helped address this problem in the field. The aim of our study
is to develop a more accurate way of diagnosing breast cancer using machine learning.
Our solution uses artificial intelligence methods, namely Convolutional Neural
Networks, to diagnose breast cancer. The algorithm is implemented in a Python
environment. The results show that this method is more efficient than the other
techniques considered, achieving an accuracy above 97%. Moreover, a web app is
deployed for user-friendliness, allowing subsequent patient images to be classified. This
solution could be of great use in the field of medicine for early and accurate detection of
the tumor, and goes a long way to show that technology can revolutionize the way we live.
TABLE OF CONTENTS
DEDICATION...........................................................................................................ii
ACKNOWLEDGEMENTS ........................................................................................iv
ABSTRACT .............................................................................................................. v
LIST OF TABLES..................................................................................................... x
LIST OF ABBREVIATIONS....................................................................................xii
CHAPTER ONE
GENERAL INTRODUCTION
CHAPTER TWO
LITERATURE REVIEW
CHAPTER THREE
CHAPTER FOUR
4.4 Discussion of Results and Comparative Analysis with Previous Works .... 50
CHAPTER FIVE
GENERAL CONCLUSION
References ............................................................................................................... 56
Appendices .............................................................................................................. 61
LIST OF FIGURES
LIST OF TABLES
Table 4.1 Comparative analysis of results with previous works on deep learning for breast cancer detection
LIST OF ABBREVIATIONS
AI Artificial Intelligence
BC Breast Cancer
DL Deep Learning
DT Decision Tree
FN False Negative
FP False Positive
LR Logistic Regression
ML Machine Learning
NB Naïve Bayes
RF Random Forest
TF TensorFlow
TN True Negative
TP True Positive
CHAPTER ONE
GENERAL INTRODUCTION
Breast cancer is one of the most lethal and heterogeneous diseases of the present
era, causing the death of an enormous number of women all over the world [1]. Breast
cancer (BC) is the most common cancer in women, affecting about 10% of all women at
some stage of their life. In recent years the incidence rate has kept increasing, and data
show that the survival rate is 88% five years after diagnosis and 80% ten years after
diagnosis [1]. Early prediction of breast cancer is therefore one of the most crucial tasks
in the follow-up process. Breast cancer is also the second leading cause of cancer-related
death in women. Tumors can be benign or malignant (cancerous): benign tumors tend to
grow slowly and do not spread, while malignant tumors can grow rapidly, invade and
destroy nearby normal tissue, and spread throughout the body. Breast cancer arises when
fatty and fibrous breast tissue begins to grow abnormally; the cancer cells then spread
from the tumor to other parts of the body.
Figure 1.1 shows the various types of breast cancer that exist. The different types
are distinguished by which cells are affected and how they spread through the body.
Ductal Carcinoma in Situ (DCIS) is a type of breast cancer in which abnormal cells are
found in the lining of a duct but have not spread outside it; it is also known as
non-invasive cancer. The second type is Invasive Ductal Carcinoma (IDC), also known
as infiltrating ductal carcinoma. This type of cancer occurs when the abnormal duct
cells spread into the surrounding breast tissue; IDC is also the type usually found in
men. Mixed Tumor Breast Cancer (MTBC) is the third type of breast cancer and is also
known as invasive mammary breast cancer; it arises from both abnormal duct cells and
lobular cells. The fourth type is Lobular Breast Cancer (LBC), which occurs inside the
lobules and increases the risk of other invasive cancers. Mucinous Breast Cancer
(MBC) is the fifth type; it arises from invasive ductal cells and is also known as colloid
breast cancer, occurring when abnormal tissue spreads around the duct. Inflammatory
Breast Cancer (IBC) is the last type; it causes swelling and reddening of the breast and
is fast-growing, developing when lymph vessels in the breast skin become blocked.
It is found that most women who have breast cancer signs and symptoms will initially
notice only one or two, and some people have no signs or symptoms at all. Common
warning signs include:
• A lump or thickening in or near the breast or in the underarm (armpit) area; [2]
• Skin redness;
• Dimpling or puckering;
• Fluid, other than breast milk, from the nipple, especially if it's bloody; [2]
• Scaly, red or swollen skin on the breast, nipple or areola (the dark area of skin
surrounding the nipple).
Breast ultrasound: A machine that uses sound waves to make pictures, called sonograms,
of areas inside the breast. If an area of the breast looks abnormal on a screening
mammogram, doctors may have you get an ultrasound.
Breast magnetic resonance imaging (MRI): A kind of body scan that uses a magnet
linked to a computer. The MRI scan will make detailed pictures of areas inside the
breast.
Biopsy: This is a test that removes tissue or fluid from the breast to be looked at
under a microscope for further testing. There are different kinds of biopsies.
Many areas of life now use AI in various ways. AI- and ML-powered software and
devices mimic human thought patterns to facilitate the digital transformation of
society. AI systems perceive their environment, deal with what they perceive, solve
problems and act to help with tasks that make everyday life easier. The following are
some common applications:
• Voice Assistants: Digital assistants like Siri, Google Home, and Alexa use AI-backed
Voice User Interfaces (VUI) to process and decipher voice commands. AI gives these
applications the freedom to not rely solely on voice commands but also to leverage vast
amounts of language data. [3]
• Entertainment Streaming Apps: Streaming giants like Netflix, Spotify, and Hulu are
continually feeding data into machine learning algorithms to make the user experience
seamless. [3]
• Smart Input Keyboards: The latest versions of mobile keyboard apps use AI to
provide autocorrection and word prediction, improving the typing experience. [3]
• Navigation and Travel: The work of AI programmers behind navigation apps like
Google Maps and Waze never ends; these services are fed by yottabytes of continuously
updated geographical data. [3]
• Security and Surveillance: It is nearly impossible for a human being to keep a constant
eye on too many monitors of a CCTV network at the same time. So, naturally, we have
felt the need to automate such surveillance tasks and further enhance them by leveraging
AI. [3]
• Internet of Things: The confluence of AI and the Internet of Things (IoT) opens up a
world of smart devices that need minimal human interference to operate. While IoT
deals with devices interacting with the internet, the AI part helps these devices learn
from data. [3]
• Facial Recognition: AI powers the Face ID unlock feature in most flagship
smartphone models today. The biggest challenge faced by this technology is widespread
concern around racial and gender bias in its use. [3]
Given what has been said in Section 1.1, the inaccuracy of breast cancer diagnosis
using traditional methods has proven to be considerable, and so there is a need for a
more reliable approach.
From this problem statement, we set our research question as: how can the accuracy of
breast cancer diagnosis be improved relative to traditional methods?
1.5 Objectives
General objective
The main objective of this work is to increase the accuracy of breast cancer detection
using machine learning.
Specific objectives
• Design a highly accurate, low-error-rate machine learning technique.
• Develop a web app capable of performing subsequent predictions on patient data.
After this introduction, our work is organized as follows. Chapter 2 gives the state of
the art and the previous works related to the project. It presents the major contributions
made by other scientists, researchers and engineers, the results they obtained, and the
limitations of their studies. It is from this chapter that a foreknowledge of the field is
gained. Chapter 3 then focuses on the methodology used for the project. Here, the
project methodology is explained, giving the steps taken to build and train the machine
learning model. In this chapter we see how the machine learning algorithm is trained,
how the main KPIs of the model are generated, and how the web app development is
performed. Chapter 4 presents the results of the trained model, the web app developed,
the performance indices given by the confusion matrix and the ROC curve, and also a
comparative analysis of the results with those of previous works. Finally, Chapter 5
concludes the work with a summary of the findings and recommendations for future work.
CHAPTER TWO
LITERATURE REVIEW
2.1 Introduction
This chapter presents recent research work and contributions in the field of breast
cancer detection with machine learning techniques, and explains the various methods
used. Machine learning (ML) uses algorithms to build systems that have the ability to
automatically learn and improve from experience. Deep learning is a subset of
machine learning and artificial intelligence (AI) that imitates the way humans gain
certain types of knowledge. While traditional machine learning algorithms are linear,
deep learning algorithms are stacked in a hierarchy of increasing complexity and
abstraction.
In its most basic sense, machine learning uses programmed algorithms that learn from
input data in order to predict output values within an acceptable range. As new data are
fed in, these algorithms tend to make more accurate predictions. Although there are
some variations in how machine learning algorithms are grouped, they can be divided
into three broad categories according to their purpose and the way the underlying
machine is taught: supervised, unsupervised and semi-supervised. There also exists a
fourth category, reinforcement learning.
In this type of algorithm, a model gains knowledge from data with predefined labels: it
is trained on examples with both input and expected output, so that it can compare its
output with the intended output. Classification is one of the standard formulations of the
supervised learning task, where the data are mapped into classes after looking at
numerous input-output examples. The model learns from a given dataset consisting of
multiple samples along with their corresponding classes. This approach can be used
both for decision trees and artificial neural networks. In decision trees, it can be used to
determine which attributes of the given data provide the most relevant information. In
artificial neural networks, the models are trained on the given dataset and then used to
classify new samples.
1 Logistic Regression
Logistic regression (LR) is a form of regression that models a dichotomous variable,
which usually represents the probability that an instance belongs to a certain class. Since
it is a probability, the outcome lies between 0 and 1, and a threshold is used
to differentiate two classes. For example, a probability value higher than 0.50 for an
input instance assigns it to class 1; otherwise, the instance is assigned to class 0.
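As a concrete sketch of this thresholding idea, the snippet below fits scikit-learn's logistic regression to the library's bundled copy of the Wisconsin breast cancer dataset; the split ratio, seed and `max_iter` value are illustrative choices, not the settings used in this work:

```python
# Illustrative sketch: logistic regression with an explicit 0.50 threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_iter is raised so the solver converges on this unscaled data
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# predict_proba gives a probability in [0, 1]; 0.50 separates the classes
probs = clf.predict_proba(X_test)[:, 1]
preds = (probs > 0.5).astype(int)
accuracy = (preds == y_test).mean()
print(f"Test accuracy: {accuracy:.3f}")
```

Moving the threshold away from 0.50 trades false positives against false negatives, which is exactly the trade-off the ROC curve of Chapter 4 summarizes.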
The support vector machine (SVM) algorithm can classify both linear and non-linear
data. It first maps each data item into an n-dimensional feature space, where n is the
number of features. It then identifies the hyperplane that separates the data items into
two classes while maximizing the marginal distance for both classes and minimizing the
classification errors. The marginal distance for a class is the distance between the
decision hyperplane and its nearest instance that is a member of that class. Figure 2.2
shows an illustration of the support vector machine: the SVM has identified a
hyperplane (here, a line) which maximizes the separation between the 'star' and 'circle'
classes. More formally, each data point is first plotted as a point in an n-dimensional
space (where n is the number of features), with the value of each feature being the value
of a specific coordinate. To perform the classification, we then need to find the
hyperplane that best separates the two classes.
Figure 2. 2: A simplified illustration of how the support vector machine works [5]
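To make the 'star'/'circle' picture concrete, the following sketch fits scikit-learn's `SVC` with a linear kernel on two synthetic clusters; the cluster positions and labels are made-up assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
stars = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))      # 'star' class
circles = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # 'circle' class
X = np.vstack([stars, circles])
y = np.array([1] * 20 + [0] * 20)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the instances nearest the separating hyperplane;
# the margin is maximized with respect to these points only.
print("number of support vectors:", len(clf.support_vectors_))
print(clf.predict([[1.5, 1.5], [-1.5, -1.5]]))  # one query from each side
```

Only the support vectors determine the hyperplane, which is why SVMs can stay compact even on larger datasets.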
Decision tree (DT) is one of the earliest and most prominent machine learning
algorithms. A decision tree models the decision logic, i.e., the tests and corresponding
outcomes, for classifying data items into a tree-like structure. The nodes of a decision
tree normally have multiple levels, where the first or top-most node is called the root
node. All internal nodes (i.e., nodes having at least one child) represent tests of the
input variables or attributes. Figure 2.3 shows an illustration of a decision tree. Each
variable (C1, C2, and C3) is represented by a circle, and the decision outcomes (Class A
and Class B) are shown by rectangles. Each branch is labelled either 'True' or 'False'
based on the outcome value of the test at its ancestor node. Depending on the test
outcome, the classification algorithm branches towards the appropriate child node,
where the process of testing and branching repeats until it reaches a leaf node. The leaf
or terminal nodes correspond to the decision outcomes. DTs are easy to interpret and
quick to learn, and they are a common component of many medical diagnostic
protocols. When traversing the tree for the classification of a sample, the outcomes of
all tests at each node along the path provide sufficient information to infer its class.
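A decision tree of this kind can be grown and inspected in a few lines with scikit-learn; the depth limit and dataset here are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
# A shallow tree: the root node holds the most informative attribute test,
# and each branch is followed down to a leaf that gives the class decision.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# export_text prints the root test and the True/False branching described above
print(export_text(tree, feature_names=list(data.feature_names)))
```

The printed rules are what makes DTs so readable in medical protocols: each path from root to leaf is a human-checkable diagnostic rule.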
A random forest (RF) is an ensemble classifier consisting of many DTs, similar
to the way a forest is a collection of many trees. DTs that are grown very deep often
overfit the training data, so a small change in the input data can result in a very
different classification outcome. They are very sensitive to their training data, which
makes them error-prone on the test dataset. The different DTs of an RF are therefore
trained on different random subsets of the training data.
Figure 2.4 shows an illustration of the RF algorithm with three different decision
trees, each trained on a random subset of the training data. To classify a new sample,
its input vector is passed down each DT of the forest. Each DT then considers a
different part of that input vector and gives a classification outcome. The forest then
chooses the class with the most 'votes' (for a discrete classification outcome) or the
average over all trees (for a numeric classification outcome). Since the RF algorithm
considers the outcomes of many different DTs, it can reduce the variance of the
resulting classification.
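The voting idea can be sketched with scikit-learn's `RandomForestClassifier`; the forest size, seed and 10-fold protocol below are illustrative settings, again on the bundled Wisconsin dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 100 trees is fit on a bootstrap sample and considers random
# feature subsets; the forest's prediction is the majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f}")
```

Averaging over many decorrelated trees is what tames the variance that a single deep tree would exhibit.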
Naïve Bayes (NB) is a classification technique based on Bayes' theorem, which
describes the probability of an event based on prior knowledge of conditions related to
that event. This classifier assumes that a particular feature in a class is not directly
related to any other feature, although features for that class could have some
dependence among themselves. To demonstrate, consider the problem of assigning a
new object (white circle) to either the 'green' class or the 'red' class; Figure 2.5 shows
an illustration of the Naïve Bayes algorithm. According to this figure, it is reasonable to
believe that any new object is twice as likely to have 'green' membership rather than
'red', since there are twice as many 'green' objects (40) as 'red' (20). In Bayesian
analysis, this belief is known as the prior probability. Therefore, the prior probabilities
of 'green' and 'red' are 0.67 (40 ÷ 60) and 0.33 (20 ÷ 60), respectively. Now, to classify
the 'white' object, we draw a circle around it which encompasses several points (a
number chosen beforehand) irrespective of their class labels. Four points (three 'red'
and one 'green') were considered in this figure. Thus, the likelihood of 'white' given
'green' is 0.025 (1 ÷ 40) and the likelihood of 'white' given 'red' is 0.15 (3 ÷ 20).
Although the prior probability indicates that the new 'white' object is more likely to
have 'green' membership, the likelihood shows that it is more likely to be in the 'red'
class. In Bayesian analysis, the final classification is produced by combining both
sources of information (i.e., prior probability and likelihood); their product is called the
'posterior' probability. Finally, the posterior probability of 'white' being 'green' is
0.017 (0.67 × 0.025) and the posterior probability of 'white' being 'red' is 0.049
(0.33 × 0.15). Thus, the new 'white' object should be classified as a member of the 'red'
class according to the NB technique.
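The arithmetic of this worked example can be checked directly in Python:

```python
# Priors: 40 'green' and 20 'red' objects out of 60 in total.
prior_green, prior_red = 40 / 60, 20 / 60

# Likelihoods: of the 4 points circled around the 'white' object,
# 1 of the 40 greens and 3 of the 20 reds fall inside the circle.
like_white_given_green = 1 / 40   # 0.025
like_white_given_red = 3 / 20     # 0.15

# Posterior = prior x likelihood (the normalizing constant cancels out
# when we only compare the two classes).
posterior_green = prior_green * like_white_given_green  # about 0.017
posterior_red = prior_red * like_white_given_red        # about 0.05

print("white is classified as:",
      "red" if posterior_red > posterior_green else "green")
```

Because only the comparison matters, NB never needs the evidence term of Bayes' theorem, which is part of why it is so fast in practice.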
The K-nearest Neighbor (KNN) algorithm is one of the simplest and earliest
classification algorithms. Unlike the NB technique, the KNN algorithm does not require
probability values to be considered. The 'K' in the KNN algorithm is the number of
nearest neighbors considered to take a 'vote' from. The selection of different values for
'K' can generate different classification results for the same sample object. Figure 2.6
shows an illustration of the KNN algorithm: for K = 3, the new object (star) is classified
as 'black'; however, it is classified as 'red' when K = 5.
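This sensitivity to K is easy to reproduce with a tiny NumPy implementation; the points below are made up to echo the situation in Figure 2.6:

```python
import numpy as np
from collections import Counter

def knn_predict(X, labels, query, k):
    """Classify `query` by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy data: the three closest points to the query are split 2-1 for
# 'black', but the next two points out are both 'red'.
X = np.array([[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]], dtype=float)
labels = ["black", "black", "red", "red", "red"]
query = np.array([0.0, 0.0])

print(knn_predict(X, labels, query, k=3))  # 'black'
print(knn_predict(X, labels, query, k=5))  # 'red'
```

In practice K is usually tuned by cross-validation, precisely because of this flip in the decision.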
Artificial neural networks (ANNs) are a set of machine learning algorithms inspired
by the functioning of the neural networks of the human brain. They were first proposed
by McCulloch and Pitts and later popularized by the works of Rumelhart et al. in the
1980s. In the biological brain, neurons are connected to each other through synapses,
and these connections can be rewired (e.g., through neuroplasticity), which helps the
brain adapt, process and store information. Figure 2.7 shows an illustration of an
artificial neural network with two hidden layers. The arrows connect the output of
nodes in one layer to the input of nodes in the next. An ANN is an interconnected group
of nodes: the output of one node goes as input to another node, and nodes are grouped
into matrices called layers depending on the transformation they perform. Apart from
the input and output layers, there can be one or more hidden layers in an ANN
framework. Nodes and edges have weights that adjust the signal strength of a
communication, which can be amplified or weakened through repeated training. Based
on the training and the subsequent adaptation of the node and edge weights, ANNs can
make predictions for new input samples.
Figure 2. 7: An illustration of the artificial neural network structure with two hidden layers [5]
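A minimal forward pass through a network shaped like Figure 2.7 can be written in NumPy; the layer sizes are illustrative and the weights are random placeholders, not trained values:

```python
import numpy as np

def sigmoid(z):
    """Squash a signal into (0, 1), a common node activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Two hidden layers, as in the figure; weights play the role of edge strengths.
rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input layer: 3 features
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # second hidden layer
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)   # single output node

x = np.array([0.5, -1.2, 0.8])                  # one input sample
h1 = sigmoid(x @ W1 + b1)      # each edge weight scales the signal it carries
h2 = sigmoid(h1 @ W2 + b2)
out = sigmoid(h2 @ W3 + b3)
print(out.item())              # a probability-like score in (0, 1)
```

Training consists of repeatedly adjusting `W1`, `W2`, `W3` (e.g., by backpropagation) so that this output matches the known labels, which is the "amplified or weakened" adaptation described above.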
In unsupervised learning, only input data is provided to the model, without the use of
labeled datasets. Unsupervised learning algorithms do not use labeled input and output
data; instead, they discover patterns in the data on their own. Unsupervised learning
methods are suitable when the output variables (i.e., the labels) are not known. Common
unsupervised approaches are based on clustering.
1 K-Means Clustering: This algorithm groups the data into small clusters by finding
the similarity between different data points; each data point is assigned to exactly one
cluster, the one whose points it is most similar to.
2 C-Means Clustering: Clusters are identified on the basis of similarity: a cluster
consists of similar data points belonging to one single family, and each data point is
assigned to a cluster. It is mostly used in medical image analysis.
3 Hierarchical Clustering: This approach takes raw data in the form of a matrix and
separates the clusters from one another in a hierarchy, with every single cluster
consisting of similar data points. A probabilistic model can also be fitted to the data with
expectation maximization.
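The assign-then-update loop at the heart of K-means can be sketched in NumPy; the toy data and the deterministic initialization (one seed point per blob) are illustrative choices:

```python
import numpy as np

def kmeans(X, init_idx, iters=20):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = X[init_idx].copy()
    for _ in range(iters):
        # similarity here is (negative) Euclidean distance to each centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, centroids

rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.2, size=(10, 2))   # tight group near the origin
blob_b = rng.normal(3.0, 0.2, size=(10, 2))   # another group near (3, 3)
X = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(X, init_idx=[0, 10])  # one seed in each blob
print(labels)
```

Each iteration can only lower the total within-cluster distance, which is why the loop settles quickly on well-separated data like this.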
Semi-supervised learning falls between supervised and unsupervised machine learning
methods. With the more common supervised machine learning methods, you train an
algorithm on a "labeled" dataset in which each sample is tagged with the expected
output. Semi-supervised learning, by contrast, is a branch of machine learning that
combines a small amount of labeled data with a large amount of unlabeled data during
training, sitting between unsupervised learning (with no labeled training data) and
supervised learning (with only labeled training data). Semi-supervised learning is used
in speech analysis: since labeling audio files is a very intensive task, it is a very natural
approach to this problem. It is also used for internet content classification, as labeling
each webpage is an impractical and unfeasible process.
Deep learning has gained massive popularity in scientific computing, and its algorithms
are widely used by industries solving complex problems. All deep learning algorithms
use different types of neural networks to perform specific tasks. The following are some
of the most popular ones.
1 Convolutional Neural Networks (CNNs): CNNs consist of multiple layers and are
mainly used for image processing and object detection. Yann LeCun developed the first
CNN in 1988, when it was called LeNet; it was used for recognizing characters like ZIP
codes and digits. CNNs are widely used to identify satellite images, process medical
images, forecast time series, and detect anomalies. [6]
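At the heart of a CNN layer is the convolution operation; the NumPy sketch below, with a made-up edge-detecting kernel, shows what a single filter computes over an image:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most DL libraries):
    slide the kernel over the image and take a weighted sum at each spot."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# An image that is dark on the left half and bright on the right half,
# and a tiny kernel that responds where brightness changes.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
kernel = np.array([[1.0, -1.0]])

edges = conv2d(image, kernel)
print(edges)  # nonzero only in the column where the edge sits
```

A CNN learns the kernel weights from data instead of hand-picking them, and stacks many such filters layer upon layer.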
2 Long Short-Term Memory Networks (LSTMs): LSTMs are a type of Recurrent
Neural Network (RNN) that can learn and memorize long-term dependencies; recalling
past information for long periods is their default behavior. LSTMs retain information
over time, which makes them useful in time-series prediction because they remember
previous inputs. LSTMs have a chain-like structure in which four interacting layers
communicate in a unique way. They are typically used for speech recognition, music
composition, and pharmaceutical development. [6]
3 Recurrent Neural Networks (RNNs): RNNs have connections that form directed
cycles, which allow the outputs of the previous step to be fed as inputs to the current
step. Thanks to this internal memory, an RNN can memorize previous inputs. RNNs are
commonly used for image captioning, time-series analysis, natural-language processing,
and machine translation. [6]
4 Generative Adversarial Networks (GANs): GANs are generative deep learning
algorithms that create new data instances resembling the training data. A GAN has two
components: a generator, which learns to generate fake data, and a discriminator, which
learns from that false information. The usage of GANs has increased over time. They
can be used to improve astronomical images and simulate gravitational lensing for
dark-matter research. Video game developers use GANs to upscale low-resolution
textures via image training, and GANs help generate realistic images and cartoon
characters. [6]
5 Radial Basis Function Networks (RBFNs): RBFNs are special types of feed-forward
neural networks that use radial basis functions as activation functions. They have an
input layer, a hidden layer, and an output layer, and are mostly used for classification,
regression, and time-series prediction. [6]
6 Multilayer Perceptrons (MLPs): MLPs are an excellent place to start learning about
deep learning technology. MLPs belong to the class of feed-forward neural networks
with multiple layers of perceptrons that have activation functions. MLPs consist of an
input layer and an output layer that are fully connected, may have multiple hidden
layers, and can be used to build speech-recognition, image-recognition, and
machine-translation software. [6]
7 Self-Organizing Maps (SOMs): SOMs reduce the dimensions of data through
self-organizing artificial neural networks. Data visualization attempts to solve the
problem that humans cannot easily visualize high-dimensional data, and SOMs are
created to help users understand this high-dimensional information. [6]
8 Deep Belief Networks (DBNs): DBNs are generative models that consist of
multiple layers of stochastic, latent variables. The latent variables have binary values
and are often called hidden units. [6] DBNs are a stack of Restricted Boltzmann
Machines (RBMs) with connections between the layers, where each RBM layer
communicates with both the previous and subsequent layers. DBNs are used for
image recognition, video recognition, and motion-capture data. [6]
9 Restricted Boltzmann Machines (RBMs): RBMs are stochastic neural networks that
can learn from a probability distribution over a set of inputs. This deep learning
algorithm is used for dimensionality reduction, classification, regression, collaborative
filtering, feature learning, and topic modeling. RBMs constitute the building blocks of
DBNs. RBMs consist of two layers, visible units and hidden units, with each visible unit
connected to all hidden units. RBMs have a bias unit connected to all the visible and
hidden units, and they have no output nodes. [6]
10 Autoencoders: Autoencoders are a specific type of feed-forward neural network in
which the input and output are identical. Geoffrey Hinton designed autoencoders in the
1980s to solve unsupervised learning problems. They are trained neural networks that
replicate the data from the input layer at the output layer. Autoencoders are used for
tasks such as anomaly detection, dimensionality reduction, and image
processing. [6]
Related Works on Disease Prediction
Extensive work has been carried out in the field of Artificial Intelligence, especially for
disease prediction. Dahiwade et al. proposed an ML-based system that predicts common
diseases. The symptoms dataset was imported from the UCI ML repository and
contained symptoms of many common diseases. The system used CNN and KNN as
classification techniques to achieve multiple-disease prediction. Moreover, the proposed
solution was supplemented with more information
that concerned the living habits of the tested patient, which proved to be helpful in
understanding the level of risk attached to the predicted disease. Dahiwade et al.
compared the results of the KNN and CNN algorithms in terms of processing time and
accuracy. The accuracy and processing time of CNN were 84.5% and 11.1 seconds,
respectively.
In light of this study, the findings of Chen et al. [8] also agreed that CNN outperformed
typical supervised algorithms such as KNN, NB, and DT. The authors concluded that
the proposed model scored higher in terms of accuracy, which is explained by the
capability of the model to detect complex nonlinear relationships in the feature space.
Moreover, CNN detects features of high importance that render a better description of
the disease, which enables it to accurately predict diseases with high complexity. This
conclusion is well supported and backed with empirical observations and statistical
arguments. Nonetheless, the presented models lacked details, for instance, neural
network parameters such as network size, architecture type, learning rate and
back-propagation technique. The models were also only evaluated in terms of accuracy,
which weakens the validity of the presented findings. Moreover, the authors did not take
into consideration the bias problem faced by the tested algorithms: in illustration, the
incorporation of more feature variables could have changed the reported performance.
In a related review, extensive research efforts were made to identify studies that applied
more than one supervised machine learning algorithm to single-disease prediction. Two
databases (i.e., Scopus and PubMed) were searched using different search terms, and 48
articles in total were selected for comparison among the various supervised machine
learning algorithms for disease prediction. The review found that the Support Vector
Machine (SVM) algorithm was applied most frequently (in 29 studies), followed by the
Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm
showed the highest accuracy in 9 of the studies that used it (53%), followed by the SVM
algorithm.
Related Works on Breast Cancer Prediction
Sengar et al. [9] attempted to detect breast cancer using ML algorithms, namely RF,
Bayesian Networks and SVM. The researchers obtained the Wisconsin original breast
cancer dataset from the UCI repository and utilized it for comparing the learning
models in terms of key parameters such as accuracy, recall, precision, and area under
the ROC curve. The classifiers were tested using the K-fold cross-validation method,
with K chosen equal to 10. The simulation results proved that SVM excelled in terms of
recall, accuracy, and precision. However, RF had a higher probability of correctly
classifying a tumor. In contrast, Yao [10] experimented with various data mining
methods, including RF and SVM, to determine the best-suited algorithm for breast
cancer prediction. Per the results, the RF algorithm scored 96.27%, 96.78%, and
94.57% on the evaluation metrics considered, while SVM scored a lower accuracy
value. Yao reached the conclusion that the RF algorithm performed better than SVM
because the former scales well for large datasets and presents lower chances of variance
and data overfitting. Reducing the raw data used for training also proved to be
disadvantageous for ML models: according to Yao, omitting parts of the data reduces
the quality of the images, and therefore the performance of the models.
Another author reviewed different breast cancer prediction techniques and analyzed
their accuracy across various journals. Her main focus was to compare the different
techniques in order to find the most appropriate method, one that supports large datasets
with good prediction accuracy. She found that machine learning techniques were used
in 27 papers, ensemble techniques in 4 papers, and deep learning techniques in 8 papers.
She concluded that each technique is suitable under different conditions and for
different types of dataset, and that, after analysis, the SVM algorithm is the most
suitable for the prediction of breast cancer. Different researchers have provided analyses
of prediction algorithms using the Wisconsin Diagnostic Breast Cancer (WDBC)
dataset, and these analyses show that each time the accuracy of the SVM algorithm is
higher than that of the other machine learning algorithms.
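Comparisons of this kind can be reproduced in outline on the WDBC data bundled with scikit-learn; the model choices, their default settings, and the 10-fold protocol below are illustrative, so the exact scores will differ from those in the cited papers:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Feature scaling matters for SVM, so it gets a StandardScaler pipeline.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# 10-fold cross-validation, as in several of the studies reviewed above.
results = {name: cross_val_score(m, X, y, cv=10).mean()
           for name, m in models.items()}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Running such a comparison on the same folds for every model is what makes the accuracy figures in the reviewed papers directly comparable.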
Delen et al. [11] used artificial neural networks, decision trees and logistic regression to
develop prediction models for breast cancer survival by analyzing a large dataset, the
SEER cancer incidence database. Two popular data mining algorithms (artificial neural
networks and decision trees) were used, along with a most commonly used statistical
method (logistic regression) to develop the prediction models using a large dataset
(more than 200,000 cases). 10-fold cross-validation method was used to measure the
unbiased estimate of the three prediction models for performance comparison purposes.
The results indicated that the decision tree (C5) is the best predictor with 93.6%
accuracy on the holdout sample (this prediction accuracy is better than any reported in
the literature), artificial neural networks came out to be the second with 91.2% accuracy
and the logistic regression models came out to be the worst of the three with 89.2%
accuracy. The comparative study of multiple prediction models for breast cancer
survivability using a large dataset along with a 10-fold cross-validation provided us with
an insight into the relative prediction ability of the different data mining methods.
Lundin et al. [12] used ANN and logistic regression models to predict 5, 10, and 15-
year breast cancer survival. They studied 951 breast cancer patients and used tumor
size, axillary nodal status, histological type, mitotic count, nuclear pleomorphism,
tubule formation, tumor necrosis, and age as input variables. In this study, they showed
that data mining could be a valuable tool in identifying similarities (patterns) in breast
cancer cases, which can be used for diagnosis, prognosis, and treatment purposes. The
area under the ROC curve (AUC) was used as a measure of the accuracy of the prediction
models in generating survival estimates for the patients in the independent validation
set. The AUC values of the neural network models for 5-, 10- and 15- year breast-
cancer-specific survival were 0.909, 0.886 and 0.883, respectively. The corresponding
AUC values for logistic regression were 0.897, 0.862 and 0.858. Axillary lymph node
status (N0 vs. N+) predicted 5-year survival with a specificity of 71% and a sensitivity
of 77%. The sensitivity of the neural network model was 91% at this specificity level.
The rate of false predictions at 5 years was 82/300 for nodal status and 40/300 for the
neural network. When nodal status was excluded from the neural network model, the
rate of false predictions increased only to 49/300 (AUC 0.877). The authors concluded
that an artificial neural network is very accurate in 5-, 10- and 15-year breast
cancer-specific survival prediction, with consistently high accuracy over time and good
predictive performance even without nodal status.
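As a concrete illustration of scoring a classifier by AUC, as in [12], the sketch below computes the AUC of a logistic regression model. The dataset (scikit-learn's built-in Wisconsin diagnostic data) and model settings are stand-ins for illustration, not the cited study's own cohort or model.

```python
# Illustrative only: AUC of a logistic regression on a stand-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
# AUC uses predicted probabilities, not hard labels
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

An AUC of 0.5 corresponds to a chance-level classifier, and 1.0 to a perfect one, which is why the AUC values around 0.85 to 0.91 above indicate strong models.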
Ahmad et al. [13] implemented machine learning techniques, i.e., Decision Tree (C4.5),
Support Vector Machine (SVM), and Artificial Neural Network (ANN) to develop the
predictive models for patients registered in the Iranian Center for Breast Cancer (ICBC)
program from 1997 to 2008. The dataset contained 1189 records, 22 predictor
variables, and one outcome variable. The main goal of their paper was to compare the
three models in terms of sensitivity, specificity, and accuracy. Their analysis showed
that the accuracies of DT, ANN and SVM
are 0.936, 0.947 and 0.957 respectively. The SVM classification model predicts breast
cancer recurrence with the least error rate and highest accuracy. The predicted accuracy
of the DT model is the lowest of all. The results were achieved using 10-fold cross-validation
for measuring the unbiased prediction accuracy of each model.
Ayer et al. [14] examined two of the most frequently used computer models in clinical
risk estimation: logistic regression and the artificial neural network. A study was
conducted to review and
compare these two models, elucidate the advantages and disadvantages of each, and
provide criteria for model selection. The two models were used for estimation of
breast cancer risk on the basis of mammographic descriptors and demographic risk
factors. Although they demonstrated similar performance, the two models have unique
characteristics: the main advantage of ANNs over the logistic regression model lies in
their hidden layers of nodes. In fact, a special ANN with no hidden node has been shown
to be identical to a logistic regression model. ANNs are particularly useful when there are
implicit interactions and complex relationships in the data, whereas logistic regression
models are the better choice when one needs to draw statistical inferences from the
output.
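The equivalence mentioned above is easy to verify numerically: a single output neuron with a sigmoid activation computes exactly the logistic regression probability. The weights, bias and input values below are arbitrary, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.4, -1.2, 0.7])   # arbitrary weights
b = 0.05                          # arbitrary bias
x = np.array([1.0, 0.5, 2.0])    # arbitrary input features

p_neuron = sigmoid(w @ x + b)                          # ANN with no hidden node
p_logreg = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))   # logistic regression formula
assert abs(p_neuron - p_logreg) < 1e-12                # identical by construction
```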
Maxine Tan et al., [15] proposed a novel computerized system to predict breast cancer
risk using quantitative assessment of mammographic images. In this research work, data
were collected from 335 women and the proposed model was applied to the collected
images: 159 cases were identified as cancers, while the remaining 176 were cancer-free.
An SVM classifier was used to predict the disease.
Habib Dhahri et al. [16] explained how various machine learning methods are used to
predict cancer. The proposed model was based on genetic programming combined with
machine learning algorithms. The purpose of the study was to easily differentiate
between the types of breast cancer tumor, benign and malignant. Genetic programming
was applied to find the best features and parameter values for the machine learning
approaches. The performance of the proposed system was measured in terms of
sensitivity, specificity, precision and accuracy. This research work showed that genetic
programming can automatically generate a model through feature selection.
Konstantina Kourou et al. [17] noted that many different types of cancer exist today and
that a disease identified at an early stage is curable, so frequent screening is needed
to detect it. People are categorized based on the risk level of the disease. Machine
learning algorithms are used to identify the important features in complex data sets;
approaches such as Support Vector Machines and Decision Trees have been broadly used in
cancer detection research. Their work presented a review of the latest machine learning
techniques in this area.
Somil Jain et al. [18] discussed the causes of breast cancer and how to predict it at an
earlier stage. Many people are affected by the disease, and different machine learning
approaches and data mining techniques have been used for medical prediction of various
diseases, including breast cancer. The major contribution of their work was to measure
the correctness of the classification concepts and identify the best algorithm based on
accuracy and predictive capability [5].
Anusha Bharat et al. [19] explained that machine learning is widely used in this domain.
Breast cancer is one of the most common diseases throughout the world, and cancer cells
are of two types, benign and malignant. In the proposed work an SVM classifier was used
to predict the cancer; the SVM concept was applied to the Wisconsin Breast Cancer
dataset, and the same data set was also trained using KNN, Naive Bayes and CART. The
accuracy level of each algorithm was compared.
Shubham Sharma et al. say that many Indian women are affected by breast cancer, and
about 50% of the affected women reach a fatal condition. Their research work compared
various machine learning algorithms used in breast cancer diagnosis, and the obtained
results were applied to breast cancer detection.
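A minimal sketch of such a comparison follows, assuming scikit-learn and its built-in copy of the Wisconsin data; the exact preprocessing and parameter choices in the cited works may differ.

```python
# Compare SVM, KNN, Naive Bayes and CART on the Wisconsin data with 10-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier  # scikit-learn's CART implementation

X, y = load_breast_cancer(return_X_y=True)
models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "NaiveBayes": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=0),
}
# Mean accuracy over 10 folds, as in the cross-validation protocol cited above
scores = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```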
Moh'd Rasoul Al-hadidi et al. [20] note that breast cancer is a critical disease in
females throughout the entire world, and that detecting the disease early is the first
step to improving survival. Radiologists use mammography images to predict the disease.
The authors proposed a new model to sense breast cancer with high exactness. The process
was divided into two stages: in the first stage, image processing concepts are applied
to prepare the input mammography pictures for the feature and pattern extraction task;
supervised learning methods are then applied to the extracted features. Here the Back
Propagation Neural Network concept and the Logistic Regression (LR) technique are used,
and finally the accuracy of the two models is compared.
Naresh Khuriwal et al. [21] say that breast cancer is a critical disease for women. The
main target of the study was curing the cancer at the initial stage using scientific
methods, since early diagnosis can be used to remove the disease entirely. In this
research work, 41 research papers were reviewed. The authors used deep learning concepts
and convolutional neural network algorithms to predict breast cancer; the mammographic
images were taken from the MIAS database, and the output of the research shows 98%
accuracy. The processing was divided into three stages: first the data were collected
and the unwanted data removed using pre-processing techniques; then the dataset was
split for training and testing purposes; finally, a model was developed using machine
learning algorithms.
B. M. Gayathri et al. [22] explained that breast cancer risk has risen with changes in
women's lifestyles, and that machine learning concepts can be used to detect the
disease. Their research work performed a comparative study of the Relevance Vector
Machine (RVM) against some other machine learning concepts used for breast cancer
prediction.
Dana Bazazeh et al. [23] say that breast cancer is the most widespread fatal disease
among women throughout the world, and that machine learning approaches are used to
diagnose it at an early stage. In this research work the authors compared the most
commonly used machine learning concepts, namely the Support Vector Machine, the Random
Forest technique and Bayesian Networks, on the Wisconsin original breast cancer dataset.
Zhiqiong Wang et al. [24] used convolutional neural network (CNN) deep features to
detect the disease. As an initial step, the authors used a mass detection method based
upon CNN deep learning features and an unsupervised machine learning clustering concept.
They then built various feature sets such as morphological, texture and density
features. Finally, an ELM classifier was designed to classify masses as benign or
malignant.
Yawen Xiao et al. [25] say that breast cancer is a common disease among women. Their
research work demonstrated a new system embedding a deep learning-based unsupervised
feature extraction algorithm: a stacked auto-encoder (SAE) was used together with a
support vector machine to predict breast cancer. The proposed method was tested on the
Wisconsin Diagnostic Breast Cancer data set, and the results show that the SAE-SVM
method improved prediction accuracy.
Junaid Ahmad Bhat et al. [26] developed a new tool to detect breast cancer at an early
stage. The authors presented preliminary results of the project BCDM, developed in
Matlab, with an algorithm for women's breast cancer diagnosis from digital mammographic
images. The work aimed to create computer-aided diagnosis tools that can assist
radiologists.
In a related survey, breast cancer prediction using deep learning and machine learning
was investigated to find the accuracy and repetition of the selected algorithms with the
best performance. A total of 43,900 papers were found by keyword search on platforms
such as ACM, IEEE, ResearchGate and Science Direct; the search query focused on four
keywords: machine learning, deep learning, data mining and breast cancer.
[Summary table: recovered accuracy values include DT = 0.935-0.936, SVM = 0.957-0.959,
RF = 0.963, ANN (2010) = 0.965, and LR = 0.894-0.963.]
Investigation of the relevant literature helps in sorting out the different deep
learning and machine learning techniques that could be exploited in detecting breast
cancer. After reviewing all the techniques, their performance, their accuracy, and the
number of times they appeared in a journal, the optimum machine learning technique
selected for breast cancer detection was the artificial neural network, more precisely
the Convolutional Neural Network (CNN), because it was used in more references than any
other algorithm. The CNN architecture and functioning adopted for the proposed
methodology are presented in the next chapter.
CHAPTER THREE
3.1 Introduction
This chapter carefully explains the various methods and processes taken to realize the
project. It reveals the model used, the training of the model in the software, and several
other parameters.
The summary of the project methodology is explained in Figure 3.1. The project
classifies breast tumors from FNA biopsy images using machine learning as either
malignant (cancerous) or benign (non-cancerous). First, the CNN model is built and
trained in Colab by importing the chosen data set into it. Then, once a high accuracy is
achieved, a web app is created in the front end to allow a new prediction to be made for
any patient image data. Google Colab was preferred to Kaggle for training the model
because it is very simple to use and also has default code to directly call a dataset
into the model.
The data set used was the Breast Cancer Wisconsin (Diagnostic) data set
(breast-cancer-wisconsin-data). Dr. William H. Wolberg, from the University of Wisconsin
Hospitals, Madison, obtained this breast cancer database [28]. Figure 3.2 shows the
first five rows and columns of the data set. There are 30 input parameters and more than
600 patient cases. Target variables can only take two values in a classification model:
0 (false) or 1 (true). Since this dataset doesn't contain image data, another dataset
containing histopathological FNA biopsy images was also used to classify the instances
as either benign or malignant.
Figure 3.2: Section of data set showing first five rows and columns
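A copy of the same database ships with scikit-learn, so its structure can be checked directly. This is a sketch for orientation; the thesis itself loaded the data as a CSV, and the scikit-learn copy holds 569 of the patient cases.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(df.shape[1])                 # 30 input parameters, as described above
print(sorted(set(cancer.target)))  # the target takes only two values: [0, 1]
print(df.head())                   # first five rows, as in Figure 3.2
```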
Google Colab is a free environment that provides users with free Graphics Processing
Unit (GPU)
and Tensor Processing Unit (TPU) runtimes for training their machine learning
algorithms. Once in the Google Colab environment, a user will simply have to login
to a Gmail account and create a new notebook in order to create a new neural
network. CNN was chosen over other ANN algorithms because digital images are a bunch of
pixels with high values, and it makes sense to use CNN to analyze them.
CNN reduces these values, which makes the training phase better, with less computational
power and less information loss. The main reason why ReLU was chosen as the activation
function is that early papers observed that training a deep network with ReLU tended to
converge much more quickly and reliably than training a deep network with sigmoid
activation.
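The convergence point can be seen directly from the two activations' gradients: the sigmoid's gradient vanishes for large |z|, while ReLU's gradient stays at exactly 1 for any positive input. The test values below are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
sig_grad = sigmoid(z) * (1.0 - sigmoid(z))  # saturates toward 0 for large |z|
relu_grad = (z > 0).astype(float)           # stays exactly 1 for all z > 0
print(np.round(sig_grad, 4), relu_grad)
```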
CNNs are comprised of three types of layers. These are convolutional layers,
pooling layers and fully-connected layers. When these layers are stacked, a CNN
architecture has been formed [29]. Figure 3.3 shows the architecture. The CNN model
takes the input image, summarizes its content by convolving sliding filters over it and
pooling the most salient responses, and yields a fixed-length feature vector that is
passed to the remaining layers.
The CNN model was trained in Colab using the TensorFlow and Keras modules. The
classification was done by separating the data set into a training set and a validation set. The
training was set for ten epochs for the training and validation data sets. A resulting
graph for the training and validation processing was made to highlight the accuracy and
loss for both sets, which will be discussed in the next chapter.
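The training setup described above can be sketched as follows. The layer sizes, input image shape and the commented-out dataset pipeline are illustrative assumptions, not the exact Appendix A code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),             # assumed image size
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # benign vs malignant
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# history = model.fit(train_ds, validation_data=val_ds, epochs=10)  # ten epochs, as above
```

The `history` object returned by `fit` holds the per-epoch accuracy and loss used to draw the training/validation graph discussed in the next chapter.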
Convolution Layer: Figure 3.4 shows the convolution operation. This is the first layer
of the convolutional network that performs feature extraction by sliding the filter over
the input image. The output, or convolved feature, is the element-wise product of the
filter and the image patch, summed for every sliding action. The output, also known as
the feature map, captures features of the original image such as curves, sharp edges and
textures. In networks with more convolutional layers, the initial layers extract generic
features, while increasingly complex features are extracted as the network gets deeper.
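The sliding element-wise product and sum can be written out directly. The 3x3 "image" and 2x2 filter below are toy values, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each output value is the
    element-wise product of kernel and image patch, summed."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(1.0, 10.0).reshape(3, 3)     # toy 3x3 "image": values 1..9
edge = np.array([[1.0, -1.0], [1.0, -1.0]])  # simple vertical-edge filter
fmap = convolve2d_valid(img, edge)
print(fmap)  # every patch of this image changes by 1 horizontally, so all values are -2
```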
Pooling Layer: Figure 3.5 shows the functioning of the pooling layer. The primary
purpose of this layer is to reduce the number of trainable parameters by decreasing the
spatial size of the image, thereby reducing the computational cost. The image depth
remains unchanged since pooling is done independently on each depth dimension. Max
Pooling is the most common pooling method, where the most significant element in each
window of the feature map is taken. Max Pooling gives an output image with dimensions
reduced to a great extent while retaining the essential information.
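A 2x2 max-pooling pass can be sketched in a few lines; the feature map values are made up for illustration. Note how the 4x4 map shrinks to 2x2 while each region's most significant element survives.

```python
import numpy as np

def max_pool2x2(x):
    # Group the array into 2x2 blocks and keep the maximum of each block
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 6., 1., 2.],
                 [7., 2., 9., 5.],
                 [3., 1., 4., 8.]])
pooled = max_pool2x2(fmap)
print(pooled)  # [[6. 2.] [7. 9.]]: spatial size halved, maxima retained
```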
Fully Connected Layer: Figure 3.6 shows the functioning of the fully connected layer.
The last few layers which determine the output are the fully connected layers. The
output from the pooling layer is flattened into a one-dimensional vector and then given
as input to the fully connected layer. The output layer has the same number of neurons
as the number of categories in the classification problem. The predicted output is
compared to the actual target to generate an error, which is then back-propagated to
update the filters (weights) and bias values; thus, one training iteration is completed.
The diagnostic ability of classifiers has usually been determined by the confusion
matrix and the Receiver Operating Characteristic (ROC) curve. In the machine learning
research domain, the confusion matrix is also known as error or contingency matrix.
The basic framework of the confusion matrix is provided in Figure 3.7. In this
framework, true positives (TP) are the positive cases where the classifier correctly
identified them. Similarly, true negatives (TN) are the negative cases where the classifier
correctly identified them. False positives (FP) are the negative cases where the classifier
incorrectly identified them as positive, and false negatives (FN) are the positive cases
where the classifier incorrectly identified them as negative. The following measures,
which are based on the confusion matrix, are commonly used to analyze the performance of
learning algorithms. The acceptable ranges for the accuracy, precision, F1 score,
sensitivity and specificity are from 90% to 100%, while the AUC, discussed next, ranges
from 0.5 to 1.
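With hypothetical counts (the numbers below are made up purely for illustration), the confusion-matrix measures compute as:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 90, 85, 5, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)   # true negative rate
f1_score    = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} f1={f1_score:.3f}")
```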
ROC is one of the fundamental tools for diagnostic test evaluation and is created by
plotting the true positive rate against the false positive rate at various threshold settings.
The area under the ROC curve (AUC) is also commonly used: the larger the AUC, the better
the classifier, and vice versa. Figure 3.7 illustrates three ROC curves based on an
abstract dataset. The area under the blue ROC curve is half of the shaded rectangle;
thus, the AUC value for this blue ROC curve is 0.5. Due to the coverage of a larger
area, the AUC value for the red ROC curve is higher than that of the black ROC curve.
Hence, the classifier that produced the red ROC curve shows higher predictive accuracy
compared with the other two classifiers that generated the blue and black ROC curves.
The web app development proposed should satisfy the SDLC, or Software
Development Life Cycle, which is a set of steps used to create software applications.
Figure 3.8 shows an illustration of the seven steps of the software development life
cycle. These steps divide the development process into tasks that can then be assigned,
completed, and measured. It simply outlines each task required to put together a
software application. This helps to reduce waste and increase the efficiency of the
development process. Monitoring also ensures the project stays on track and continues to
meet its requirements.
1 Planning
In this phase, the scope and purpose of the application are defined. The purpose of the
web app is to enable any patient or medical expert to know, in a very accurate way,
whether an uploaded image is benign or malignant. The first step would be for the user
to get a fine needle aspiration biopsy from a hospital. The cost of the biopsy is
approximately $10, which is around 7,000 FCFA; that is the cost of obtaining an image
that can be used in the app.
2 Define Requirements
Our application is supposed to read the uploaded image from the user and compare it
against the thousands of trained images in the back end in order to give an accurate
prediction.
3 Coding
The app is designed for the web using Python as the programming language, Google Colab
or Visual Studio Code as the IDE, and Streamlit as the web development framework. The
model was trained with our CNN algorithm over ten epochs.
The source code for building the web app in Python is given in Appendix A.
4 Software Development
The Streamlit utility is imported into the code, and the app.py file is downloaded. In
the command prompt window, the directory of the app.py file is used to create a Heroku
app from the code: after logging in with heroku login, the app was created using heroku
create aicancertracer. After this has been done, some additional files are needed to run
the app online.
The requirements.txt file contains all the libraries that need to be installed for the
project to work. This file can be created manually, by going through all the files and
looking at what libraries are used, or automatically, using a tool like pipreqs. This
can be found in
Appendix B.
Using the setup.sh and Procfile files, you can tell Heroku the needed commands for
starting the application. In the setup.sh file, we will create a streamlit folder with a
credentials.toml and a config.toml file. The Procfile is used to execute the setup.sh and
then call streamlit run to run the application. This can be found in Appendices C and D
respectively.
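The two files can be sketched as below. This is an assumption-level sketch, not the Appendix C/D contents; the app file name app.py matches the text, and the port handling follows Heroku's $PORT convention.

```shell
# setup.sh -- writes the Streamlit config Heroku needs (sketch).
mkdir -p ~/.streamlit
cat > ~/.streamlit/config.toml <<EOF
[server]
headless = true
port = $PORT
enableCORS = false
EOF

# Procfile -- a single line telling Heroku how to start the app:
#   web: sh setup.sh && streamlit run app.py
```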
The final step is to run git init and push all the files mentioned to the Heroku master
branch in order to deploy the app.
5 Testing
Prior to launching the app on the web, the app was tested on the local host using ngrok
in Google Colab. After coding, the ngrok utility was used to check that the app was
working before using Streamlit to embed the code. When the IDE used was VS Code, the app
could still be tested on the computer's local host on port 8500, 8501 or 8502.
6 Deployment
The app was deployed to Heroku, at an address under herokuapp.com.
7 Maintenance
Heroku is a platform that requires a lot of maintenance, which is done by going to the
Heroku CLI dashboard and adjusting the settings in order to keep the app running. The
Kaffeine module was also used to keep the app running, at https://kaffeine.
In Unified Modelling Language (UML), a use case diagram helps understand how a
user might interact with the system. In this case the use case diagram of the system is
presented in Figure 3.9. There are two actors in the process; the patient and the doctor.
The interactions between the actors and cases are demonstrated in the following
paragraphs. [33]
Case 1: Perform diagnosis: The doctor will perform a series of tests on the patient.
First, the doctor will perform a fine needle aspiration biopsy, taking a tissue sample
from the breast for further examination. Next, he will inspect the physical appearance
of the biopsy sample under a microscope and load the resulting image into the web app
for classification.
Next, the doctor will share the image data of the diagnosis with the patient to show him
or her the findings, so that the patient can also check his or her status via the
web app. If the cancer is malignant, the doctor will proceed with a number of tests to
see how advanced the cancer is and how far it has spread.
Then the doctor will look into other factors to determine the prognosis of the disease.
[33]
Case 2: Propose treatment: When the above investigations are completed, the doctor will
counsel the patient regarding the best treatment options available, based on the type of
cancer. Besides proposing the treatment options, the doctor also needs to explain to the
patient the risks of the particular treatment and the chances of recovery. Upon
endorsement by the patient, the doctor will schedule the treatments for the patient.
[33]
In this chapter we have seen how the data set and images were trained with the CNN, and
how the web app was developed using Streamlit in Google Colab. In the next chapter we
shall view the results of the training, the ROC curve and the confusion matrix, together
with the deployed application.
CHAPTER FOUR
4.1 Introduction
This chapter presents the results of training the CNN model, giving information about
the ROC curve, the confusion matrix, the accuracy and other KPIs, and also, the
application deployed.
From the coding of the model, which is found in Appendix A, Figure 4.1 shows the ROC
curve obtained after training the model; the output was a 0.98 AUC. The next result
obtained was the confusion matrix after training on the dataset, which gave all the
resulting KPIs shown. Figure 4.2 shows the confusion matrix, while Figure 4.3 shows a
graph of the training and validation accuracy versus the training and validation loss.
From the results obtained, the accuracy of the prediction is 97.9%, implying a very good
classification performance. A plot of the training and validation accuracies of the two
data sets was also obtained from the TensorFlow run, together with a plot of their
respective losses, as a function of the epoch history. From Figure 4.3, the training and
validation accuracy increases with the number of epochs, while the training and
validation loss decreases, indicating that the model was learning effectively.
The Streamlit app was able to predict new cases of benignity or malignancy when the user
uploaded an image.
Figure 4.4 shows the homepage of the web app, while Figure 4.5 and Figure 4.6 show
the results of prediction with a benign and malignant image respectively. The app takes
three seconds to give the prediction results for both the benign and malignant images.
A benign image was uploaded to the app, as can be seen in Figure 4.5, and the result was
benign. Next, a malignant image was uploaded, as can be seen in Figure 4.6, and the
result was malignant, indicating a correct classification made by the application.
The CNN used for this prediction gave an accuracy of 97.9%, which is superior to that
obtained in previous works; Table 4.1 summarizes the comparison of results. From the
previous works, it is clearly revealed that CNN is the most preferred deep learning
technique. It is preferred over other neural network algorithms such as RNN, Feed
Forward Neural Networks, the Kohonen Self-Organizing Neural Network or VGG-16 because it
processes high-pixel-value images faster than the other algorithms. Therefore, the
choice to use it for the project was well founded.
Table 4.1: Comparative analysis of results with previous works on deep learning for
breast cancer detection
[Recovered values from the table include sensitivity = 0.968, FNR = 0.13, and accuracies
of 0.9688 and 0.955.]
In addition, from the previous works, the main KPI, the accuracy, ranges from 0.846 to
0.987, indicating strong classification; our own yields 0.979, which is toward the high
end of that range. Another remark is that multiclass variables, beyond the traditional
benign or malignant classification, were also detected at the output, indicating the
possibility of extending the project according to the type of cancer to be detected. In
general, the results are not very different from those of the previous works; the
accuracy and other KPIs are high enough to indicate that the training and prediction
were efficient.
In summary, we have all the results of our classification model: the accuracy was very
high, higher than those of most previous works, indicating a good selection of the model
and the training dataset. Also, the web app was deployed to enable any user with a new
sample patient image to obtain a prediction of the most likely class of the image data,
hence helping to save lives.
CHAPTER FIVE
GENERAL CONCLUSION
In this dissertation, we proposed a simple and effective method for the classification
of histopathology breast cancer images in the case of large training data. Following the
training of the artificial neural network with the breast cancer dataset, we had the
following results.
From the above values, we realize that the training and classification of the model were
accurately done, with ease and simplicity. Approximately 4,000 images were used for
training, with 20% of this number used for validation; the feature extraction and
classification were therefore effective.
The realization of this project means that there exist various ways through which breast
cancer and other diseases could be easily detected using AI. Cancer detection in
medical imaging is a field that can achieve many good results with deep learning
technology. The reviewed papers are summarized in Table 4.1. So far, the results are
very satisfactory; moreover, the development of deep learning technology is very fast,
the supply of researchable medical image data is getting bigger, and research funds are
growing, so the future is bright. In the future, it will be easier and more accurate to
diagnose not only from medical images but also from EHR and genetic information with the
help of deep learning technology. The development of deep learning technologies is
important for this, but the role of physicians who understand and use these technologies
becomes increasingly important as well.
5.3 Recommendations
This project can be widely used in the field of medicine, where diseases like breast
cancer and heart disease can be easily diagnosed for the good of society, to which AI
has been successfully applied. This study will be valuable to the medical field as it
will allow fast diagnosis of breast cancer even in areas without a specialist. Moreover,
it could be of high interest to patients wishing to confirm their diagnosis.
This study had some limitations. The histopathology images were downsized to fit the
available GPU. As more GPU memory becomes available, future studies will be able to
train models using larger image sizes, or retain the original image resolution without
the need for downsizing. Retaining the full resolution of the images will provide finer
detail for the models to learn from.
For future work, we intend to use and evaluate other pretrained CNN models for the
feature extraction stage, and to extend the application's usability to other types of
cancer.
Most papers published in the field of breast cancer detection and subtype classification
use machine learning techniques; however, deep learning models have not been heavily
investigated in this domain. A thought for the future would be to predict patient status
with models such as LSTM, GAN and RNN, as these types of research have not yet been
widely explored.
of the project.
REFERENCES
[1] F. Noreen, L. Liu, H. Sha, and H. Ahmed, “Prediction of breast cancer, comparative
review of machine learning techniques, and their analysis,” IEEE Access, vol. PP, pp.
1–1, 08 2020.
https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-
https://insights.daffodilsw.com/blog/10-uses-of-artificial-intelligence-in-day-to-day-life,
machine learning algorithms for disease prediction,” BMC Medical Informatics and
[6] A. Biswal, “Top 10 deep learning algorithms you should know in 2023.” Accessed:
2022-07-23.
[7] D. Dahiwade, G. Patle, and E. Meshram, “Designing disease prediction model using
[8] H. Chen, “An efficient diagnosis system for detection of Parkinson's disease using
[9] P. Sengar, M. Gaikwad, and D.-A. Nagdive, “Comparative study of machine learning
[10] Y. Dengju, J. Yang, and X. Zhan, “A novel method for disease prediction: Hybrid of
random forest and multivariate adaptive regression splines,” Journal of Computers, vol.
8, 01 2013.
[11] D. Delen, G. Walker, and A. Kadam, “Predicting breast cancer survivability: A
comparison of three data mining methods,” Artificial Intelligence in Medicine, vol. 34, pp.
113–27, 07 2005.
[12] M. Lundin et al., “Artificial neural networks applied to survival prediction in breast cancer,” Oncology, vol.
[13] L. G. Ahmad et al., “Using three machine learning techniques for predicting breast cancer recurrence,”
[15] M. Tan, B. Zheng, J. Leader, and D. Gur, “Association between changes in
mammographic image features and risk for near-term breast cancer development,” IEEE
[16] H. Dhahri, “Automated breast cancer diagnosis based on machine learning algorithms,”
[18] S. Jain and P. Kumar, “Prediction of breast cancer using machine learning,” Recent
[19] A. Bharat, N. Pooja, and R. Reddy, “Using machine learning algorithms for breast
[21] N. Khuriwal and N. Mishra, “Breast cancer diagnosis using deep learning algorithm,” 10
2018.
[22] B. Gayathri and C. Sumathi, “Comparative study of relevance vector machine with
various machine learning techniques used for detecting breast cancer,” pp. 1–5, 12
2016.
[23] D. Bazazeh and R. Shubair, “Comparative study of machine learning algorithms for breast cancer
[24] Z. Wang, M. Li, H. Wang, H. Jiang, Y. Yao, H. Zhang, and J. Xin, “Breast cancer
detection using extreme learning machine based on feature fusion with cnn deep features,”
[25] Y. Xiao, J. Wu, Z. Lin, and X. Zhao, “Breast cancer diagnosis using an unsupervised
[26] J. Bhat, V. George, and B. Malik, “Cloud computing with machine learning could help
[27] M. Perumal, “A research on computer aided detection system for women breast cancer
Sciences of the United States of America, vol. 87, pp. 9193–6, 01 1991.
[29] K. O'Shea and R. Nash, “An introduction to convolutional neural networks,” ArXiv e-
prints, 11 2015.
[30] J. Zhang and C. Zong, “Deep neural networks in machine translation: An overview,”
[31] X. Kang, B. Song, and F. Sun, “A deep similarity metric method based on incomplete data
for traffic anomaly detection in iot,” Applied Sciences, vol. 9, p. 135, 01 2019.
https://www.javatpoint.com/software-engineering-software-development-life-cycle, 2021.
Accessed: 2022-07-26.
[33] O. Sheta and A. Nour Eldeen, “Building a health care data warehouse for cancer
2012.
morphometric patterns with estrogen receptor status in breast cancer pathologic specimens,”
[36] M. Toğaçar, B. Ergen, and Z. Cömert, “Application of breast cancer diagnosis based
on a combination of convolutional neural networks, ridge regression and linear
discriminant analysis using invasive breast cancer images processed with autoencoders,”
[37] S. Sharma and D. R. Mehra, “Breast cancer histology images classification: Training
tional neural network for breast cancer classification using rna-seq gene expression
network for estrogen and progesterone scoring using breast ihc images,” Pattern
Appendices
import os
import glob
import base64
import pickle
import uuid
from datetime import date
from io import BytesIO, StringIO
from zipfile import ZipFile

import cv2
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import PIL
import streamlit
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras.models import Sequential
from keras.layers import MaxPooling2D, BatchNormalization, Flatten
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix
from IPython import display
from google.colab import files

cancer = load_breast_cancer()
# X assumed to be the feature DataFrame; its definition was lost in extraction
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.head()
def plot_to_str():
    # serialise the current matplotlib figure to a base64-encoded PNG string
    img = BytesIO()
    plt.savefig(img, format='png')
    img.seek(0)
    return base64.b64encode(img.read()).decode()

# Plot ROC curve (clf is the classifier trained earlier; its definition
# did not survive extraction)
plot_roc_curve(clf, X_test, y_test)
roc_curve = plot_to_str()

# Plot confusion matrix
plot_confusion_matrix(clf, X_test, y_test)
confusion_matrix = plot_to_str()
DATADIR = "/content/FNA"
path = os.path.join(DATADIR, category)  # category is set by an enclosing loop (not shown)

import pathlib
data_dir = pathlib.Path(DATADIR)
image_count = len(list(data_dir.glob('*/*.png')))
benign = list(data_dir.glob('benign/*'))
malignant = list(data_dir.glob('malignant/*'))
batch_size = 32
img_height = 180  # image dimensions assumed; the original values were lost in extraction
img_width = 180

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

class_names = train_ds.class_names
print(class_names)

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

normalization_layer = layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
# Notice the pixel values are now in [0, 1].
num_classes = len(class_names)

model = Sequential([
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    # the Conv2D lines were lost in extraction; the filter counts (16, 32, 64)
    # are assumed from the standard Keras image-classification layout
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])

# optimizer assumed to be 'adam'; only from_logits=True and the metrics
# survived extraction
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

epochs = 10
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=epochs)

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal", input_shape=(img_height, img_width, 3)),
    # any further augmentation layers were lost in extraction
])

model = Sequential([
    data_augmentation,
    layers.Rescaling(1./255),  # rescaling layer assumed, as in the first model
    # Conv2D lines lost in extraction; filter counts assumed as in the first model
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()

epochs = 15
history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=epochs)

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
# `file` comes from a streamlit file-uploader widget earlier in the app (not shown)
if file is not None:
    image = PIL.Image.open(file)  # reconstructed: the line opening the image was lost
    streamlit.image(image)
    img_array = np.array(image)
    img = tf.expand_dims(img_array, axis=0)
    predictions = model.predict(img)
    score = tf.nn.softmax(predictions[0])
    # the .format() arguments were lost in extraction and are reconstructed here
    if class_names[np.argmax(score)] == "benign":
        streamlit.title("This image is most likely {} with a {:.2f} percent confidence."
                        .format(class_names[np.argmax(score)], 100 * np.max(score)))
    if class_names[np.argmax(score)] == "malignant":
        streamlit.title("This image is most likely {} with a {:.2f} percent confidence."
                        .format(class_names[np.argmax(score)], 100 * np.max(score)))
    # the original also rendered output via a call with allow_html=True;
    # that call did not survive extraction
streamlit==0.79.0
pandas==1.2.3
numpy==1.18.5
matplotlib==3.3.2
seaborn==0.11.0
tensorflow-cpu==1.11.1
echo "\
[general]\n\
" > ~/.streamlit/credentials.toml
echo "\
[server]\n\
headless = true\n\
enableCORS = false\n\
port = $PORT\n\
" > ~/.streamlit/config.toml
git init
heroku login