


Volume 8, Issue 6, June – 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Lung Cancer Detection using Machine Learning Algorithms and Neural Network on a Conducted Survey Dataset Lung Cancer Detection
Ratika
Student
Master of Computer Applications
Graphic Era Hill University, Dehradun, India

Nisha Gupta
Student
Master of Computer Applications
Graphic Era Hill University, Dehradun, India

Abstract:- Lung cancer is the growth of malignant cells in the lungs, a condition in which lung cells proliferate uncontrollably. Due to the rising frequency of cancer, the death rate for both men and women has increased. Although lung cancer cannot be entirely averted, the risk can be decreased. Therefore, early identification of lung cancer is essential for improving patient survival. Lung cancer incidence is directly correlated with the frequency of heavy smoking. Various classification techniques, including Naive Bayes, Random Forest, Logistic Regression, KNN, Kernel SVM, and an Artificial Neural Network, were used to investigate lung cancer prediction. The primary goal of this study is to investigate the effectiveness of classification algorithms and a neural network in the early identification of lung cancer.

Keywords:- Naive Bayes, Random Forest, Logistic Regression, KNN, Kernel SVM, Artificial Neural Network, Machine Learning, Lung Cancer.

I. INTRODUCTION

Lung cancer is the most treacherous disease for human beings. It is responsible for more deaths than colon, prostate, ovarian, and breast cancer combined. Lung cancer is a serious health concern, with a count of 225,000 people each year in the United States of America alone. The main factor causing lung cancer is smoking, and the likelihood of being affected by cancer is directly proportional to the duration of smoking. Detecting lung cancer manually is a very tedious and risky job, even for specialists. To gain deeper insight and identify lung cancer in its early stages, different machine learning methods are used for classification. By applying techniques such as random forest and other classification algorithms, an automated system can be built that achieves a higher accuracy rate and supports accurate classification.

Lung cancer is the leading cause of cancer death in both men and women in the United States. The main objective of this paper is to analyze the available lung cancer data, apply it to lung cancer survivability prediction, and develop accurate survival prediction models using machine learning. Logistic regression, naive Bayes, KNN, Random Forest (RF), Kernel SVM, and an artificial neural network have been applied to construct a lung cancer survivability prediction model. The classifiers developed in this work predict the various factors that influence survival time, which would help doctors make more informed decisions about treatment plans and help patients make more educated decisions about different treatment options.

This study examines the survival rate of patients with advanced lung cancer who did not receive any type of therapeutic modality and evaluates performance scores for daily activities; the results show a slight improvement in survival rates. Random Forest was found to give good prediction performance, with an accuracy of 88%, and the artificial neural network gave the best prediction, with an accuracy of 89%.

II. LITERATURE REVIEW

In paper [11], Pankaj Nanglia, Sumit Kumar, and others introduced a novel hybrid technique known as the Kernel Attribute Selected Classifier, in which they integrate SVM with a Feed-Forward Back Propagation Neural Network (FFBPNN), helping to lower the computational complexity of the classification. They proposed a three-block process for classifying the dataset. The first block involves feature extraction using the SURF method, the second block involves optimization using a genetic algorithm, and the third block involves classification using FFBPNN.

• Chao Zhang, Xing Sun, Kang Dang, and others used a multicenter dataset to conduct a sensitivity analysis in paper [12]. The two categories they selected were diameter and pathological outcome.
• In paper [18], K. Mohanambal, Y. Nirosha, et al. studied the structural co-occurrence matrix (SCM) to extract features from the images and, based on these features, categorized them as malignant or benign. An SVM classifier is used to classify the lung nodules according to their malignancy level (1 to 5).
• The paper [16] by Radhika P. R. and Rakhi A. S. Nair primarily focused on the prediction and categorization of medical imaging data. They made use of the data.world dataset and the UCI Machine Learning Repository. Support vector machines had superior accuracy (99.2%), according to a comparative study using several machine learning algorithms. Naive Bayes provides 10%, Decision Tree

IJISRT23JUN281 www.ijisrt.com 68
provides 80% 87.87% and 66.7% are provided via logistic regression.
• The algorithm for lung cancer detection was examined in paper [17] by Vaishnavi D., Arya K. S., Devi Abirami T., and M. N. Kavitha. They used the discretely sampled Dual-Tree Complex Wavelet Transform (DTCWT) for pre-processing. The second-order statistical texture analysis approach known as GLCM (gray-level co-occurrence matrix) provides a table of the co-occurrence of various combinations of gray levels in an image.

III. METHODOLOGY

The overall economic development of a developing country such as India, where the majority of the population depends on good health, is threatened by lung cancer. Therefore, lung cancer detection ought to be more precise and reliable. The lung cancer parameters are gathered from an open source. Python is the programming language used.

Numerous parameters, such as smoking, anxiety, peer pressure, chronic disease, fatigue, allergy, and alcohol consumption, are used to predict lung cancer. The user starts activity in this system by using the lung cancer dataset. Data gathered from the user during the data collection and pre-processing steps is used. The initial data is then analyzed and split into training and testing datasets; the model is then fitted to the dataset, evaluated, and its accuracy is reported to the user.

The system is shown as a block diagram:

Collection of data -> Selection of data -> Processing data -> Apply machine learning algorithms -> Evaluate result -> Display final result

Fig. 1 System Block Diagram
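The steps of the block diagram can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the code used in the study: the toy data generator, its three binary features (standing in for survey answers such as smoking, fatigue, and alcohol consumption), the labelling rule, and all function names are hypothetical, and a simple K-nearest-neighbours vote stands in for the "apply machine learning algorithms" step.

```python
import math
import random


def make_toy_survey(n=200, seed=1):
    """Collection of data: hypothetical yes/no answers, not the real survey."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        row = [rng.randint(0, 1) for _ in range(3)]  # smoking, fatigue, alcohol (assumed)
        rows.append(row)
        labels.append(1 if sum(row) >= 2 else 0)  # toy labelling rule, not a clinical one
    return rows, labels


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Processing data: shuffle the indices and split into training and testing sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(rows) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda src, ids: [src[i] for i in ids]
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


def knn_predict(train_x, train_y, query, k=3):
    """Apply algorithm: majority vote among the k closest training rows (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)


def accuracy(y_true, y_pred):
    """Evaluate result: fraction of test rows predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


rows, labels = make_toy_survey()
x_train, x_test, y_train, y_test = train_test_split(rows, labels)
y_pred = [knn_predict(x_train, y_train, q, k=3) for q in x_test]
acc = accuracy(y_test, y_pred)
print(f"test accuracy: {acc:.2f}")  # Display final result
```

Keeping the test rows out of the fitting step is what makes the reported accuracy an estimate of performance on unseen respondents rather than a measure of memorisation.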

• Logistic regression
The logistic function, often known as the sigmoid function, is used in this method. This S-shaped curve maps any real-valued number to a value between 0 and 1, but never exactly at those limits. Logistic regression thus models the probability of the default class. The logistic function is the inverse of the log-odds (logit) function and is used to predict that probability. As a result, the inputs are combined linearly to create the model, and this linear combination is related to the log-odds of the default class.

• K-Nearest Neighbours
K-Nearest Neighbours is a method that stores all of the existing examples and classifies new cases based on similarity measures. All training data is used in the test phase. This accelerates training while slowing down the test phase and increasing its cost. If there are two classes, the number of neighbours in this method, k, is typically an odd number. Distance measures such as the Euclidean distance, Hamming distance, Manhattan distance, and Minkowski distance are used to calculate the distance between points in order to determine the ones that are closest in similarity.

• Naive Bayes
The naive Bayes classifier is a probabilistic classifier built on Bayes' theorem with strong assumptions about the independence of the features. By applying Bayes' theorem, P(X|Y) = P(Y|X) P(X) / P(Y), we may determine the probability that X will occur given that Y has already occurred. The evidence in this case is Y, and the hypothesis is X. Here, it is assumed that each predictor or feature is independent and that its presence has no effect on the others; hence the term "naive". In this instance, we assume that the values are drawn from a Gaussian distribution, which leads to Gaussian Naive Bayes.

• Random forest
A random forest is made up of a number of decision trees that are usually trained using the bagging approach. The fundamental concept of bagging is to reduce variance by averaging numerous noisy but roughly unbiased models.

• Kernel SVM
Using a kernel function, data can be input and then transformed into the form needed for processing. The term "kernel" is used because a set of mathematical functions provides the window for manipulating the data in a Support Vector Machine. A kernel function generally transforms the training data so that a non-linear decision surface becomes a linear one in a higher-dimensional space. In essence, it returns the inner product between two points in a common feature space.

• Artificial neural networks (ANNs)
Artificial neural networks (ANNs) are a class of machine learning techniques modelled after the structure and operation of the biological neural networks found in the human brain. ANNs are made up of interconnected nodes (artificial neurons), also referred to as "units", that are arranged in layers. The layers are typically divided into an input layer, one or more hidden layers, and an output layer. An overview of how an artificial neural network functions is given below:
• Input Layer: The neural network's initial data or training features are delivered to the input layer. Each input neuron is associated with a certain characteristic or aspect of the data.

• Hidden Layers: There may be one or more hidden layers between the input and output layers. Each hidden layer contains multiple artificial neurons or units, which process data and transmit it to subsequent layers.
• Weights and Bias: Each connection between neurons in the network has a corresponding weight. These weights are modified during the training phase to enhance the performance of the network. Each neuron also has a bias, which can be thought of as an activation threshold.
• Activation Function: The activation function determines a neuron's output from its inputs and internal state. Sigmoid, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent) are commonly used activation functions. They give the network non-linearities, which help it learn intricate patterns.
• Loss Function: A loss function measures the discrepancy between the neural network's output and the expected output. Whether the problem being solved is regression or classification determines the loss function that is used.
• Backpropagation: The primary algorithm used to train the neural network is backpropagation. In order to minimise the loss, it calculates the gradient of the loss function with respect to the network weights and modifies the weights in the opposite direction of the gradient. Usually, optimisation methods like stochastic gradient descent (SGD) or its variations are used for this process.
• Training: The neural network is trained by repeatedly supplying training examples to the network, modifying the weights via backpropagation, and minimising the loss function. The aim is to reduce the loss and enhance the prediction accuracy of the network.
• Prediction: After the neural network has been trained, predictions can be made on new, unseen data. Forward propagation is used to feed the input data through the network, and the output layer delivers the predicted outcome.
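The components listed above (layers, weights and biases, activation function, loss function, backpropagation, training, and prediction) can be tied together in a minimal sketch. It is written in plain Python under stated assumptions and is not the network used in this study: the class name TinyNetwork, the single hidden layer of three sigmoid units, the learning rate, the number of epochs, and the toy logical-OR data are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """One hidden layer, sigmoid activations, binary cross-entropy loss."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Weights start as small random values; biases start at zero.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        """Forward propagation: input layer -> hidden layer -> output unit."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, output

    def train_step(self, x, target, lr):
        """One stochastic gradient descent update via backpropagation."""
        hidden, output = self.forward(x)
        # With a sigmoid output and cross-entropy loss, the gradient of the
        # loss with respect to the output pre-activation is (output - target).
        delta_out = output - target
        for j, h in enumerate(hidden):
            # Error reaching hidden unit j, computed before w_out[j] changes.
            delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
            self.w_out[j] -= lr * delta_out * h
            for i, xi in enumerate(x):
                self.w_hidden[j][i] -= lr * delta_h * xi
            self.b_hidden[j] -= lr * delta_h
        self.b_out -= lr * delta_out


def mean_loss(net, data):
    """Average binary cross-entropy over the data set."""
    total = 0.0
    for x, t in data:
        _, y = net.forward(x)
        y = min(max(y, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -(t * math.log(y) + (1 - t) * math.log(1 - y))
    return total / len(data)


data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR, toy data
net = TinyNetwork(n_inputs=2, n_hidden=3)
initial_loss = mean_loss(net, data)
for epoch in range(500):
    for x, t in data:
        net.train_step(x, t, lr=0.5)
final_loss = mean_loss(net, data)
print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Each weight is moved against its own gradient, which is why the loss measured after training should be lower than the loss measured before it.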

Fig.2 parameters affecting lung cancer diagram

This graph shows that a large number of people above the age of 50 currently have lung cancer.

IV. RESULT AND DISCUSSION

The models were trained on the dataset. The random forest model achieved a training accuracy of 88%, and the artificial neural network gave an accuracy of 89%, which is higher than any other model. The table shows the accuracy achieved by all models:

Fig. 3 Lung cancer-age diagram

Table 1 Accuracy achieved by all models

MODEL      LOGISTIC REGRESSION   KNN   RANDOM FOREST   NAÏVE BAYES   KERNEL SVM   ANN
ACCURACY   87%                   86%   88%             83%           84%          89%
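Accuracy values such as those in Table 1, together with related measures such as precision, recall, F1-score, and support, are all derived from the counts of correct and incorrect predictions. The sketch below shows one way to compute them for a binary problem. The function names and the ten example labels are hypothetical and are not the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, F1-score, and support from the counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of true instances of the positive class
    }


# Hypothetical labels: 1 = lung cancer, 0 = no lung cancer.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(*confusion_counts(y_true, y_pred))
print(metrics)
```

Accuracy alone can hide errors on the minority class, which is why precision and recall are usually reported alongside it for medical screening tasks.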

The confusion matrix, which depicts the true label vs. the predicted label, is given as:

Fig. 4 Confusion matrix

The figure shows the precision, recall, F1-score, and support for the different categories.

Fig. 5 Precision, recall, F1-score

V. CONCLUSION AND FUTURE ENHANCEMENTS

To conclude, in this research the lung cancer features were classified with high accuracy and with limited computation power. The preprocessing of the data was done efficiently, which helped reduce the time consumed by the model. At the end of the research, a comparative study was done to assess the quality of the results. The random forest model, with an accuracy of 88%, gives a quality result, and the ANN gives an accuracy of 89%. It was also observed that Naïve Bayes has the lowest accuracy, at 83%. This makes the ANN an efficient model in terms of accuracy.

In future work, lung cancer detection can be done on imaging formats that build a 4D image structure, such as 4D MRI. This can be used for accurate segmentation of the dataset, which can help to detect lung cancer more accurately.

A second application of this research could be a full-scale system that assists radiologists and doctors in making better decisions. In future work, more datasets and parameters should be taken into consideration, which can benefit the classifiers.

REFERENCES

[1]. SRS Chakravarthy and H. Rajaguru. "Lung Cancer Detection using Probabilistic Neural Network with modified Crow-Search Algorithm." Asian Pacific Journal of Cancer Prevention, 20, 7, 2019, 2159-2166, doi: 10.31557/APJCP.2019.20.7.2159.
[2]. AA. Borkowski, MM. Bui, LB. Thomas, CP. Wilson, LA. DeLand, SM. Mastorides. "Lung and Colon Cancer Histopathological Image Dataset (LC25000)." arXiv:1912.12142v1 [eess.IV], 2019.
[3]. W. Ausawalaithong, A. Thirach, S. Marukatat, and T. Wilaiprasitporn, "Automatic Lung Cancer Prediction from Chest X-ray Images Using the Deep Learning Approach," 2018 11th Biomedical Engineering International Conference (BMEiCON), Chiang Mai, 2018, pp. 1-5, doi: 10.1109/BMEiCON.2018.8609997.
[4]. K. Yu, C. Zhang, G. Berry, et al. "Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features." Nat Commun 7, 12474 (2016), doi: 10.1038/ncomms12474.
[5]. G. A. Silvestri, et al. "Noninvasive staging of non-small cell lung cancer: ACCP evidence-based clinical practice guidelines (2nd edition)." Chest vol. 132, 3 Suppl (2007): 178S-201S, doi: 10.1378/chest.07-1360.
[6]. https://www.cdc.gov/cancer/lung/basic_info/symptoms.htm
[7]. https://www.kaggle.com/datasets/jillanisofttech/lung-cancer-detection
[8]. https://www.mayoclinic.org/diseases-conditions/lung-cancer/symptoms-causes/syc-20374620
[9]. https://www.datacamp.com/blog/classification-machine-learning
[10]. M. Šarić, M. Russo, M. Stella and M. Sikora, "CNN-based Method for Lung Cancer Detection in Whole Slide Histopathology Images," 2019 4th International Conference on Smart and Sustainable Technologies (SpliTech), Split, Croatia, 2019, pp. 1-4, doi: 10.23919/SpliTech.2019.8783041.
