Gene Expression Analysis On Cancer Dataset
Genes lie at the root of tumor formation throughout the body, better known as cancer.
When altered, they can inhibit basic processes such as cell death (apoptosis) and promote
cell division to an unhealthy extent. The expression levels of genes indicate how far a
cancer has progressed, the organ or tissue in which it originated and its likely
course. Analysing these gene expression values with traditional machine learning
methods can find relationships between genes more efficiently and accurately, and may
in future support the diagnosis of cancer. The main challenge is to efficiently
identify the most influential genes for specific types of cancer from their expression
values, which in turn raises the question of how those genes are related within each
type. A Random Forest model has been used to perform feature selection over the
dataset and extract the important features, i.e. the most influential genes. The results
are then visualized using standard Python packages (Scikit-plot, Matplotlib, Seaborn)
and a data visualization tool called Tableau.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Outline
1.2 Model IDE
1.3 Problem Statement
1.4 Objective
2 LITERATURE SURVEY
3 RANDOM FOREST ALGORITHM
3.1 Introduction to Machine Learning
3.2 Training the data
3.3 Methods in Supervised Learning
3.4 Approaches in Classification
3.5 Packages
3.6 Background Study
3.7 The Involvement of Genes
3.8 Types of Cancer and their probable genes
4 RESULTS AND DISCUSSION
4.1 Results
4.2 Analysis
4.3 System Requirements
5 SUMMARY AND FUTURE SCOPE
5.1 Summary
5.2 Future Scope
References
APPENDIX - Source Code
Screenshots
LIST OF FIGURES
1 Classification vs Regression
2 k-Nearest Neighbour
5 Distribution of dataset
8 Confusion Matrix
9 Classification Report
10 Precision formula
11 Recall formula
12 F1-Score formula
CHAPTER – 1
INTRODUCTION
1.1. OUTLINE:
Cancer, the second leading cause of death, has taken a massive toll on the
world's population. With many cases reported in people who have no prior history
of risk factors or unhealthy habits, the race to find a cure or a preventive measure is
intensifying. Because the causes are poorly understood and biological study alone is
limited, other fields need to step in to accelerate research into precision medicine
and faster diagnostic tools. This project provides a base that may be developed
further into a pre-diagnostic tool for the early detection of cancers.
1.2. MODEL IDE:
This project is written in Python 3.6 in the Jupyter Notebook, an interactive
scientific development environment. Various data manipulation, machine
learning and visualization packages are used to prepare and analyse the dataset
with a traditional machine learning model. Tableau, a data visualization tool,
is used to present the results produced by the model and to serve as supporting
evidence for them.
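As a rough illustration of the kind of analysis this toolchain supports, the sketch below ranks features by Random Forest importance using scikit-learn. It is not the project's code: the data are synthetic, the matrix size, the number of trees and the rule that "gene k is over-expressed in cancer type k" are all assumptions made only so that the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an expression matrix (NOT the project's data):
# 300 samples x 50 "genes", with 5 cancer-type labels (0..4).
rng = np.random.default_rng(0)
n_samples, n_genes, n_types = 300, 50, 5
X = rng.normal(size=(n_samples, n_genes))
y = np.arange(n_samples) % n_types

# Assumption for illustration: gene k is over-expressed in cancer type k.
for k in range(n_types):
    X[y == k, k] += 3.0

# Fit the forest; n_estimators=200 is an arbitrary illustrative choice.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one value per gene, normalized to sum to 1.
importances = forest.feature_importances_

# Indices of the highest-ranked genes, most important first.
top_genes = np.argsort(importances)[::-1][:n_types]
print("Top-ranked gene indices:", sorted(top_genes.tolist()))
```

On real expression data the ranking would be noisier, and correlated genes tend to share importance between them, so the top of the list is best read as a set of candidates rather than a definitive answer.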
1.3. PROBLEM STATEMENT:
Tumors are groups of abnormal cells that form lumps or growths. They can
start in any one of the trillions of cells in our bodies. Tumors grow and behave
differently, depending on whether they are cancerous (malignant), non-cancerous
(benign) or precancerous.
1.3.1 Cancerous tumors (Malignant)
Cancer can start in any part of the body. When cancer cells form a lump or
growth, it is called a cancerous tumor. A tumor is cancerous when it:
• has cells that can break away and travel through the blood or lymphatic
system and spread to lymph nodes and distant parts of the body.
Cancer that spreads from the place where it first started (called the primary tumor)
to a new part of the body is called metastatic cancer. When cancer cells spread and
develop into new tumors, the new tumors are called metastases.
There are several types of cancer, and their rates of occurrence differ with
factors such as gender, age, lifestyle and habits. The machine learning model
created here covers five specific types of cancer, based on the gene expression
values of 16382 genes. Namely,
1.4. OBJECTIVE:
The main objective of this model is to enable earlier, faster and more reliable
detection and treatment of cancer. The model concentrates on five subtypes of
malignant cancer rather than on all types of cancer and the different types of cells
in which they may occur. If needed, it can be extended to other types of cancer or
tissue, provided that the required data are available, sound and reliable enough for
further processing. This matters because the model must give accurate results; any
unreliability may pose a risk in practical use.
CHAPTER – 2
LITERATURE SURVEY
In recent years, there has been extensive research on using machine learning
and deep learning algorithms to classify cancer-related data into specific types
of cancer and to analyse them. This is driven by the demand to understand
the deeper relationship between cancer and human genes, as well as
the relationships among the genes involved.
Qun-Xiong Zhu et al. (2018) proposed a system that uses MMI for feature selection
and ELM as the classifier to perform cancer classification computationally from gene
expression data.
Muxuan Liang et al. (2015) proposed a new machine learning model called the
multimodal Deep Belief Network. It was trained using Contrastive Divergence in an
unsupervised manner to find correlations and key genes that play a role in the
pathogenesis of cancer.
CHAPTER – 3
RANDOM FOREST ALGORITHM
3.1. INTRODUCTION TO MACHINE LEARNING:
The term machine learning was coined in 1959 by Arthur Samuel. Having evolved
from the study of pattern recognition and computational learning theory in artificial
intelligence, machine learning explores the study and construction of algorithms that
can learn from and make predictions on data. Rather than following strictly static
program instructions, such algorithms make data-driven predictions or decisions
by building a model from sample inputs. Machine learning is employed in a
range of computing tasks where designing and programming explicit algorithms with
good performance is difficult or infeasible; example applications include email
filtering, detection of network intruders, and computer vision.
Within the field of data analytics, machine learning is a method used to devise
complex models and algorithms that lend themselves to prediction; in commercial
use, this is known as predictive analytics. These analytical models allow
researchers, data scientists, engineers, and analysts to "produce reliable,
repeatable decisions and results" and uncover "hidden insights" through learning
from historical relationships and trends in the data.
3.2 TRAINING THE DATA:
There are two widely used types of training that can be done to
create a model:
i. Supervised Learning
ii. Unsupervised Learning
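The difference between the two types of training can be sketched with a toy example in plain Python. The six "expression levels" and their labels are made-up numbers, and the two simple methods used here (nearest centroid for the supervised case, 2-means clustering for the unsupervised case) are chosen only for illustration; they are not the methods used in this project.

```python
from statistics import mean

# Toy 1-D "expression levels" for six samples (made-up numbers).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

# Supervised: the labels are given, so learn one centroid per label.
centroids = {
    name: mean(v for v, lab in zip(values, labels) if lab == name)
    for name in set(labels)
}

def predict(x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda name: abs(x - centroids[name]))

# Unsupervised: no labels are used; 2-means finds two groups by itself.
def two_means(data, iterations=10):
    """Return the two cluster centres found by a basic 2-means loop."""
    lo, hi = min(data), max(data)  # initial centres: the two extremes
    for _ in range(iterations):
        near_lo = [v for v in data if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in data if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return lo, hi

print(predict(4.5))        # label predicted from the labelled examples
print(two_means(values))   # centres of the two discovered groups
```

The supervised part can name its answer because it was shown labelled examples; the unsupervised part only reports that two groups exist and where their centres lie.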
3.3. METHODS IN SUPERVISED LEARNING:
Classification
Regression
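The two methods differ in the kind of target they predict: classification returns a discrete category, while regression returns a continuous number. The following minimal sketch, on made-up data, uses a 1-nearest-neighbour rule for classification and a least-squares line for regression; both are illustrative choices, not the project's model.

```python
# Made-up training data: one feature per sample.
x = [1.0, 2.0, 3.0, 4.0]
y_class = ["low", "low", "high", "high"]  # discrete target   -> classification
y_value = [2.0, 4.0, 6.0, 8.0]            # continuous target -> regression

def classify(query):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(range(len(x)), key=lambda i: abs(x[i] - query))
    return y_class[nearest]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    covariance = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    variance = sum((a - mx) ** 2 for a in xs)
    slope = covariance / variance
    return slope, my - slope * mx

slope, intercept = fit_line(x, y_value)

print(classify(3.6))             # a category
print(slope * 3.6 + intercept)   # a number
```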
3.3.1 Classification: