
Gene Expression Analysis On Cancer Dataset


ABSTRACT

Genes are the basis of the tumor formations around the body collectively known as cancer.
When mutated, they inhibit basic processes such as cell death (apoptosis) and promote cell
division to an unhealthy extent. The expression of every gene provides a baseline for how
far a cancer has progressed, the organ or tissue it originated from, and its
approximate course. Analysing such gene expression values using
traditional machine learning methods provides higher efficiency and accuracy at
finding relationships between genes and may serve as a future basis for cancer diagnosis
using these values. The main challenge is to use this basis to efficiently
compute the most influential genes for specific types of cancer using their expression
values, and thus to raise the question of a potential relationship between them for each
type. A Random Forest model has been used to perform feature selection over the
dataset to extract the important features, i.e., the most influential genes. They are then
visualized using traditional packages in Python (Scikit-plot, Matplotlib, Seaborn)
and using a data visualization tool called Tableau to project the results of the analysis.

TABLE OF CONTENTS

Chapter No. Title Page No.


Declaration iii
Acknowledgement iv
Abstract v
Table of Contents vi
List of Figures vii

1 INTRODUCTION 01
1.1 Outline 01
1.2 Model IDE 01
1.3 Problem Statement 01
1.4 Objective 03
2 LITERATURE SURVEY 04
3 RANDOM FOREST ALGORITHM 06
3.1 Introduction to Machine Learning 06
3.2 Training the data 07
3.3 Methods in Supervised Learning 08
3.4 Approaches in Classification 09
3.5 Packages 13
3.6 Background Study 16
3.7 The Involvement of Genes 17
3.8 Types of Cancer and their probable genes 20
4 RESULTS AND DISCUSSION 26
4.1 Results 26
4.2 Analysis 37
4.3 System Requirements 38
5 SUMMARY AND FUTURE SCOPE 40
5.1 Summary 40
5.2 Future Scope 40
References 41
APPENDIX - Source Code 45
Screenshots 50

LIST OF FIGURES

S.No Figures Page No.

1 Classification vs Regression 8

2 k-Nearest Neighbour 10

3 Random Forest Simplified 12

4 The Epigenetics of Cancer 19

5 Distribution of dataset 27

6 Density plot of gene_1 vs gene_2 28

7 Cumulative density plot of 4 features 29

8 Confusion Matrix 30

9 Classification Report 31

10 Precision formula 31

11 Recall formula 32

12 F1-Score formula 32

13 Box Graph of Important Features 34

14 10th Decision Tree 35

15 20th Decision Tree 36

16 Line graph – Averaged important features vs Classes 37

CHAPTER – 1

INTRODUCTION

1.1. OUTLINE:

Cancer, being the second leading cause of death, has taken a massive toll on
the population of Earth. With many cases reported even in people with no prior history
of risks or unhealthy habits, the race to find a cure or a preventive measure is
intensifying. Because the causes are often unreliable indicators and biological study
alone is limited, other fields need to intervene to accelerate the study and
discovery of precision medicine and quicker diagnostic tools. This project acts as a
base which may be further developed into a pre-diagnostic tool for the early
detection of cancers.
This project uses Python 3.6 for the programming in a scientific development
environment called the Jupyter Notebook. Various data manipulation, machine
learning and visualization packages are used to create and analyse the dataset
using a traditional machine learning model. A data visualization tool called Tableau
is used to interpret the results provided by the model after analysis, to represent
and support the intended result.

1.2. MODEL IDE:

Jupyter Notebook (formerly IPython Notebooks) is a web-based
interactive computational environment for creating Jupyter notebook documents.
The IDE was installed in Anaconda, an open-source distribution of the languages
Python and R used to perform Data Science and Machine Learning. The IDE's UI
in which the model was developed is given in Fig 1.1.

1.3. PROBLEM STATEMENT:

Tumors are groups of abnormal cells that form lumps or growths. They can
start in any one of the trillions of cells in our bodies. Tumors grow and behave
differently, depending on whether they are cancerous (malignant), non-cancerous
(benign) or precancerous.

1.3.1 Cancerous tumors (Malignant)

Cancer can start in any part of the body. When cancer cells form a lump or
growth, it is called a cancerous tumor. A tumor is cancerous when it:

• grows into nearby tissues

• has cells that can break away and travel through the blood or lymphatic
system and spread to lymph nodes and distant parts of the body.

Cancer that spreads from the first place it started (called the primary tumor) to
a new part of the body is called metastatic cancer. When cancer cells spread and
develop into new tumors, the new tumors are called metastases.

1.3.2 Types of malignant cancer:

There are several types of cancer, with rates of occurrence that differ based
on criteria such as gender, age, lifestyle and habits. The machine
learning model created here discusses the occurrence of five specific types
of cancer based on the gene expression values of 16382 genes, namely:

• BRCA – Breast Cancer

• LUAD – Lung Adenocarcinoma (Lung Cancer)

• PRAD – Prostate Adenocarcinoma (Prostate Cancer)

• KIRC – Kidney Renal Clear Cell Carcinoma (Kidney Cancer)

• COAD – Colon Adenocarcinoma (Colon Cancer)
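As an illustration of how such a dataset can be fed to the model, the sketch below builds a synthetic expression matrix (random values and a placeholder gene count, not the project's actual data) and ranks genes with a Random Forest, the technique described in the abstract:

```python
# Sketch of a gene-expression matrix for the 5 tumour types and Random
# Forest feature selection. Sample count, gene count and values are
# synthetic placeholders, not the project's real dataset (16382 genes).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
classes = ["BRCA", "LUAD", "PRAD", "KIRC", "COAD"]
n_samples, n_genes = 100, 50

X = rng.random((n_samples, n_genes))   # rows = tumour samples, cols = genes
y = rng.choice(classes, size=n_samples)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ ranks genes by mean impurity decrease; the
# top-ranked columns correspond to the "most influential genes".
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
print("Top 10 gene indices:", top10)
```

On the real dataset the column indices would map back to gene identifiers, which is how the influential genes per cancer type are reported.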

Cancerous or malignant tumors progress through 4 stages, where the 4th
stage is the final and most critical stage. During this stage, the cancer has spread
throughout the affected person's body, so the time available for treatment is
reduced. Hence, a biopsy may waste time, posing a threat to the patient's life.
Detection based on gene expression aids in earlier and
faster detection and treatment of cancer.

1.4. OBJECTIVE:

The main objective of this model is to ensure earlier, faster and more reliable
detection and treatment of cancer. The model concentrates on 5 subtypes of
malignant cancer rather than all types of cancer and the different types of cells
they may occur in. If needed, the model can be extended to support other cancer
or tissue types, under the condition that the required data is available,
proper and reliable for further processing, since any unreliability may pose a
risk if the model is applied in practical usage.

CHAPTER – 2

LITERATURE SURVEY

In recent years, research on exploiting machine learning
and deep learning algorithms to classify cancer-related data into specific types
of cancer, and to analyse it, has taken place widely. This is due to the demand for
understanding the deeper relationship between cancer and human genes, and also
the relationships among the genes themselves.

Joseph M. De Guia et al proposed using microarray gene expression related
to different types of cancer genes to classify them as accurately as possible. (2018)

TaeJin Ahn et al proposed a Deep Neural Network (DNN), a part of Deep
Learning, to classify or identify normal cells from cancerous cells using gene
expression data integrated from various datasets. (2018)

Qun-Xiong Zhu et al proposed a system using MMI for feature selection and
ELM as the classifier, to perform cancer classification computationally using gene
expression data. (2018)

A comparison between two machine learning models (Logistic Regression and
Artificial Neural Network) and a statistical algorithm (ANOVA) was performed by
Behrouz Shamsei and Cuilan Gao to determine their costs, advantages and
drawbacks. The dataset used was animal gene expression values for the
Medulloblastoma cancer type. (2016)

Gene expression data for Hepatocellular carcinoma was analysed using
clustering and classification techniques (Network Topology and SVM) by Chen
Shen and Zhi-Ping Liu to identify possible module biomarkers. (2017)

Feature Extraction methods such as Chi-Square, F-Score, PCA and MRMR
and an ensemble-SVM classifier were used in collaboration to classify Colon cancer
by Saima Rathore et al. (2014)

Stacked Denoising Autoencoder, a type of Neural Network, was used as a
Feature Extraction method to learn the highly influential genes from a generated
gene expression dataset by Vitor Teixeira et al. (2017)

Muxuan Liang et al proposed a new machine learning model, called
multimodal Deep Belief Network. It was trained using Contrastive Divergence in an
unsupervised manner to find correlations and key genes that play a role in the
pathogenesis of cancer. (2015)

Relief-F is a filtering algorithm used for Feature Selection. Linear-SVM and
k-NN classifiers were used with this filtering algorithm by Yuhang Wang
and F. Makedon to compare its efficiency against other feature
filtering methods. This showed that the Relief filtering algorithm provided
better performance on a gene expression (microarray) dataset. (2004)

Ujjwal Maulik et al proposed a method where a forward greedy search
algorithm was used together with Transductive SVMs to selectively procure genes
that could be used to predict cancer subtypes. (2013)

Ambrosio Hernandez et al published an article assessing the possibility of
gene expression patterns playing a vital role in cancers (specifically Colon Cancer).
A genomic approach was used to determine the effects of such patterns and their
reliability during different metastatic stages of Colon Cancer in humans. (2000)

Sara Alghunaim and Heyam H. Al-Baity performed a comparative study of
the efficiency of 3 machine learning models (SVM, Decision Tree and Random
Forest), with gene expression and DNA Methylation datasets, to analyse the
performance and drawbacks of traditional machine learning methods on such
datasets. (2019)

A Deep Neural Forest Model was used by Jayadeep Pati on a gene
expression dataset for Lung Cancer to analyse the genes and provide a targeted set
of genes that may cause Lung Cancer. (2019)

Jig Xu et al proposed a novel deep flexible Neural Network Model to classify
the type of cancer present using gene expression data.

CHAPTER – 3

RANDOM FOREST ALGORITHM

3.1 INTRODUCTION TO MACHINE LEARNING:

Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g., progressively
improve performance on a specific task) from data, without being explicitly
programmed.

The name machine learning was coined in 1959 by Arthur Samuel. Evolved
from the study of pattern recognition and computational learning theory in artificial
intelligence, machine learning explores the study and construction of algorithms that
can learn from and make predictions on data – such algorithms overcome following
strictly static program instructions by making data-driven predictions or decisions,
through building a model from sample inputs. Machine learning is employed in a
range of computing tasks where designing and programming explicit algorithms with
good performance is difficult or unfeasible; example applications include email
filtering, detection of network intruders, and computer vision.

Machine learning is closely related to (and often overlaps with) computational
statistics, which also focuses on prediction-making through the use of computers. It
has strong ties to mathematical optimization, which delivers methods, theory and
application domains to the field. Machine learning is sometimes conflated with data
mining, where the latter sub-field focuses more on exploratory data analysis and is
known as unsupervised learning.

Within the field of data analytics, machine learning is a method used to devise
complex models and algorithms that lend themselves to prediction; in commercial
use, this is known as predictive analytics. These analytical models allow
researchers, data scientists, engineers, and analysts to "produce reliable,
repeatable decisions and results" and uncover "hidden insights" through learning
from historical relationships and trends in the data.

3.2 TRAINING THE DATA:

There are two widely used types of training that can be done to
create a model:

i. Supervised Learning
ii. Unsupervised Learning

3.2.1 Supervised Learning:

Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a function
from labeled training data consisting of a set of training examples. In supervised
learning, each example is a pair consisting of an input object (typically a vector) and
a desired output value (also called the supervisory signal). A supervised learning
algorithm analyzes the training data and produces an inferred function, which can
be used for mapping new examples. An optimal scenario will allow the algorithm
to correctly determine the class labels for unseen instances. This requires the
learning algorithm to generalize from the training data to unseen situations in a
"reasonable" way.

3.2.2 Unsupervised Learning:

Unsupervised machine learning is the machine learning task of inferring a
function that describes the structure of "unlabeled" data (i.e. data that has not been
classified or categorized). Since the examples given to the learning algorithm are
unlabeled, there is no straightforward way to evaluate the accuracy of the structure
that is produced by the algorithm; this is one feature that distinguishes unsupervised
learning from supervised learning and reinforcement learning.
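The unsupervised counterpart can be sketched with a clustering algorithm (the data points below are invented): no labels are given, so the algorithm only infers structure from the inputs themselves.

```python
# Unsupervised learning in miniature: KMeans groups unlabeled points
# into clusters purely from their positions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])  # no labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster ids (0/1) are arbitrary: there is no "correct" label to
# compare against, which is what makes evaluation non-trivial.
print(km.labels_)
```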

The type of training used in this model is SUPERVISED LEARNING.

3.3. METHODS IN SUPERVISED LEARNING:

Supervised Learning mainly consists of two methods:

• Classification

• Regression

Fig 3.1 Classification vs Regression

3.3.1 Classification:

In machine learning, classification is the problem of identifying to which of a
set of categories (sub-populations) a new observation belongs, on the basis of a
training set of data containing observations (or instances) whose category
membership is known. Examples are assigning a given email to the "spam" or
"non-spam" class, and assigning a diagnosis to a given patient based on observed
characteristics of the patient (gender, blood pressure, presence or absence of
certain symptoms, etc.). Classification is an example of pattern recognition. An
algorithm that implements classification, especially in a concrete implementation, is
known as a classifier. The corresponding unsupervised procedure is known as
clustering, and involves grouping data into categories based on some measure of
inherent similarity or distance.
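Continuing the spam example above, a minimal sketch (with invented feature values) of a classifier assigning a new observation to one of a known set of categories:

```python
# A toy classifier for the spam example: category membership of the
# training set is known, and a new observation is assigned a category.
from sklearn.tree import DecisionTreeClassifier

# Features: [number of links, ALL-CAPS word count] -- invented values.
X_train = [[0, 0], [1, 1], [8, 6], [9, 4]]
y_train = ["non-spam", "non-spam", "spam", "spam"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[7, 5]]))  # classify a new, unseen email
```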
