1. Introduction
Classification is the process of categorizing or grouping data based on specific criteria. In machine learning and data science, classification is a supervised learning task in which a model predicts the class or category of a given observation based on its input features [1].
In classification, a training dataset is used to create a model that can be used to classify new data. The training dataset contains labeled examples of input features and their corresponding output labels or classes. A model is trained on this dataset to learn patterns and relationships between the input features and output classes.
Once the model is trained, it can be used to predict the class or category of new, unseen data by feeding in the input features and using the learned relationships to determine the corresponding output label or class.
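The train-then-predict workflow described above can be sketched with a tiny k-nearest-neighbor classifier. This is only an illustrative sketch: the function name, the two-feature toy samples, and the labels are invented for the example and are not taken from the dataset used in this paper.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every labeled training example
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    # Majority vote among the labels of the k closest examples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two input features per sample, two output classes
train_X = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

# Classify a new, unseen observation from its input features
prediction = knn_predict(train_X, train_y, (3.9, 4.1), k=3)
```

Here the labeled examples play the role of the training dataset, and the call to `knn_predict` corresponds to feeding the input features of unseen data to the trained model.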
Classification is commonly used in a variety of applications, such as spam detection, sentiment analysis, image recognition, and medical diagnosis. There are many different algorithms and techniques used for classification, including logistic regression, decision trees, support vector machines, and neural networks. The choice of algorithm depends on the type and complexity of the data, as well as the desired level of accuracy and interpretability [2].
Cancer classification using machine learning is a rapidly evolving field that involves the development and application of algorithms to automatically classify cancers based on their characteristics. Machine learning algorithms use patterns in data to learn and make predictions, and they have shown great promise in improving cancer diagnosis and treatment.
There are several approaches to cancer classification using machine learning. One common approach is to use supervised learning algorithms, which are trained on labeled data to predict the class of new, unlabeled data. In the context of cancer classification, labeled data may include images of cancer cells or tissue samples that have been annotated with their corresponding cancer type or stage. Supervised learning algorithms, such as decision trees, random forests, and support vector machines, can be used to automatically classify new cancer samples based on their features, such as gene expression profiles, histopathological images, or radiographic images.
Another approach to cancer classification using machine learning is to use unsupervised learning algorithms, which do not require labeled data. Instead, unsupervised learning algorithms identify patterns and groupings in the data on their own. Clustering algorithms, such as k-means or hierarchical clustering, can be used to group cancer samples based on their similarities, allowing researchers to identify subtypes of cancer or to group patients with similar characteristics for personalized treatment.
Deep learning algorithms, such as convolutional neural networks (CNNs) [3], have also been used for cancer classification. These algorithms are particularly effective at analyzing images, such as histopathological or radiographic images, and can automatically learn features that are important for distinguishing between different cancer types or stages.
Overall, cancer classification using machine learning has the potential to improve cancer diagnosis, treatment, and research by providing more accurate and personalized predictions. However, it is important to ensure that these algorithms are validated on diverse and representative datasets to avoid biases and ensure their reliability in clinical practice.
Cancer is a leading cause of death worldwide, accounting for millions of deaths every year [4,5]. Cancer is a genetic disease, caused by changes to genes that trigger cells to grow uncontrollably [6]. A gene is a sequence of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) that exists in every cell of the body and carries an individual’s genetic code. RNA acts as a messenger, carrying genetic information from DNA so that it can be expressed as proteins [7].
DNA sequencing is the process of determining the nucleic acid sequence in DNA. Comparing healthy and mutated DNA sequences can help diagnose different diseases, including various cancers. DNA sequencing used to take years, but with the advent of precision medicine and next-generation sequencing it can now be performed in hours [8]. RNA-Sequencing (RNA-Seq) is a sequencing technique used to quantify the expression of genes and characterize their sequences at the same time. RNA-Seq gene expression values can be used to classify cancer tumor types [9].
Previous studies have used machine learning to classify tumor types based on RNA-Seq data. For example, the authors of [10] proposed using a CNN to classify breast cancer. They achieved an accuracy of
% by selecting the hyperparameters that gave the best performance for the proposed CNN model. In [11], the authors converted RNA-Seq values into 2D images and used them to extract features to train a deep learning model. Another deep learning method was proposed in [12], in which the RNA-Seq values were also converted into 2D images. The authors of [12] used augmentation techniques to increase the dataset 5-fold. By applying the CNN model, an accuracy of
% was achieved. In [13], the authors conducted an RNA-Seq analysis for tumor classification. The proposed analysis focused on only five target tumor types (prostate, lung, breast, kidney, and colon), but utilized a similar 60,483 RNA-Seq feature dataset. However, the obtained tumor sample set was unbalanced across the five tumor classes. Extremely unbalanced classes can skew the classification accuracy in favor of the majority class. The authors applied a popular over-sampling technique called the Synthetic Minority Oversampling Technique (SMOTE), which creates “synthetic” examples rather than over-sampling with replacement. With the application of SMOTE, the tumor sample set was balanced with 1500 total samples (300 for each of the tumor types). They also employed dimensionality reduction using Principal Component Analysis (PCA). For modeling, the authors of [13] used a broad range of classification models for comparison.
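To make the idea behind SMOTE concrete, the following minimal sketch generates synthetic minority-class samples by interpolating between a sample and one of its nearest minority-class neighbors. It is a simplified illustration, not the implementation used in [13]: the function name, the random choice of base samples, the parameter defaults, and the toy data are all assumptions made for this example.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen sample (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        # Random point on the line segment between base and neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority class with 4 samples, grown to 10 samples
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_samples = smote(minority, n_new=6)
balanced_minority = minority + new_samples
```

Because each synthetic point lies between two existing minority samples, the new examples are not exact duplicates, which is the difference from over-sampling with replacement noted above.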
The authors of [14] conducted a study focused on classifying five subtypes of breast cancer (BRCA) using RNA-Seq and machine learning. They used a broad range of classification models for comparison, similar to five of our models. Their dataset consisted of 4731 samples × 19,737 RNA-Seq features. They did not utilize any dimensionality reduction techniques, instead electing to train and test on the full feature set.
In [15], the authors utilized a deep learning model in their classification of 33 tumor types. Their dataset consisted of 10,267 samples × 20,531 RNA-Seq features. They normalized their RNA-Seq features during preprocessing. They also applied dimensionality reduction using a process similar to a high correlation filter and built a reduced gene feature set based on the top pairing of highly correlated RNA-Seq expressions to target tumor types. This process, which they called mutual information (MI), yielded a reduced feature set of 3600 features. Next, they converted the RNA-Seq feature set information into an image format for modeling, utilizing a Dense Network Convolutional Neural Network (DenseNetCAM) model. The authors of [16], in their RNA-Seq study, utilized the same dataset as in [15] minus two tumor types (esophageal (ESCA) and stomach (STAD) cancer). The authors, however, did not employ any dimensionality reduction. They utilized a single classification model, k-Nearest Neighbors.
The author of [17], similar to [15], utilized two-dimensional image transformations of the RNA-Seq data to classify 33 tumor types. They also used the same 10,267 samples × 60,484 RNA-Seq feature dataset. The author of [17] normalized the gene feature values and then applied a low variance filter as the dimensionality reduction technique, reducing the RNA-Seq gene features to 10,381. Next, they reformatted the RNA-Seq gene features as two-dimensional images and classified the tumors using a convolutional neural network model.
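The two preprocessing steps mentioned above, a low variance filter followed by reshaping the remaining features into a two-dimensional image, can be sketched as follows. This is a hedged illustration only: the function names, the variance threshold, the zero-padding into a square grid, and the toy expression values are assumptions for the example and do not reproduce the procedure of [17].

```python
import math
import statistics

def low_variance_filter(samples, threshold):
    """Keep only the feature columns whose variance across samples exceeds threshold."""
    columns = list(zip(*samples))
    keep = [j for j, col in enumerate(columns) if statistics.pvariance(col) > threshold]
    return [[row[j] for j in keep] for row in samples], keep

def to_square_image(features, pad_value=0.0):
    """Pad a 1D feature vector and reshape it into a square 2D grid (assumed layout)."""
    side = math.ceil(math.sqrt(len(features)))
    padded = list(features) + [pad_value] * (side * side - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical normalized expression values: 3 samples x 6 genes
samples = [
    [0.10, 0.50, 0.90, 0.20, 0.50, 0.05],
    [0.10, 0.52, 0.10, 0.80, 0.50, 0.95],
    [0.10, 0.49, 0.50, 0.40, 0.50, 0.45],
]
# Genes that barely vary across samples carry little information for classification
reduced, kept = low_variance_filter(samples, threshold=0.01)
# Each sample's remaining features become one small image for a CNN-style model
image = to_square_image(reduced[0])
```

In this toy case, the near-constant genes are dropped and the remaining three values of the first sample are padded into a 2 × 2 grid.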
The previous studies, however, have limitations. For example, many studies do not utilize dimensionality reduction to reduce and optimize the number of gene features needed for modeling. In fact, to the best of our knowledge, no previous technique has utilized a factor analysis to reduce the number of features. Although some of these produced a high classification accuracy, they did so at the expense of longer computational times and in the end a more complex model. Other studies did not consider the impact of different machine learning techniques on the classification accuracy.
The method presented in this paper addresses these limitations by not only finding the optimum machine learning model for tumor classification accuracy, but also by minimizing the number of gene RNA sequence features required to achieve that high accuracy. We used different reduction techniques, including a factor analysis, for feature reduction. Moreover, the tumor classification accuracy achieved by the optimum method outlined in this paper exceeds that reported in similar previous studies.
In particular, this paper presents an effective method to classify cancer tissue samples by tumor type (breast cancer, lung cancer, colon cancer, etc.) using machine learning techniques. Over 60,000 gene RNA–Seq features were analyzed and six different machine learning classification algorithms were explored, individually and as an ensemble. The classification algorithms evaluated were logistic regression, k-Nearest Neighbor, decision tree, random forest, neural network, and support vector machine. To train, test, and validate these techniques, a dataset of 5400 tumor samples was used, representing 18 different cancer tumor types. RNA-Seq gene datasets involve enormous amounts of genetic data, which make analysis and predictive modeling both computationally intensive and time-consuming.
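One simple way to combine several classifiers into an ensemble is hard majority voting, sketched below. The combination rule is an assumption made for illustration, since this section does not specify how the ensemble is formed; the per-model predictions are likewise invented toy values.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions for one sample by hard majority voting."""
    counts = Counter(predictions)
    # most_common orders ties by first occurrence, so earlier models win ties
    return counts.most_common(1)[0][0]

# Hypothetical outputs of six classifiers for three tumor samples
model_predictions = {
    "logistic_regression": ["breast", "lung", "colon"],
    "knn": ["breast", "lung", "lung"],
    "decision_tree": ["lung", "lung", "colon"],
    "random_forest": ["breast", "lung", "colon"],
    "neural_network": ["breast", "kidney", "colon"],
    "svm": ["breast", "lung", "colon"],
}

# Transpose to per-sample votes, then take the majority label for each sample
ensemble = [majority_vote(votes) for votes in zip(*model_predictions.values())]
```

With voting of this kind, an individual model's error on a sample is outvoted whenever most of the other models agree on a different label.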
This paper is organized as follows. Section 2 provides the details of the composition of the dataset. Challenges with the very large feature set are tackled through the preprocessing stages detailed in Section 3. Classification using individual and ensemble models, along with results validation, is examined in Section 4. Model accuracy comparisons with related studies are explored in Section 5. The conclusions of the paper are offered in Section 6.