Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models

Dalmolin, Matheus; Azevedo, Karolayne S.; Souza, Luísa C. de; de Farias, Caroline B.; Lichtenfels, Martina; Fernandes, Marcelo A. C.

doi:10.3390/ai6010002

by Matheus Dalmolin ^{1,2,3,†}, Karolayne S. Azevedo ^{1,†}, Luísa C. de Souza ^{1,†}, Caroline B. de Farias ^{3,4,5,6}, Martina Lichtenfels ^{4} and Marcelo A. C. Fernandes ^{1,2,3,7,*,†}

¹ InovAI Lab, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
² Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
³ National Science and Technology Institute for Children’s Cancer Biology and Pediatric Oncology-INCT BioOncoPed, Porto Alegre 90620-110, RS, Brazil
⁴ Ziel Biosciences, Porto Alegre 90650-001, RS, Brazil
⁵ Children’s Cancer Institute (ICI), Porto Alegre 90620-110, RS, Brazil
⁶ Cancer and Neurobiology Laboratory, Experimental Research Center, Clinical Hospital (CPE-HCPA), Federal University of Rio Grande do Sul, Porto Alegre 90035-007, RS, Brazil
⁷ Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

AI 2025, 6(1), 2; https://doi.org/10.3390/ai6010002

Submission received: 23 September 2024 / Revised: 18 October 2024 / Accepted: 28 November 2024 / Published: 27 December 2024

(This article belongs to the Section Medical & Healthcare AI)

Abstract:

This study investigates the use of machine learning (ML) models combined with explainable artificial intelligence (XAI) techniques to identify the most influential genes in the classification of five recurrent cancer types in women: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Gene expression data from RNA-seq, extracted from The Cancer Genome Atlas (TCGA), were used to train ML models, including decision trees (DTs), random forest (RF), and XGBoost (XGB), which achieved accuracies of 98.69%, 99.82%, and 99.37%, respectively. However, the challenges in this analysis included the high dimensionality of the dataset and the lack of transparency in the ML models. To mitigate these challenges, the SHAP (Shapley Additive Explanations) method was applied to generate a list of features, aiming to understand which characteristics influenced the models’ decision-making processes and, consequently, the prediction results for the five tumor types. The SHAP analysis identified 119, 80, and 10 genes for the RF, XGB, and DT models, respectively, totaling 209 genes, resulting in 172 unique genes. The new list, representing 0.8% of the original input features, is coherent and fully explainable, increasing confidence in the applied models. Additionally, the results suggest that the SHAP method can be effectively used as a feature selector in gene expression data. This approach not only enhances model transparency but also maintains high classification performance, highlighting its potential in identifying biologically relevant features that may serve as biomarkers for cancer diagnostics and treatment planning.

Keywords:

explainable AI; machine learning; feature selection; RNA-seq; cancer; SHAP; gene expression

1. Introduction

Cancer is one of the most common causes of death among women, with breast cancer (BRCA) being one of the most prevalent, ranking second among cancer-related causes of death in women [1]. In the Americas, 30% of detected cases correspond to breast cancer, with a mortality rate of 190 per 100,000 cases [2]. Lung cancer, specifically lung adenocarcinoma (LUAD), has the highest incidence among women [3,4]. Ovarian cancer (OV) is considered one of the most fatal types of cancer due to its significant detection challenges [5]. Colon adenocarcinomas (COAD) rank as the third most common cancer worldwide, affecting approximately one million patients each year [6,7]. Globally, thyroid cancer (THCA) is three times more common in women, often diagnosed before the age of 30 [8,9]. Given its global health significance, governmental and scientific actions are urgently needed. Low- and middle-income countries typically face higher cancer burdens, with limited access to cancer prevention and treatment measures, resulting in lower survival rates. Early detection plays a vital role in improving cancer outcomes by identifying the disease in its initial stages [10]. Artificial intelligence (AI) techniques are increasingly employed in various aspects of cancer research and patient care, aiding in the detection, prognosis, monitoring, and analysis of various cancer types [11,12].

Studies have addressed challenges and concerns associated with the effective implementation of these models in oncology. In their study, Chua et al. mentioned that cancer encompasses distinct conditions with unique and complex patterns [13]. The high dimensionality of the data also poses a significant obstacle. A common approach to addressing this problem is feature selection, which involves choosing a subset of data without the need for applying transformations [14]. Another issue highlighted by Moncada et al. is the lack of transparency in some models, as many of them are often considered black boxes operating through complex and difficult-to-interpret algorithms. This lack of transparency limits the trust that patients and physicians place in the predictions of the models [15].

In an effort to mitigate this issue, explainable artificial intelligence (XAI) techniques have been employed to understand how these models make their decisions and which features or inputs have the most influence on model predictions [16,17]. The SHAP (Shapley Additive Explanations) technique is a part of the XAI toolkit and relies on SHAP values to explain the output of some machine learning (ML) and deep learning (DL) models [18].

In this context, this work proposes the use of an XAI technique based on SHAP values to identify the most relevant features in a multi-class classification problem among the five most recurrent types of cancer in women. This classification is based on RNA-seq gene expression data extracted from The Cancer Genome Atlas (TCGA). The data were applied to traditional ML models based on decision trees. Thus, this work makes the following specific contributions:

Application of the SHAP technique as a method for input feature dimensionality reduction.
Utilization of an XAI method to explain the behavior of classifiers based on the SHAP library in the Python programming language.
Development of high-performing models using the most influential RNA-seq gene expression values selected by SHAP.
Analysis of the key genes identified by the SHAP technique.

2. Related Works

Several studies have utilized machine learning techniques to gain insights into the development and characteristics of different types of cancer. The authors of [19] developed an automatic preliminary diagnosis system for breast cancer using Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), decision tree (DT), Naive Bayes, and random forest (RF). They classified breast cancer as benign or malignant using the Wisconsin Breast Cancer Dataset (WBCD), which contains information about the size and shape of tumor cells, achieving accuracy values ranging from 94% to 97%.

The study by Vural et al. utilized unsupervised machine learning techniques to cluster somatic mutation profiles of breast cancer data from TCGA [20]. They obtained three groups and subsequently investigated them, observing a relationship between the disease stage of patients and each cluster. Then, supervised machine learning techniques were applied to classify unknown breast cancer patients into the previously found clusters, achieving 70% accuracy using the random forest model.

In the work by Ram et al. [21], classification and feature selection for colon cancer, prostate cancer, and leukemia were conducted using gene expression data. In the research, the accuracy values obtained using the random forest algorithm were 85.45% for colon cancer, 66.66% for prostate cancer, and 100% for leukemia. The genes identified from these classifications were analyzed to observe their influence on cancer, revealing their significant roles in the progression of the respective pathologies.

A feature selection method, along with the SVM model, was applied to gene expression data with the aim of classifying samples into two subtypes of lung cancer: lung adenocarcinoma and lung squamous cell cancer. The selected feature list contained genes that exhibited differential expression between the two cancer types. This list was then used to train the SVM model, and the authors achieved accuracy values ranging from 91.00% to 96.70%, depending on the feature selectors used prior to classification [22].

In the study presented in [23], the authors utilized gene expression data to diagnose ovarian cancer using five machine learning algorithms: Generalized Linear Model (GLM), Classification and Regression Trees (CART), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and random forest. Among these algorithms, the random forest exhibited the best performance, achieving a sensitivity of 96% and a specificity of 83% for ovarian tissue cancer diagnosis.

The majority of studies that use gene expression for cancer prediction are in the field of deep learning [24]. Commonly employed techniques include convolutional neural networks (CNNs), Fully Connected Neural Networks (FCNNs), and Recurrent Neural Networks (RNNs). These studies often involve multiclass classification tasks for different cancer types. In one approach, the authors converted RNA-seq data into 2D images and applied a multi-layer CNN. This approach achieved an overall test accuracy of 96.90% [25]. In a similar vein, another study [26] utilized RNA-seq sequences transformed into 2D images for various cancer types and applied a CNN architecture, achieving an accuracy of 95.65%.

As mentioned earlier, to gain acceptance and integration into oncology, it is valuable to incorporate techniques that allow for the visualization and understanding of model decision-making processes. The authors of [18] used multiclass classification to detect breast cancer subtypes. They employed SVM, random forest, Extremely Randomized Trees (ERTs), and extreme gradient boosting (XGB) models to obtain prediction results. Subsequently, they applied the SHAP technique to identify the set of features that influenced these models. The accuracy values obtained ranged from 61% to 77%.

The SHAP technique has been applied in cancer-related research as well [27,28]. Hassan et al. used machine learning models applied to ultrasound and magnetic resonance images to detect prostate cancer, achieving accuracy values ranging from 80% to 97% [27]. Subsequently, the results of these models were subjected to the SHAP technique to elucidate the reasons for classifying each sample as benign or malignant. Yap et al. utilized RNA-seq data samples from genotype-tissue expression across 47 different tissues. They applied these data to a multi-layer CNN architecture, obtaining accuracy results ranging from 70% to 100% [28]. The SHAP technique was employed to identify the most relevant features and understand the biological processes involved in tissue differentiation and function.

3. Materials and Methods

This study utilized gene expression data, which comprise quantitative measurements of messenger RNAs present in a given sample relative to a specific physiological condition. The dataset included the five most recurrent types of cancer in women (BRCA, LUAD, THCA, OV, and COAD), totaling 3057 samples and 21,480 features. Initially, the dataset was used to train three widely used machine learning techniques (see Figure 1). Subsequently, the trained models were analyzed with the SHAP library to understand which features were most relevant in the decision-making process of each model. For each model, an individual matrix of SHAP values was obtained, quantifying the contribution of each feature to the classification of the five tumors. Then, features were selected based on their importance in the model’s output prediction, with a threshold of greater than or equal to 0.01%. Values below this threshold did not significantly contribute to improving the performance of the techniques used (see Figure 1).

Next, a combined list of unique genes from each SHAP values matrix was generated and processed using decision trees, random forest, XGBoost, Gaussian Naive Bayes (Gaussian NB), and Bernoulli Naive Bayes (Bernoulli NB) to determine whether good prediction results could be achieved from a reduced list. The entire workflow described above is depicted in Figure 1.
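The selection and merging steps described above can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the text, that each model yields one two-dimensional SHAP matrix (samples by genes) and that "importance" is a gene's share of the total mean absolute SHAP value; the function names and gene labels are hypothetical.

```python
def select_features(shap_matrix, feature_names, threshold=0.0001):
    """Keep features whose share of the total mean |SHAP| is >= threshold.

    threshold=0.0001 corresponds to 0.01%. Treating importance as a
    normalized share is an assumption made for this sketch.
    """
    n_samples = len(shap_matrix)
    mean_abs = [
        sum(abs(row[j]) for row in shap_matrix) / n_samples
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    if total == 0:
        return []
    return [
        name
        for name, value in zip(feature_names, mean_abs)
        if value / total >= threshold
    ]


def unique_union(*gene_lists):
    """Merge per-model gene lists, dropping duplicates, keeping first-seen order."""
    return list(dict.fromkeys(g for genes in gene_lists for g in genes))


# Toy SHAP matrices (rows = samples, columns = genes); all values are made up.
genes = ["GENE_A", "GENE_B", "GENE_C"]
shap_rf = [[0.5, 0.0, -0.3], [0.4, 0.0, 0.2]]
shap_xgb = [[0.0, 0.2, 0.1], [0.0, -0.4, 0.3]]

rf_genes = select_features(shap_rf, genes)
xgb_genes = select_features(shap_xgb, genes)
combined = unique_union(rf_genes, xgb_genes)
```

In this toy case the two per-model lists overlap in one gene, so the combined list is shorter than the sum of the individual lists, which is the same effect that reduces the per-model gene counts to a smaller set of unique genes.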

3.1. Database

The RNA-seq data were obtained from The Cancer Genome Atlas (TCGA) project through the GDC (Genomic Data Commons) database [29]. The datasets were downloaded using the R statistical software, version 4.2.0, with the TCGAbiolinks package [30]. The GDCquery function in TCGAbiolinks was used to search for and download genomic data from TCGA in the GDC database. This required several parameters, including project, legacy, data.category, data.type, platform, file.type, experimental.strategy, and sample.type.

The project parameter specifies a list of data to be downloaded. For the problem at hand, the five project codes corresponding to the five cancer types were provided: “TCGA-BRCA”, “TCGA-COAD”, “TCGA-OV”, “TCGA-LUAD”, and “TCGA-THCA”. The legacy parameter was set to “FALSE” to obtain harmonized data, meaning that all data originally generated by TCGA were processed through the same GDC pipeline for consistency.

The arguments “data.category” and “data.type” were used to filter the data files to be downloaded, with “Gene expression” and “Gene expression quantification”, respectively. The “Illumina HiSeq” platform was employed for downloading the data. By selecting “results” as the “file.type” argument, the legacy database was filtered, and “RNA-seq” was chosen as the “experimental.strategy” argument for generating expression profiles. Furthermore, tumor samples were selected using “Primary Tumor” as the “sample.type” argument.

The resulting data matrix consisted of samples from five different tumors (primary tumor samples) and was used to create the RNA-seq dataset. The dataset included samples from breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD), totaling 3057 tumor samples obtained from the five cancer types, with 29,488 common genes.

3.2. Data Preprocessing

The preprocessing of expression data was performed using the TCGAbiolinks package. The TCGAanalyze_Preprocessing function was used for data preprocessing, while the TCGAanalyze_Normalization function was employed for normalization.

Lastly, the TCGAanalyze_Filtering function was applied with a qnt.cut of 0.25, meaning that only genes whose mean intensity exceeded the 25th percentile of the mean intensities across all genes were retained. After applying this process, 22,115 genes were retained as informative, while 7373 genes (the lowest quartile of the 29,488 common genes) were considered irrelevant and were excluded from further analysis.
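The quantile filter can be sketched in a few lines of Python. This is an illustration of the idea only, not the TCGAbiolinks implementation: the linear-interpolation quantile and the dictionary layout (gene name mapped to per-sample values) are assumptions made for the example.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def filter_genes_by_mean(expression, qnt_cut=0.25):
    """Keep genes whose mean expression exceeds the qnt_cut quantile of all gene means."""
    means = {gene: sum(vals) / len(vals) for gene, vals in expression.items()}
    threshold = quantile(list(means.values()), qnt_cut)
    return [gene for gene, mean in means.items() if mean > threshold]


# Toy expression table with eight hypothetical genes and two samples each.
toy_means = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
expression = {f"g{i}": [m, m] for i, m in enumerate(toy_means)}
kept = filter_genes_by_mean(expression, qnt_cut=0.25)
```

With eight genes, the two with the lowest mean expression fall at or below the cut and are dropped, mirroring the removal of roughly one quarter of the genes reported above.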

As depicted in Figure 2, BRCA accounts for 36.34% of the cancerous tissues in the dataset, followed by LUAD at 17.63% and THCA at 16.52%. Colon cancer and ovarian cancer are present in smaller numbers, representing only 15.73% and 14.75%, respectively.

In this context, it is necessary to balance the data not only to improve the models’ performance but also to avoid issues like overfitting to BRCA, which has a disproportionate number of samples compared to the other cancer types. Imbalanced data can also lead to biased results and make it challenging for machine learning models to learn and generalize effectively. There are several techniques used to address data imbalance, and in this work, data resampling was used.

Data resampling here involves undersampling the larger classes to match the size of the minority class (OV). Thus, 421 samples were randomly extracted from each class present in the dataset. Consequently, the training dataset used for the models contained 2105 samples representing the five cancerous tissue types, labeled from 0 to 4, with each label associated with a class. A portion of the remaining samples was used to test the models’ performance, while the others were not used in this analysis.
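Random undersampling to a fixed class size can be sketched as follows. The helper name, the seed, and the toy class sizes are hypothetical; only the idea of drawing the same number of samples from every class comes from the text.

```python
import random
from collections import Counter, defaultdict


def undersample(labels, n_per_class, seed=0):
    """Return sorted sample indices with exactly n_per_class samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    selected = []
    for label, indices in by_class.items():
        if len(indices) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} samples")
        # Sampling without replacement keeps every selected sample distinct.
        selected.extend(rng.sample(indices, n_per_class))
    return sorted(selected)


# Toy label vector (hypothetical class sizes, not the TCGA counts).
labels = ["BRCA"] * 50 + ["LUAD"] * 30 + ["THCA"] * 25 + ["COAD"] * 22 + ["OV"] * 20
keep = undersample(labels, n_per_class=20)
balanced_counts = Counter(labels[i] for i in keep)
```

Calling the same function with n_per_class=421 on the five-class label vector would yield the 2105 balanced samples (5 × 421) described above.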

This study aims to identify the most influential genes in classifying different cancer types using the SHAP method to improve the interpretability of machine learning models. While most studies on cancer classification using gene expression data focus on maximizing predictive accuracy, the adopted approach emphasizes using explainable techniques that identify factors contributing to the model’s decisions. This strategy analyzes the underlying biological mechanisms while maintaining model consistency, applying techniques such as undersampling to address class imbalance.

3.3. Machine Learning Algorithms

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms based on statistical models trained on datasets provided to the model. There are various types of ML algorithms, primarily divided into four categories, which vary according to their learning paradigm: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [31,32,33,34]. Regression and classification models are considered supervised learning techniques. Algorithms such as decision trees, random forest, Support Vector Machine, and Logistic Regression are widely used in classification problems.

3.3.1. Decision Trees

Decision trees are a common way to represent the decision-making process through a branching structure similar to a tree, which simulates human thinking when making decisions [35]. They consist of different nodes initiated by the root node, also known as the decision node. The root node splits into branches, generating other nodes, and consequently, new outputs that will be processed by the cost function. The attribute with the best cost is considered the root node of that branch and undergoes further divisions until it reaches the final branch, also known as a leaf, which represents the final result of the algorithm.

Entropy and information gain are widely used to measure the impurity of a set and to choose the attribute on which to split. First, the class entropy and the entropy of each attribute are calculated using Equations (1) and (2); the information gain of each attribute then follows from Equation (3). The class entropy is given by

\mathrm{Entropy}(S) = -\sum_{i = 1}^{n} p_{i} \log_{2} p_{i}

(1)

where $p_{i}$ is the proportion of samples in the set $S$ that belong to class $i$, and $n$ is the number of existing classes. The entropy of an attribute $A$ in a respective branch is calculated as follows:

E(A) = \sum_{x \in \mathrm{values}(A)} \frac{|S_{x}|}{|S|} \, \mathrm{Entropy}(S_{x})

(2)

where $S_{x}$ is the subset of $S$ assigned to child $x$ of the branching on $A$. Therefore, the gain equation is expressed as follows:

G(S, A) = \mathrm{Entropy}(S) - E(A).

(3)
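Equations (1) to (3) translate directly into a few lines of Python. The sketch below is illustrative only; the binary "gene_high" and "gene_noise" attributes are hypothetical stand-ins for a thresholded expression value.

```python
import math
from collections import Counter


def entropy(labels):
    """Equation (1): Shannon entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Equation (3): Entropy(S) minus the weighted child entropy E(A) of Equation (2)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


labels = ["BRCA", "BRCA", "OV", "OV"]
gene_high = [True, True, False, False]   # separates the two classes perfectly
gene_noise = [True, False, True, False]  # carries no class information

gain_high = information_gain(labels, gene_high)
gain_noise = information_gain(labels, gene_noise)
```

The attribute that separates the classes perfectly recovers the full entropy of the parent set as gain, while the uninformative attribute yields a gain of zero, so a tree built with this criterion would split on the former.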

3.3.2. Random Forest

The random forest algorithm is computationally effective for regression and multiclass classification tasks. It was initially implemented by Breiman [36] and is based on the concept of ensemble learning, using a set of random decision trees in the learning process [36,37]. The training is initiated by selecting, through bootstrapping, a different random subset of size N from the input data for each unpruned tree, expressed as

D = [(x_{1}, y_{1}), \dots, (x_{N}, y_{N})]

(4)

where each feature vector $x_{i} = (x_{i, 1}, \dots, x_{i, M})^{T}$ denotes the M predictors, while $y_{i}$ is associated with the expected response.

Next, the splitting of nodes in each individual tree is determined by finding the best split among m predictors, where the m predictors are randomly selected at each node from the total available predictor set of size M, with m ≪ M. As a result, the decision trees in the model will have different conditions for their nodes, resulting in different structures [36,38,39]. Finally, a prediction is made by aggregating the results from these multiple decision trees, using a majority vote for classification and calculating the averages of individual results for regression [40].
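The three sources of randomness and aggregation described above (bootstrap sampling, random predictor subsets, and majority voting) can be sketched independently of any tree implementation. The helper names and the toy sizes are hypothetical; the tree-fitting step itself is deliberately left out.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate predictors for one node split, with m << M."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Aggregate the class predictions of individual trees for one sample."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
boot = bootstrap_indices(n_samples=10, rng=rng)
candidates = random_feature_subset(n_features=100, m=10, rng=rng)
vote = majority_vote(["BRCA", "OV", "BRCA"])
```

Because each tree sees a different bootstrap sample and each node considers a different predictor subset, the trees are decorrelated, which is what makes the majority vote more stable than any single tree.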

3.3.3. Extreme Gradient Boosting—XGBoost

Boosting techniques combine several simple learning techniques to create a more robust model. Each weak classifier attempts to improve the classification of samples that were misclassified by the previous weak classifier, aiming to enhance the predictive accuracy of the model compared to a single learning model. Several machine learning algorithms are based on this technique, including AdaBoost, gradient boosting, and XGBoost, with decision tree techniques included in many of these frameworks.

The XGBoost algorithm combines the gradient boosting technique with other features aimed at improving the algorithm itself, including efficient use of hardware-related computational resources and the concept of regularization. At each iteration, XGBoost uses the gradient of the loss to add the tree that most reduces a regularized objective, aiming for the lowest possible error. This objective is expressed as

L = \sum_{i = 1}^{N} l(y_{i}, \hat{y}_{i}) + \sum_{k = 1}^{K} Ω(f_{k})

(5)

where $l(y_{i}, \hat{y}_{i})$ is the loss between the observed label $y_{i}$ and the prediction $\hat{y}_{i}$, $f_{k}$ is the k-th of the K trees, and

Ω(f) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}

corresponds to the regularization term of the objective function L, which measures the complexity of the model. In this equation, T represents the number of leaves in the tree, $w_{j}$ represents the output score of leaf j, γ is the penalty per leaf, which sets the minimum gain required to split an internal node, and λ is the L2 penalty on the leaf scores [41,42].
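To make the role of the regularization term concrete, the following sketch evaluates Ω(f) and the objective of Equation (5) for hand-picked numbers. It is a numerical illustration under assumed values of γ, λ, leaf scores, and per-sample losses, and does not reproduce the XGBoost library's internals.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * sum_j w_j^2 for a single tree."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def regularized_objective(sample_losses, trees, gamma, lam):
    """Equation (5): summed per-sample losses plus the complexity of every tree."""
    return sum(sample_losses) + sum(tree_complexity(t, gamma, lam) for t in trees)


# One hypothetical tree with three leaves and made-up leaf scores.
leaf_scores = [0.5, -0.5, 1.0]
omega = tree_complexity(leaf_scores, gamma=1.0, lam=2.0)
objective = regularized_objective([0.1, 0.2], [leaf_scores], gamma=1.0, lam=2.0)
```

Here the three leaves contribute 3.0 through γ and the squared scores contribute 1.5 through λ, so a split is only worthwhile if it lowers the loss by more than the complexity it adds.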

3.3.4. Naive Bayes—NB

Naive Bayes is a simple yet effective probabilistic classifier widely used in applications such as text classification, data mining, and healthcare [43,44,45]. The algorithm is based on Bayes’ theorem, which calculates the probability of a sample $X$ belonging to a specific class $C_{i}$, assuming that all attributes (or features) are independent of each other given the class, a premise known as the “naive” assumption of independence.

The classifier’s decision is based on comparing the conditional probabilities of the classes, as shown below. If

P(C_{1} | X) > P(C_{2} | X),

(6)

then the sample $X$ is classified as belonging to class $C_{1}$. Otherwise, if

P(C_{2} | X) > P(C_{1} | X),

(7)

the sample is classified as belonging to class $C_{2}$. In general, $X$ is assigned to the class that maximizes the posterior probability $P(C_{i} | X)$. According to Bayes’ theorem, this posterior probability is given by

P(C_{i} | X) = \frac{P(X | C_{i}) P(C_{i})}{P(X)},

(8)

where $P(X | C_{i})$ represents the likelihood of observing the sample $X$ given that it belongs to class $C_{i}$, and $P(C_{i})$ is the prior probability of class $C_{i}$. Because Naive Bayes assumes the independence of attributes, the joint probability $P(X | C_{i})$ can be decomposed as the product of the individual probabilities of each attribute, that is,

P(X | C_{i}) = P(X_{1} | C_{i}) P(X_{2} | C_{i}) \dots P(X_{n} | C_{i}),

(9)

where $X_{1}, X_{2}, \dots, X_{n}$ are the features of the sample $X$. This simplification allows Naive Bayes to be computationally efficient, making it suitable for large-scale classification problems.

Since $P(X)$ is the same for every class, substituting Equation (9) into Equation (8) gives, for a sample with $D$ features $x_{1}, \dots, x_{D}$,

P(C_{i} | X) \propto P(C_{i}) \prod_{j = 1}^{D} P(x_{j} | C_{i})

(10)

where the probabilities $P(C_{i})$, $P(x_{1} | C_{i})$, …, $P(x_{D} | C_{i})$ can be calculated from the input data during training [44]. There are different types of Naive Bayes algorithms, including the Gaussian model, which is commonly used for continuous data where the features within each class are assumed to follow a Gaussian probability distribution [45]. On the other hand, the Bernoulli Naive Bayes classifier is ideal for binary cases, as the algorithm assumes that the attribute either occurs or does not occur, meaning the data are discrete and follow a Bernoulli distribution [46].
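The decision rule can be illustrated with a small Gaussian Naive Bayes written from scratch. This is a sketch under simplifying assumptions (a tiny made-up dataset, a variance floor to avoid division by zero, log-probabilities for numerical stability) and is not the implementation used in the study.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the prior, per-feature mean, and per-feature variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_posterior(model, x):
    """Unnormalized log posterior: log P(C_i) + sum_j log P(x_j | C_i)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (xj - m) ** 2 / (2 * var)
            for xj, m, var in zip(x, means, variances)
        )
        scores[label] = math.log(prior) + log_lik
    return scores


def predict(model, x):
    """Assign x to the class with the highest (log) posterior."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


# Toy two-feature dataset with two well-separated hypothetical classes.
X = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [4.0, 1.0], [4.2, 0.8], [3.8, 1.2]]
y = ["C1", "C1", "C1", "C2", "C2", "C2"]
nb_model = fit_gaussian_nb(X, y)
pred_a = predict(nb_model, [1.1, 5.0])
pred_b = predict(nb_model, [4.1, 1.0])
```

Working with sums of log-probabilities instead of the product of probabilities is a common design choice: with thousands of gene features the raw product would underflow to zero.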

3.3.5. Explainable Artificial Intelligence

Most AI models, including deep learning, can be seen as black boxes due to the challenge of comprehending their decision-making processes. This lack of transparency can pose a problem in many contexts where understanding how AI models make decisions is crucial.

In this context, the explainability of a model would enhance the reliability, transparency, and interpretability of AI model results, potentially enabling the reduction of the computational costs associated with many of these techniques when applied to large datasets [42]. Explainable artificial intelligence (XAI) is an emerging field within AI, capable of identifying which features are most relevant in the decision-making processes of AI algorithms without compromising their performance.

Based on game theory, the Shapley Additive Explanations (SHAP) is a method capable of interpreting the most relevant features in the prediction outcomes of machine learning models based on SHAP values [18]. SHAP values can be employed to explain the output of any machine learning model, including neural networks, decision trees, and linear models, by calculating the relative importance of each feature in the model’s prediction [47,48]. This enables us to comprehend how the model makes its prediction and which input features have the greatest impact on the output, mathematically described as

f(x) = g(x^{'}) = ϕ_{0} + \sum_{i = 1}^{M} ϕ_{i} x_{i}^{'}

(11)

where M represents the number of input features, $ϕ_{0}$ represents the base value, that is, the model output when all inputs are absent, and $x_{i}^{'} \in \{0, 1\}$ indicates whether feature i is observed. The SHAP value for each feature, $ϕ_{i}$, was proposed and elaborated by Lundberg [49].
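The additive property in Equation (11) can be checked numerically on a toy model by computing exact Shapley values through brute-force enumeration of feature coalitions. This is an illustration only: the toy model, the zero baseline used for "absent" features, and the function names are assumptions, and the SHAP library relies on much faster model-specific algorithms rather than this exponential enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (feasible only for small n)."""
    phi = [0.0] * n_features
    players = range(n_features)
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight of a coalition of this size that excludes feature i.
            weight = (
                factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            )
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(x):
    """Hypothetical model with one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # value substituted for a feature that is "absent"


def value_fn(present):
    """Model output when only the features in `present` take their observed values."""
    masked = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return toy_model(masked)


phi_0 = value_fn(frozenset())
phi = shapley_values(value_fn, len(x))
reconstructed = phi_0 + sum(phi)
```

The additive feature receives exactly its own contribution, the interaction term is shared equally between the two interacting features, and the base value plus all attributions reproduces the model output, which is the statement of Equation (11) with every $x_{i}^{'} = 1$.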

3.4. Model Training

For model training, the data were shuffled and randomly split into training and validation sets in an 80% to 20% ratio. k-fold cross-validation was employed to assess and validate the models’ performance. The value of k = 10 was selected through a search process that identified the optimal number of folds based on the models’ best performance.

Hyperparameters play a crucial role in the performance, generalization, and interpretability of AI models. A thorough hyperparameter tuning process was conducted to determine the best training parameters.

Manual adjustments were made to the key model parameters based on the models’ training curves. For the DT algorithm, a maximum tree depth of 3 was adopted to prevent deep trees. The maximum number of leaves was set to 5, and the criterion for node selection was entropy. In the case of the RF model, two critical parameters to be tuned are the number of individual trees and the minimum number of samples required to split an internal node. The chosen values were 100 and 2, respectively, along with a maximum depth of 3. The logarithmic loss function was used for node selection [50].

In XGB, the softmax function was used to optimize class probabilities, as it is a multiclass problem. The maximum depth of the base models’ trees was set to 3 [51]. Learning curves for each model were generated to observe their behaviors during the training process as the sample size increased, as shown in Figure 3.
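As a rough illustration of this training setup, the sketch below applies the stated split ratio, fold count, and tree hyperparameters with scikit-learn on synthetic data. Several points are assumptions rather than details from the study: the synthetic dataset and its size, the stratified split, the fixed random seeds, and the use of criterion="entropy" for the random forest (in recent scikit-learn versions "log_loss" is an equivalent name for the same criterion). XGBoost is omitted so that the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the expression matrix (hypothetical sizes, five classes).
X, y = make_classification(
    n_samples=300,
    n_features=40,
    n_informative=10,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

# Shuffled 80/20 split; stratification is an assumption of this sketch.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)

models = {
    "DT": DecisionTreeClassifier(
        criterion="entropy", max_depth=3, max_leaf_nodes=5, random_state=0
    ),
    "RF": RandomForestClassifier(
        n_estimators=100,
        min_samples_split=2,
        max_depth=3,
        criterion="entropy",
        random_state=0,
    ),
}

# 10-fold cross-validation on the training portion; one accuracy per fold.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=10)
    for name, model in models.items()
}
```

The mean and standard deviation of each entry in cv_scores correspond to the kind of per-model averages and fold-to-fold variability reported for cross-validated results.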

The training score remained consistently high across all training sets for all models. However, as observed in Figure 3, there was an increase in the validation score as the size of the training data grew. The learning curves show an initial gap between training and validation scores, particularly in the XGBoost curve (Figure 3c) and the decision tree curve (Figure 3a), where a significant difference is evident when fewer than 600 samples were used. This gap is indicative of potential overfitting during the early stages of training. Nevertheless, as the number of samples increased and the models underwent further iterations, the validation performance improved, and the gap between the training and validation scores diminished, especially when there were more than 1000 samples. In contrast, the random forest model (Figure 3b) maintained a more stable relationship between training and validation accuracy throughout the process, with smaller discrepancies, indicating a lower risk of overfitting from the outset.

The gap visible in the figures is largely an effect of the scale of the plots rather than evidence of persistent overfitting. In Figure 3a, the y-axis is scaled from 0.94 to 1.00, which may give the impression of a more significant gap between the training and validation curves. However, it can be observed that the validation accuracy remains close to the training accuracy, indicating the good generalization capability of the model. In Figure 3b,c, the y-axis scale is similar, but the dispersion between the curves is smaller, especially for larger sample sets. The difference between the training and validation curves decreased for all models as the number of samples increased, which suggests that the models generalize well.

All the curves indicate a good trade-off between bias and variance and suggest that the sample sizes used for model training were sufficient, as the models stabilized with a considerable number of samples. The diminished difference between the training and validation curves confirms the absence of overfitting, indicating that the model can generalize well. These findings align with the results presented below (see Table 1).

4. Results

4.1. Training Machine Learning Models to Predict Cancer Types Using RNA-Seq Data Based on the Full Gene List

The initial results of this study were generated using a comprehensive list of genes, consisting of 3057 samples and 21,480 gene types. Accuracy, precision, sensitivity, and F1 score metrics were employed to evaluate the performance of the models applied to classify the five most recurrent tumor types in women using the full gene list. Consequently, the performance metrics represent the average values obtained across all folds, as shown in the “Original Features” column in Table 1.

Random forest achieved the best performance, with an accuracy of 99.40% and a precision of 99.43%. In second place was XGBoost, which also exhibited accuracy, precision, sensitivity, and F1 score values above 99.00%. The decision tree algorithm also demonstrated good performance, with values exceeding 97% across all adopted metrics. It is worth noting that the analyzed groups belong to highly heterogeneous tumor types, which may explain the high performance of the employed models, including DTs.

The standard deviation can indicate the variance of the models during their training. For 10-fold cross-validation, the standard deviation for the DT model in terms of accuracy was 0.0089; for RF, it was 0.003709; and for XGB, it was 0.00356. Thus, a lower standard deviation implies lower variance.

4.2. Feature Selection Using SHAP and ML Model Performance Evaluation

Subsequently, the SHAP method was applied to three models to identify the most influential features in their decision-making processes. These three models (RF, DT, and XGB) were chosen for the SHAP application because they are more suitable and efficient given the large volume of original input data. Additionally, tree-based models tend to be easier to explain since their hierarchical structure facilitates the interpretation of each feature’s contribution, making SHAP an ideal tool for identifying the influence of each gene in cancer classification and for reducing the number of variables (features) in a dataset through feature selection, helping to pinpoint the most relevant genes.

In this context, after training the classifiers, the SHAP technique was applied to the training data to calculate the Shapley values. The SHAP-based feature selection process ranked genes by Shapley values, retaining those with the highest importance for model prediction. At the end of the selection process, a total of 223 genes were obtained, with 122 genes extracted from the RF model, 11 from the DT model, and 90 from XGB. Among these, 194 genes were unique. The final number of extracted genes represented less than 1% of the original set of 21,480 genes. A discussion regarding the key genes identified by the technique is presented in Section 4.3.
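The selection and merging step can be illustrated with a short sketch. It assumes that each model has already been reduced to one global importance score per gene (for example, the mean absolute SHAP value), and it keeps every gene whose score exceeds a threshold. The retention rule, the helper names, and the toy scores are assumptions made for illustration, since the exact cut-off is not restated here. The gene counts in the final arithmetic check are the ones reported in this section.

```python
def select_genes(importance, threshold=0.0):
    """Rank genes by importance (descending) and keep those above the threshold."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, score in ranked if score > threshold]


def percent_retained(n_selected, n_original):
    """Share of the original feature set that survives selection, in percent."""
    return 100.0 * n_selected / n_original


# Hypothetical global importance scores per model (toy values; only the gene
# names come from the text).
importance_by_model = {
    "DT": {"PAX8": 0.21, "TG": 0.08, "EMX2": 0.0, "NAPSA": 0.0},
    "RF": {"PAX8": 0.05, "TG": 0.04, "EMX2": 0.03, "NAPSA": 0.0},
    "XGB": {"PAX8": 0.0, "TG": 0.30, "EMX2": 0.12, "NAPSA": 0.09},
}

selected = {model: select_genes(scores) for model, scores in importance_by_model.items()}
total_selected = sum(len(genes) for genes in selected.values())
unique_genes = sorted(set().union(*selected.values()))
print(total_selected, "selected in total;", len(unique_genes), "unique")

# Arithmetic check on the counts reported in this section.
reported_counts = {"RF": 122, "DT": 11, "XGB": 90}
reported_total = sum(reported_counts.values())
reported_share = percent_retained(194, 21480)
print(f"{reported_total} genes in total; unique genes are {reported_share:.2f}% of the input")
```

Because the same gene can be selected by more than one model, the number of unique genes is smaller than the sum of the per-model counts.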

As illustrated in Figure 1, the final step of our analysis involved retraining the five models selected for this study using only the reduced and unique gene list selected by SHAP. In Table 1, we report the accuracy and precision of all models when using only the SHAP-selected genes, to assess whether this reduced list provides similar or improved model performance and, therefore, whether SHAP can be employed effectively as a feature selection technique.

The models from which we extracted the SHAP genes showed a slight increase in their accuracy and precision values (<1%). Other models that did not undergo the feature selection process using SHAP values also experienced improvements in their metrics. The precision of the Bernoulli NB model increased from 97.94% to 98.97%, and the accuracy improved from 97.14% to 98.96%.

Although SHAP was not directly applied to the Naive Bayes models (Gaussian NB and Bernoulli NB), the reduced and unique gene set selected by SHAP proved effective across all models. Notably, even for the Naive Bayes models, which differ from the original SHAP-applied models, the chosen features enabled accurate classification among tumor classes, demonstrating strong performance. These results suggest that the genes selected by SHAP are sufficiently comprehensive to ensure accurate classification at a lower computational cost. This consolidates SHAP as a robust feature selection technique, as the genes it identified performed well even in probabilistic models.

4.3. SHAP Genes

Figure 4, Figure 5 and Figure 6 illustrate the contribution of each gene when applying the SHAP method to the predictions of the DT, RF, and XGB models, ranking them in descending order of influence. The summary plot highlights the most influential features based on the average SHAP values. In the context of a multiclass classification task, the summary plot ranks the features based on their overall contribution to all classes using specific colors representing each class.

It can be observed that the pattern of SHAP gene contributions differs among the three model classes (DT, RF, and XGB). In the DT model, only 11 SHAP genes were considered (see Figure 4). Of these genes, only PAX8 contributed to the classification of all classes, making it the most relevant for this model. The remaining 10 genes contributed to only a few classes, with imbalanced values among them.

On the other hand, the RF model resulted in 122 SHAP genes (see Figure 5). The top 20 genes in this model contributed to all five evaluated classes. Although the contributions were not balanced across all classes, they were still considered relevant. It is noteworthy that some genes had significant contributions to specific classes, often at the expense of others. For example, the genes EMX2 and TSHR contributed more to the OV and THCA classes, respectively.

The XGB model, in turn, revealed SHAP genes that predominantly contributed to a single class (see Figure 6). Among the top 20 genes with the highest impact on classifications, the TG gene (thyroglobulin) stands out, primarily contributing to the THCA class, making it the sole representative of this tumor type. For the COAD class, the identified genes included CDX1 (caudal-type homeobox 1), FABP1 (fatty acid binding protein 1), and GPA33 (glycoprotein A33). In the OV class, the relevant genes were SOX17 (SRY-box transcription factor 17), MEIS1 (MEIS homeobox 1), EMX2 (empty spiracles homeobox 2), and RPL10AP6 (ribosomal protein L10a pseudogene 6). Observations for the BRCA class revealed genes such as HKDC1 (hexokinase domain-containing 1), LMX1B (LIM homeobox transcription factor 1 beta), GATA3 (GATA binding protein 3), TRPS1 (transcriptional repressor GATA binding 1), and FOXA2 (forkhead box A2). For the LUAD class, the identified genes included SFTPA2 (surfactant protein A2), NAPSA (napsin A aspartic peptidase), TBX4 (T-box transcription factor 4), SFTPA1 (surfactant protein A1), HAND2 (heart and neural crest derivatives expressed 2), TFPI (tissue factor pathway inhibitor), and FGG (fibrinogen gamma chain). In summary, the results highlight that the DT, RF, and XGB models exhibit distinct patterns of SHAP gene contributions to sample classification.
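The per-class reading of the summary plots can be sketched as follows. The example assumes that the SHAP values are laid out as one samples-by-genes matrix per class, takes the mean absolute value per gene and class, and reports each gene's total importance together with its dominant class and that class's share of the total. The data layout, the helper names, and all numeric values are illustrative assumptions. Only three classes and two genes are used to keep the example short.

```python
def mean_abs_by_class(shap_by_class, genes):
    """Per-class mean absolute SHAP value for every gene."""
    result = {gene: {} for gene in genes}
    for cls, matrix in shap_by_class.items():
        n_samples = len(matrix)
        for j, gene in enumerate(genes):
            result[gene][cls] = sum(abs(row[j]) for row in matrix) / n_samples
    return result


def summarize(per_class):
    """Total importance, dominant class, and dominant-class share per gene,
    sorted by total importance in descending order."""
    rows = []
    for gene, contrib in per_class.items():
        total = sum(contrib.values())
        top_class = max(contrib, key=contrib.get)
        share = contrib[top_class] / total if total else 0.0
        rows.append((gene, total, top_class, share))
    return sorted(rows, key=lambda row: row[1], reverse=True)


genes = ["TG", "PAX8"]

# Toy SHAP values: one list of samples per class, one value per gene in each sample.
shap_by_class = {
    "THCA": [[0.60, 0.10], [-0.40, 0.10]],
    "OV": [[0.00, -0.20], [0.00, 0.20]],
    "LUAD": [[0.00, 0.10], [0.00, -0.10]],
}

summary = summarize(mean_abs_by_class(shap_by_class, genes))
for gene, total, top_class, share in summary:
    print(f"{gene}: total={total:.2f}, dominant class={top_class}, share={share:.0%}")
```

In this toy setting, a share close to 1 corresponds to the single-class pattern described for the XGB genes, whereas a share spread over several classes corresponds to the pattern described for PAX8 in the DT model.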

5. Discussion

In this study, the SHAP method was employed for feature selection. SHAP values reflect the importance of each feature in the model, enabling features to be ranked by their relevance. The SHAP-based approach to feature selection has been reported to outperform other dimensionality reduction strategies while also enhancing the interpretability of the obtained results [52].

Overall, the SHAP value-based feature selection method proved to be successful in feature reduction, resulting in improvements in performance metrics and generalizability to external models [53,54]. Santos et al. used the SHAP technique to select features for fault detection, classification, and severity estimation using the SVM model. Similar to the reduced gene list approach in cancer classification, SHAP produced a smaller, more focused set of features, leading to improved accuracy [53]. Sadaei et al. introduced a SHAP-based feature selection tool designed for better performance and interpretability of models based on data from different areas of healthcare [54]. Our results can be partially compared to the study by Mohammed et al. [55], which employed a deep learning approach through stacking ensembles to classify the same five types of tumors. In that work, the authors used LASSO as a feature selection technique, selecting only 173 genes. This led to the best average results, with an accuracy of 99.45%, still slightly below the 99.48% achieved without the LASSO step. In our study, by comparison, feature selection using SHAP had a more pronounced effect on accuracy and resulted in overall improvements in model performance. For instance, the Gaussian NB model achieved an accuracy of 99.63% using only the SHAP-selected genes.

A major advantage of using an XAI technique in biological and clinical data is the ability to interpret the results and relate them to the phenomena classified by the model. A study with an extensive RNA-seq database spanning 47 tissues demonstrated that genes selected based on SHAP values to explain a convolutional neural network reflect the expected biological processes related to the differentiation and function of these tissues [28].

It is equally important to highlight that the genes identified by the SHAP method may not necessarily exhibit a direct association with differentially expressed genes. However, they can still be closely related to the phenomenon under study. This implies that the method can unveil new insights into transcription data [56].

When applying the SHAP technique to the machine learning models, we revealed that genes contribute to predictions differently in each tested model. Let us focus on the SHAP-selected genes for the XGBoost model. The SHAP values assigned to these genes are primarily associated with a single tumor class, suggesting a relationship between the gene and a specific type of tumor. We will investigate the key SHAP-selected genes in the XGB model and determine whether they have any relationship with the tumor class indicated by the SHAP value.

The gene TG encodes the precursor protein of thyroid hormones, which are essential for growth, development, and metabolic regulation [57]. TG is the most abundantly expressed gene product specific to the thyroid [58].

The gene CDX1 plays a critical role in the development of intestinal epithelium [59]. The gene FABP1 encodes a protein that plays a fundamental role in fatty acid metabolism and is specifically expressed in various tissues. It has been observed that approximately 70% of colon adenocarcinoma (COAD) cases exhibit positive rates of FABP1 expression [60]. The gene GPA33 is a member of the immunoglobulin superfamily and is present in 95% of colon tumors [61]. Therefore, the literature supports the SHAP result that associates these three genes with the COAD class.

The gene SOX17 is a highly expressed master regulator in ovarian cancer [62]. The gene MEIS1 has been studied in various contexts, including its role in the apoptosis of ovarian granulosa cells [63]. The gene EMX2 is a fundamental transcription factor in urogenital system formation and is intensely expressed in the ovary [64].

The gene HKDC1 is overexpressed and promotes the proliferation of breast cancer cells [65]. The gene LMX1B, which encodes a transcription factor, exhibits a methylation signature specific to breast tissue [66]. Furthermore, this gene has been identified as a potent biomarker for the identification and monitoring of localized breast cancer [67]. The gene GATA3 is considered the most highly expressed transcription factor in the luminal epithelial cells of the mammary gland and is mutated in approximately 10% of breast cancer cases [68]. The gene TRPS1 is a highly sensitive and specific marker for breast cancer and, unlike GATA3, is also highly expressed in the triple-negative subtype [69]. The gene FOXA2 plays a role in the proliferation and maintenance of tumor stem cells, particularly in triple-negative breast cancer [70].

The genes SFTPA1 and SFTPA2 encode surfactant proteins that are essential for the functioning of the pulmonary alveoli [71]. NAPSA is a gene expressed by type II pneumocytes and alveolar macrophages and has been recognized as a potential biomarker for lung adenocarcinomas [72]. The gene TBX4 is a transcription factor that regulates, among other cells, pulmonary fibroblasts [73]. Although there is no literature directly linking the gene HAND2 to the lung, it is worth noting that its antisense counterpart (HAND2-AS1) plays a repressive role in lung cancer cells [74]. The gene TFPI encodes a potent anticoagulant that has been associated with deep vein thrombosis and metastasis in lung cancer [75]. Another gene that plays a fundamental role in blood coagulation is FGG, which has been recommended as a prognostic biomarker for lung cancer [76].

Based on the previous discussion, we have observed a relationship between SHAP-selected genes and a specific tumor class. This relationship was established by analyzing the known biology of each of the top 20 SHAP-selected genes in the XGBoost model (see Table 2). The analysis using SHAP values allowed for identifying the most influential genes in the classification of cancer types, as listed in Table 2. These genes can be associated with their respective cancers through known biological roles and correlations. For example, TG is a well-established marker for thyroid cancer (THCA) [77,78], while NAPSA is recognized for its association with lung adenocarcinoma (LUAD) [79,80]. Similarly, SOX17 and FABP1 are linked to ovarian cancer (OV) and colon adenocarcinoma (COAD), respectively [81,82,83,84]. The SHAP values provide insights into whether higher expression of these genes correlates positively or negatively with the likelihood of a specific cancer type, adding a meaningful interpretative layer to the gene–cancer associations identified in Table 2.

The analysis of the genes identified by SHAP and presented in Table 2 demonstrates their biological relevance in known cancer pathways, supporting their importance beyond the statistical prominence in the model. For example, the TG gene, widely recognized as a marker for THCA, is essential for producing thyroid hormones, which play a crucial role in regulating metabolism and cell growth, processes often disrupted in thyroid neoplasms [77,78]. The NAPSA gene, associated with LUAD, is used as a biomarker to distinguish this type of lung cancer from other neoplasms, which supports its diagnostic potential [79,80].

Furthermore, the FABP1 gene, involved in fatty acid metabolism, is highly expressed in colon tissues and frequently found in COAD, where its elevated expression may indicate metabolic alterations common to these tumors [81,82]. The SOX17 gene plays a significant role in developmental regulation and is a master factor expressed in OV, influencing cell proliferation and invasiveness [83,84]. These associations demonstrate that the highlighted genes are statistically relevant in the model and have proven biological significance in the pathways and processes associated with specific cancer types.

6. Conclusions

This study applied explainable artificial intelligence to RNA-seq gene expression data, focusing on five tumor types: BRCA, LUAD, THCA, OV, and COAD. The SHAP technique was utilized across various tree-based models, including decision trees, random forest, and XGBoost, to perform feature selection. New models trained exclusively with the genes selected by SHAP maintained their accuracy levels compared to models using all available genes. Interestingly, even Gaussian Naive Bayes and Bernoulli Naive Bayes models, which did not undergo feature selection, performed well when trained with the SHAP-selected genes from the other models. This suggests that the selected features effectively distinguished between classes, regardless of the originating model. Additionally, we assessed the interpretability of the SHAP results for each model. Notably, the SHAP-selected genes from the XGBoost model were associated with only one of the classes, simplifying the interpretation of the results. The employed strategy allowed for the selection of the most important genes without compromising model performance, while also enhancing overall transparency and explainability. Using XGBoost in combination with SHAP appears to be a promising approach for identifying biomarkers in multiclass classifications.

Author Contributions

All authors contributed to varying degrees to ensure the quality of this work (e.g., M.D., K.S.A., L.C.d.S., and M.A.C.F. conceived the idea and experiments; M.D., K.S.A., L.C.d.S., and M.A.C.F. designed and performed the experiments; M.D., K.S.A., L.C.d.S., C.B.d.F., M.L., and M.A.C.F. analyzed the data; and M.D., K.S.A., L.C.d.S., C.B.d.F., M.L., and M.A.C.F. wrote the paper. M.A.C.F. coordinated the project). All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available in the Genomic Data Commons, https://gdc.cancer.gov/ (accessed on 28 November 2024).

Acknowledgments

The authors wish to acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fahad Ullah, M. Breast cancer: Current perspectives on the disease status. In Breast Cancer Metastasis and Drug Resistance: Challenges and Progress; Springer: Berlin/Heidelberg, Germany, 2019; pp. 51–64.
Wang, X.; Ahmad, I.; Javeed, D.; Zaidi, S.A.; Alotaibi, F.M.; Ghoneim, M.E.; Daradkeh, Y.I.; Asghar, J.; Eldin, E.T. Intelligent Hybrid Deep Learning Model for Breast Cancer Detection. Electronics 2022, 11, 2767.
Fidler-Benaoudia, M.M.; Torre, L.A.; Bray, F.; Ferlay, J.; Jemal, A. Lung cancer incidence in young women vs. young men: A systematic analysis in 40 countries. Int. J. Cancer 2020, 147, 811–819.
Tsai, L.L.; Chu, N.Q.; Blessing, W.A.; Moonsamy, P.; Colson, Y.L. Lung cancer in women. Ann. Thorac. Surg. 2022, 114, 1965–1973.
Stewart, C.; Ralyea, C.; Lockwood, S. Ovarian cancer: An integrated review. In Proceedings of the Seminars in Oncology Nursing; Elsevier: Amsterdam, The Netherlands, 2019; Volume 35, pp. 151–156.
Cai, Y.; Rattray, N.J.; Zhang, Q.; Mironova, V.; Santos-Neto, A.; Hsu, K.S.; Rattray, Z.; Cross, J.R.; Zhang, Y.; Paty, P.B.; et al. Sex differences in colon cancer metabolism reveal a novel subphenotype. Sci. Rep. 2020, 10, 4905.
Wen, H.; Li, F.; Bukhari, I.; Mi, Y.; Guo, C.; Liu, B.; Zheng, P.; Liu, S. Comprehensive analysis of colorectal cancer immunity and identification of immune-related prognostic targets. Dis. Markers 2022, 2022, 7932655.
Van Velsen, E.F.; Leung, A.M.; Korevaar, T.I. Diagnostic and treatment considerations for thyroid cancer in women of reproductive age and the perinatal period. Endocrinol. Metab. Clin. 2022, 51, 403–416.
Tang, Z.; Zhang, J.; Zhou, Q.; Xu, S.; Cai, Z.; Jiang, G. Thyroid cancer “epidemic”: A socio-environmental health problem needs collaborative efforts. Environ. Sci. Technol. 2020, 54, 3725–3727.
Mattiuzzi, C.; Lippi, G. Current cancer epidemiology. J. Epidemiol. Glob. Health 2019, 9, 217.
Huang, S.; Yang, J.; Fong, S.; Zhao, Q. Artificial intelligence in cancer diagnosis and prognosis: Opportunities and challenges. Cancer Lett. 2020, 471, 61–71.
Elemento, O.; Leslie, C.; Lundin, J.; Tourassi, G. Artificial intelligence in cancer research, diagnosis and therapy. Nat. Rev. Cancer 2021, 21, 747–752.
Chua, I.S.; Gaziel-Yablowitz, M.; Korach, Z.T.; Kehl, K.L.; Levitan, N.A.; Arriaga, Y.E.; Jackson, G.P.; Bates, D.W.; Hassett, M. Artificial intelligence in oncology: Path to implementation. Cancer Med. 2021, 10, 4138–4149.
Agrawal, P.; Abutarboush, H.F.; Ganesh, T.; Mohamed, A.W. Metaheuristic algorithms on feature selection: A survey of one decade of research (2009–2019). IEEE Access 2021, 9, 26766–26791.
Moncada-Torres, A.; van Maaren, M.C.; Hendriks, M.P.; Siesling, S.; Geleijnse, G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci. Rep. 2021, 11, 6968.
Hauser, K.; Kurz, A.; Haggenmüller, S.; Maron, R.C.; von Kalle, C.; Utikal, J.S.; Meier, F.; Hobelsberger, S.; Gellrich, F.F.; Sergon, M.; et al. Explainable artificial intelligence in skin cancer recognition: A systematic review. Eur. J. Cancer 2022, 167, 54–69.
Zhang, Y.; Weng, Y.; Lund, J. Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics 2022, 12, 237.
Meshoul, S.; Batouche, A.; Shaiba, H.; AlBinali, S. Explainable Multi-Class Classification Based on Integrative Feature Selection for Breast Cancer Subtyping. Mathematics 2022, 10, 4271.
Ara, S.; Das, A.; Dey, A. Malignant and benign breast cancer classification using machine learning algorithms. In Proceedings of the 2021 International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, 5–7 April 2021; pp. 97–101.
Vural, S.; Wang, X.; Guda, C. Classification of breast cancer patients using somatic mutation profiles and machine learning approaches. BMC Syst. Biol. 2016, 10, 263–276.
Ram, M.; Najafi, A.; Shakeri, M.T. Classification and biomarker genes selection for cancer gene expression data using random forest. Iran. J. Pathol. 2017, 12, 339.
Yuan, F.; Lu, L.; Zou, Q. Analysis of gene expression profiles of lung cancer subtypes with machine learning algorithms. Biochim. Biophys. Acta (BBA) Mol. Basis Dis. 2020, 1866, 165822.
Yeganeh, P.N.; Mostafavi, M.T. Use of Machine Learning for Diagnosis of Cancer in Ovarian Tissues with a Selected mRNA Panel. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 2429–2434.
Alharbi, F.; Vakanski, A. Machine learning methods for cancer classification using gene expression data: A review. Bioengineering 2023, 10, 173.
Khalifa, N.E.M.; Taha, M.H.N.; Ezzat Ali, D.; Slowik, A.; Hassanien, A.E. Artificial Intelligence Technique for Gene Expression by Tumor RNA-Seq Data: A Novel Optimized Deep Learning Approach. IEEE Access 2020, 8, 22874–22883.
De Guia, J.M.; Devaraj, M.; Leung, C.K. DeepGx: Deep Learning Using Gene Expression for Cancer Classification. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Vancouver, BC, Canada, 27–30 August 2019; pp. 913–920.
Hassan, M.R.; Islam, M.F.; Uddin, M.Z.; Ghoshal, G.; Hassan, M.M.; Huda, S.; Fortino, G. Prostate cancer classification from ultrasound and MRI images using deep learning based Explainable Artificial Intelligence. Future Gener. Comput. Syst. 2022, 127, 462–472.
Yap, M.; Johnston, R.L.; Foley, H.; MacDonald, S.; Kondrashova, O.; Tran, K.A.; Nones, K.; Koufariotis, L.T.; Bean, C.; Pearson, J.V.; et al. Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Sci. Rep. 2021, 11, 2641.
Grossman, R.L.; Heath, A.P.; Ferretti, V.; Varmus, H.E.; Lowy, D.R.; Kibbe, W.A.; Staudt, L.M. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 2016, 375, 1109–1112.
Colaprico, A.; Silva, T.C.; Olsen, C.; Garofano, L.; Cava, C.; Garolini, D.; Sabedot, T.S.; Malta, T.M.; Pagnotta, S.M.; Castiglioni, I.; et al. TCGAbiolinks: An R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016, 44, e71.
Mahesh, B. Machine learning algorithms-a review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386.
Yan, J.; Wang, X. Unsupervised and semi-supervised learning: The next frontier in machine learning for plant systems biology. Plant J. 2022, 111, 1527–1538.
Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440.
Kaiser, L.; Babaeizadeh, M.; Milos, P.; Osinski, B.; Campbell, R.H.; Czechowski, K.; Erhan, D.; Finn, C.; Kozakowski, P.; Levine, S.; et al. Model-based reinforcement learning for atari. arXiv 2019, arXiv:1903.00374.
Somvanshi, M.; Chavan, P.; Tambade, S.; Shinde, S. A review of machine learning techniques using decision tree and support vector machine. In Proceedings of the 2016 International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 12–13 August 2016; pp. 1–7.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Cutler, A.; Cutler, D.; Stevens, J. Random forests. In Ensemble Machine Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 157–175.
Reis, I.; Baron, D.; Shahaf, S. Probabilistic random forest: A machine learning algorithm for noisy data sets. Astron. J. 2018, 157, 16.
Schonlau, M.; Zou, R.Y. The random forest algorithm for statistical learning. Stata J. 2020, 20, 3–29.
Ali, J.; Khan, R.; Ahmad, N.; Maqsood, I. Random forests and decision trees. Int. J. Comput. Sci. Issues (IJCSI) 2012, 9, 272.
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
Meng, Y.; Yang, N.; Qian, Z.; Zhang, G. What makes an online review more helpful: An interpretation framework using XGBoost and SHAP values. J. Theor. Appl. Electron. Commer. Res. 2020, 16, 466–490.
Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46.
Zhang, H.; Li, D. Naïve Bayes text classifier. In Proceedings of the 2007 IEEE International Conference on Granular Computing (GRC 2007), San Jose, CA, USA, 2–4 November 2007; p. 708.
Kamel, H.; Abdulah, D.; Al-Tuwaijari, J.M. Cancer classification using gaussian naive bayes algorithm. In Proceedings of the 2019 International Engineering Conference (IEC), Erbil, Iraq, 23–25 June 2019; pp. 165–170.
Singh, G.; Kumar, B.; Gaur, L.; Tyagi, A. Comparison between multinomial and Bernoulli naïve Bayes for text classification. In Proceedings of the 2019 International Conference on Automation, Computational and Technology Management (ICACTM), London, UK, 24–26 April 2019; pp. 593–596.
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
Mangalathu, S.; Hwang, S.H.; Jeon, J.S. Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach. Eng. Struct. 2020, 219, 110927.
Lundberg, S.M.; Erion, G.G.; Lee, S.I. Consistent individualized feature attribution for tree ensembles. arXiv 2018, arXiv:1802.03888.
Probst, P.; Boulesteix, A.L. To tune or not to tune the number of trees in random forest. J. Mach. Learn. Res. 2017, 18, 6673–6690.
Quinto, B. Next-Generation Machine Learning with Spark: Covers XGBoost, LightGBM, Spark NLP, Distributed Deep Learning with Keras, and More; Apress: New York, NY, USA, 2020.
Marcílio, W.E.; Eler, D.M. From explanations to feature selection: Assessing SHAP values as feature selection mechanism. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Recife/Porto de Galinhas, Brazil, 7–10 November 2020; pp. 340–347.
Santos, M.R.; Guedes, A.; Sanchez-Gendriz, I. SHapley Additive exPlanations (SHAP) for Efficient Feature Selection in Rolling Bearing Fault Diagnosis. Mach. Learn. Knowl. Extr. 2024, 6, 316–341.
Sadaei, H.J.; Loguercio, S.; Shafiei Neyestanak, M.; Torkamani, A.; Prilutsky, D. Zoish: A Novel Feature Selection Approach Leveraging Shapley Additive Values for Machine Learning Applications in Healthcare. In Proceedings of the Pacific Symposium on Biocomputing, Kohala Coast, HI, USA, 3–7 January 2024; World Scientific: Singapore, 2023; pp. 81–95.
Mohammed, M.; Mwambi, H.; Mboya, I.B.; Elbashir, M.K.; Omolo, B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci. Rep. 2021, 11, 15626.
Li, Q.; Yu, Y.; Kossinna, P.; Lun, T.; Liao, W.; Zhang, Q. XA4C: eXplainable representation learning via Autoencoders revealing Critical genes. PLoS Comput. Biol. 2023, 19, e1011476.
Citterio, C.E.; Targovnik, H.M.; Arvan, P. The role of thyroglobulin in thyroid hormonogenesis. Nat. Rev. Endocrinol. 2019, 15, 323–338.
Zhang, X.; Young, C.; Morishita, Y.; Kim, K.; Kabil, O.O.; Clarke, O.B.; Di Jeso, B.; Arvan, P. Defective thyroglobulin: Cell biology of disease. Int. J. Mol. Sci. 2022, 23, 13605.
Guo, R.J.; Suh, E.R.; Lynch, J.P. The role of Cdx proteins in intestinal development and cancer. Cancer Biol. Ther. 2004, 3, 593–601.
Dum, D.; Ocokoljic, A.; Lennartz, M.; Hube-Magg, C.; Reiswich, V.; Höflmayer, D.; Jacobsen, F.; Bernreuther, C.; Lebok, P.; Sauter, G.; et al. FABP1 expression in human tumors: A tissue microarray study on 17,071 tumors. Virchows Arch. 2022, 481, 945–961.
Frey, D.; Coelho, V.; Petrausch, U.; Schaefer, M.; Keilholz, U.; Thiel, E.; Deckert, P.M. Surface expression of gpA33 is dependent on culture density and cell-cycle phase and is regulated by intracellular traffic rather than gene transcription. Cancer Biother. Radiopharm. 2008, 23, 65–73.
Shaker, N.; Chen, W.; Sinclair, W.; Parwani, A.V.; Li, Z. Identifying SOX17 as a sensitive and specific marker for ovarian and endometrial carcinomas. Mod. Pathol. 2023, 36, 100038.
Wang, S.; Wang, Y.; Chen, Y.; Li, Y.; Du, X.; Li, Y.; Li, Q. MEIS1 Is a Common Transcription Repressor of the miR-23a and NORHA Axis in Granulosa Cells. Int. J. Mol. Sci. 2023, 24, 3589.
Pellegrini, M.; Pantano, S.; Lucchini, F.; Fumi, M.; Forabosco, A. Emx 2 developmental expression in the primordia of the reproductive and excretory systems. Anat. Embryol. 1997, 196, 427–433.
Chen, X.; Lv, Y.; Sun, Y.; Zhang, H.; Xie, W.; Zhong, L.; Chen, Q.; Li, M.; Li, L.; Feng, J.; et al. PGC1β regulates breast tumor growth and metastasis by SREBP1-mediated HKDC1 expression. Front. Oncol. 2019, 9, 290.
Kjær, I.M.; Kahns, S.; Timm, S.; Andersen, R.F.; Madsen, J.S.; Jakobsen, E.H.; Tabor, T.P.; Jakobsen, A.; Bechmann, T. Phase II trial of delta-tocotrienol in neoadjuvant breast cancer with evaluation of treatment response using ctDNA. Sci. Rep. 2023, 13, 8419.
Moss, J.; Zick, A.; Grinshpun, A.; Carmon, E.; Maoz, M.; Ochana, B.; Abraham, O.; Arieli, O.; Germansky, L.; Meir, K.; et al. Circulating breast-derived DNA allows universal detection and monitoring of localized breast cancer. Ann. Oncol. 2020, 31, 395–403.
Takaku, M.; Grimm, S.A.; Wade, P.A. GATA3 in breast cancer: Tumor suppressor or oncogene? Gene Expr. J. Liver Res. 2015, 16, 163–168.
Ai, D.; Yao, J.; Yang, F.; Huo, L.; Chen, H.; Lu, W.; Soto, L.M.S.; Jiang, M.; Raso, M.G.; Wang, S.; et al. TRPS1: A highly sensitive and specific marker for breast carcinoma, especially for triple-negative breast cancer. Mod. Pathol. 2021, 34, 710–719.
Perez-Balaguer, A.; Ortiz-Martínez, F.; García-Martínez, A.; Pomares-Navarro, C.; Lerma, E.; Peiró, G. FOXA2 mRNA expression is associated with relapse in patients with triple-negative/basal-like breast carcinoma. Breast Cancer Res. Treat. 2015, 153, 465–474.
Floros, J.; Thorenoor, N.; Tsotakos, N.; Phelps, D.S. Human surfactant protein SP-A1 and SP-A2 variants differentially affect the alveolar microenvironment, surfactant structure, regulation and function of the alveolar macrophage, and animal and human survival under various conditions. Front. Immunol. 2021, 12, 681639.
Kim, M.Y.; Go, H.; Koh, J.; Lee, K.; Min, H.S.; Kim, M.A.; Jeon, Y.K.; Lee, H.S.; Moon, K.C.; Park, S.Y.; et al. Napsin A is a useful marker for metastatic adenocarcinomas of pulmonary origin. Histopathology 2014, 65, 195–206.
Horie, M.; Miyashita, N.; Mikami, Y.; Noguchi, S.; Yamauchi, Y.; Suzukawa, M.; Fukami, T.; Ohta, K.; Asano, Y.; Sato, S.; et al. TBX4 is involved in the super-enhancer-driven transcriptional programs underlying features specific to lung fibroblasts. Am. J. Physiol.-Lung Cell. Mol. Physiol. 2018, 314, L177–L191.
Ghafouri-Fard, S.; Hussen, B.M.; Abdullah, S.R.; Dadyar, M.; Taheri, M.; Kiani, A. A review on the role of HAND2-AS1 in cancer. Clin. Exp. Med. 2023, 23, 3179–3188.
Fei, X.; Wang, H.; Yuan, W.; Wo, M.; Jiang, L. Tissue factor pathway inhibitor-1 is a valuable marker for the prediction of deep venous thrombosis and tumor metastasis in patients with lung cancer. BioMed Res. Int. 2017, 2017, 8983763.
Zhang, W.; Gao, Z.; Zeng, G.; Xie, H.; Liu, J.; Liu, N.; Wang, G. Clinical significance of urinary plasminogen and fibrinogen gamma chain as novel potential diagnostic markers for non-small-cell lung cancer. Clin. Chim. Acta 2020, 502, 55–65.
Giovanella, L.; D’Aurizio, F.; Petranović Ovčariček, P.; Görges, R. Diagnostic, Theranostic and Prognostic Value of Thyroglobulin in Thyroid Cancer. J. Clin. Med. 2024, 13, 2463.
Kołodziej, M.; Saracyn, M.; Lubas, A.; Brodowska-Kania, D.; Mazurek, A.; Dziuk, M.; Durma, A.D.; Niemczyk, S.; Kamiński, G. TSH Stimulation before PET/CT as Our Frenemy in Detecting Thyroid Cancer Metastases—Final Results of a Retrospective Analysis. Cancers 2024, 16, 3413.
Hadi, R.; Xu, H. Primary Lung Versus Metastatic Adenocarcinoma. In Practical Lung Pathology: Frequently Asked Questions; Springer: Berlin/Heidelberg, Germany, 2022; pp. 101–105.
Xu, Y. Single-cell landscape of the immune microenvironment of leptomeningeal metastases in non-small cell lung cancer treated with pemetrexed sheath injection. J. Clin. Oncol. 2024, 42, e20026.
Zhang, Y.; Zhang, W.; Min, X.; Zhujun, X.; Fangmei, A.; Qiang, Z.; Wenying, T.; Tianyue, Z. High expression of FABP4 in colorectal cancer and its clinical significance. J. Zhejiang Univ. Sci. B 2021, 22, 136.
Prayugo, F.B.; Kao, T.J.; Anuraga, G.; Ta, H.D.K.; Chuang, J.Y.; Lin, L.C.; Wu, Y.F.; Wang, C.Y.; Lee, K.H. Expression profiles and prognostic value of FABPs in colorectal adenocarcinomas. Biomedicines 2021, 9, 1460.
Lin, L.; Shi, K.; Zhou, S.; Cai, M.C.; Zhang, C.; Sun, Y.; Zang, J.; Cheng, L.; Ye, K.; Ma, P.; et al. SOX17 and PAX8 constitute an actionable lineage-survival transcriptional complex in ovarian cancer. Oncogene 2022, 41, 1767–1779. [Google Scholar] [CrossRef]
Chaves-Moreira, D.; Mitchell, M.A.; Arruza, C.; Rawat, P.; Sidoli, S.; Nameki, R.; Reddy, J.; Corona, R.I.; Afeyan, L.K.; Klein, I.A.; et al. The transcription factor PAX8 promotes angiogenesis in ovarian cancer through interaction with SOX17. Sci. Signal. 2022, 15, eabm2496. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the steps used to obtain the most important features for classification.

Figure 2. Number of samples in the database before applying undersampling to balance the classes.

Figure 3. Training curves for different models: (a) Training curve for the DT model; (b) Training curve for the RF model; and (c) Training curve for the XGB model.

Figure 4. SHAP summary plot for the decision tree multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.

Figure 5. SHAP summary plot for the random forest multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.

Figure 6. SHAP summary plot for the XGBoost multiclass classification model. The plot displays the contributions of features (genes) to the prediction of cancer types: breast cancer (BRCA), lung adenocarcinoma (LUAD), thyroid cancer (THCA), ovarian cancer (OV), and colon adenocarcinoma (COAD). Features are ranked by maximum average SHAP values, highlighting the most important genes for distinguishing between the classes.
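
The ranking described in the captions of Figures 4 to 6 can be illustrated with a minimal sketch. It assumes that "maximum average SHAP value" means the mean absolute SHAP value of a gene for each class, followed by the maximum over the five classes; the captions do not state this explicitly. The function name, the array layout (samples, features, classes), and the toy numbers are hypothetical and are not taken from the study.

```python
import numpy as np


def rank_by_max_mean_abs_shap(shap_values, feature_names, top_k=None):
    """Rank features by the maximum over classes of the mean |SHAP| value.

    shap_values: array of shape (n_samples, n_features, n_classes).
    This layout is an assumption made for the sketch.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    # Mean absolute contribution per feature and class: (n_features, n_classes)
    mean_abs = np.abs(shap_values).mean(axis=0)
    # Keep, for each feature, the class where it matters most: (n_features,)
    score = mean_abs.max(axis=1)
    # Sort features from highest to lowest score
    order = np.argsort(-score, kind="stable")
    if top_k is not None:
        order = order[:top_k]
    return [(feature_names[i], float(score[i])) for i in order]


# Toy SHAP values for 2 samples, 3 genes, 2 classes (made-up numbers).
toy_values = np.zeros((2, 3, 2))
toy_values[:, 0, 0] = [0.1, -0.3]   # TG, class 0
toy_values[:, 0, 1] = [0.05, 0.05]  # TG, class 1
toy_values[:, 1, 1] = [0.4, -0.6]   # GATA3, class 1
toy_values[:, 2, 0] = [0.1, 0.1]    # NAPSA, class 0

ranking = rank_by_max_mean_abs_shap(toy_values, ["TG", "GATA3", "NAPSA"])
for gene, value in ranking:
    print(f"{gene}: {value:.3f}")
```

Taking the maximum over classes keeps a gene that is decisive for a single tumor type near the top of the list, even when its contribution to the other classes is close to zero.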

Table 1. Classification results: Original features vs. SHAP-selected features.

Classifier	Metric	Original Features	SHAP-Selected Features
Decision Tree	Accuracy	97.60%	98.69%
	Precision	97.74%	98.74%
	Recall	97.70%	98.70%
	F1 Score	97.64%	98.70%
Random Forest	Accuracy	99.40%	99.76%
	Precision	99.43%	99.77%
	Recall	99.40%	99.87%
	F1 Score	99.40%	99.86%
XGBoost	Accuracy	99.34%	99.64%
	Precision	99.36%	99.66%
	Recall	99.34%	99.79%
	F1 Score	99.34%	99.80%
Gaussian Naive Bayes	Accuracy	98.93%	99.63%
	Precision	98.88%	99.55%
	Recall	98.82%	99.65%
	F1 Score	98.84%	99.60%
Bernoulli Naive Bayes	Accuracy	97.94%	99.22%
	Precision	97.14%	99.19%
	Recall	97.41%	99.26%
	F1 Score	97.20%	99.22%
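
As an illustration of how the four metrics in Table 1 can be obtained for a five-class problem, the sketch below computes accuracy and macro-averaged precision, recall, and F1 score from a confusion matrix. Macro averaging is an assumption, since the table does not state which averaging scheme was used. The function name and the toy labels are hypothetical and do not reproduce the study's data.

```python
import numpy as np


def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    index = {label: i for i, label in enumerate(labels)}
    # Confusion matrix: rows are true classes, columns are predicted classes
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1

    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # samples predicted as each class
    actual = cm.sum(axis=1)     # samples truly in each class

    # Per-class scores; a class with an empty denominator scores 0
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }


# Toy example: ten samples, one LUAD sample misclassified as BRCA.
classes = ["BRCA", "LUAD", "THCA", "OV", "COAD"]
y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "THCA", "THCA", "OV", "OV", "COAD", "COAD"]
y_pred = ["BRCA", "BRCA", "LUAD", "BRCA", "THCA", "THCA", "OV", "OV", "COAD", "COAD"]

scores = macro_metrics(y_true, y_pred, classes)
for name, value in scores.items():
    print(f"{name}: {value:.2%}")
```

With macro averaging every cancer type carries the same weight, so a single misclassification lowers recall for the affected true class and precision for the class it was assigned to.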

Table 2. SHAP genes related to XGBoost.

Gene Abbreviation	Gene Name
TG	Thyroglobulin
CDX1	Caudal-type homeobox 1
SOX17	SRY-box transcription factor 17
HKDC1	Hexokinase domain-containing 1
SFTPA2	Surfactant protein A2
FABP1	Fatty acid binding protein 1
LMX1B	LIM homeobox transcription factor 1 beta
NAPSA	Napsin A aspartic peptidase
MEIS1	Meis homeobox 1
TBX4	T-box transcription factor 4
SFTPA1	Surfactant protein A1
GATA3	GATA binding protein 3
EMX2	Empty spiracles homeobox 2
TRPS1	Transcriptional repressor GATA binding 1
FOXA2	Forkhead box A2
HAND2	Heart and neural crest derivatives expressed 2
GPA33	Glycoprotein A33
TFPI	Tissue factor pathway inhibitor
RPL10AP6	Ribosomal protein L10a pseudogene 6
FGG	Fibrinogen gamma chain

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Dalmolin, M.; Azevedo, K.S.; Souza, L.C.d.; de Farias, C.B.; Lichtenfels, M.; Fernandes, M.A.C. Feature Selection in Cancer Classification: Utilizing Explainable Artificial Intelligence to Uncover Influential Genes in Machine Learning Models. AI 2025, 6, 2. https://doi.org/10.3390/ai6010002

