Cancer Classification from Gene Expression Using Ensemble Learning with an Influential Feature Selection Technique
Abstract
1. Introduction
1.1. Literature Review and Problem Statement
1.2. Contributions
- We reduce the feature dimension of the gene expression datasets with the mutual information (MI) algorithm, retaining the most significant and relevant features for cancer classification.
- We apply bagging as the ensemble technique, with a Multilayer Perceptron (MLP) as the base learner. The MLP has three hidden layers; the numbers of input and output nodes match the number of selected features and the number of cancer classes, respectively.
- We assess the efficacy of the proposed model, MI-Bagging, which achieves higher accuracy on each of the five benchmark gene expression datasets, covering cancers of various types.
2. Materials and Methods
2.1. Data Collection
2.2. Data Preprocessing
2.3. Feature Selection
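As a rough illustration of MI-based gene ranking, the sketch below scores each feature by its mutual information with the class label, I(X; Y) = Σ p(x, y) log[p(x, y) / (p(x) p(y))], and keeps the top k. It is not the authors' implementation: the equal-width binning, the bin count, and the names `discretize`, `mutual_information`, and `select_top_k` are assumptions made only for this example.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Equal-width binning of one continuous feature column (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant feature: a single bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """I(X; Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))) written with raw counts
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_top_k(X, y, k, n_bins=3):
    """Return (indices of the k highest-MI columns of X, all MI scores)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scores.append(mutual_information(discretize(column, n_bins), y))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy expression matrix (hypothetical values): 6 samples x 3 genes, 2 classes.
X_toy = [
    [0.10, 5.0, 0.3],
    [0.20, 5.0, 0.9],
    [0.15, 5.0, 0.5],
    [0.90, 5.0, 0.4],
    [0.80, 5.0, 0.8],
    [0.95, 5.0, 0.6],
]
y_toy = [0, 0, 0, 1, 1, 1]
selected, scores = select_top_k(X_toy, y_toy, k=1)
```

On this toy matrix the first column separates the two classes and is ranked first, while the constant second column scores zero. For real data, k would be set per dataset, as in the selected-feature counts reported in the results table.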
2.4. Ensemble Classifier
1. Bootstrapping: Bootstrap samples are created by row sampling with replacement, so the same instance can appear multiple times in one sample. Ten bootstrap samples are created at this stage.
2. Parallel Training: A base MLP classifier is then trained on each bootstrap sample, independently and in parallel. The MLP consists of an input layer, three hidden layers, and an output layer. The sizes of the input and output layers are fixed by the target dataset, i.e., the number of features and the number of classes, respectively. The basic structure of the MLP is shown in Figure 7.
3. Voting: Every weak learner (MLP) predicts the target class. Hard voting, also referred to as majority voting, is then applied: the class that obtains the most votes is taken as the final prediction.
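The three steps above can be sketched in a few lines of dependency-free Python. To keep the example short, a nearest-centroid classifier stands in for the MLP base learner; that substitution, the class name `NearestCentroid`, and the helper names `bootstrap_sample`, `fit_bagging`, and `hard_vote` are assumptions for illustration, not the authors' code. The learners are trained in a plain loop here; because each one depends only on its own bootstrap sample, the loop could be parallelized.

```python
import random
from collections import Counter

N_ESTIMATORS = 10  # the text specifies ten bootstrap samples


def bootstrap_sample(X, y, rng):
    """Row sampling with replacement; an instance may be drawn several times."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


class NearestCentroid:
    """Hypothetical stand-in for the MLP base learner used in the paper."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, row):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


def fit_bagging(X, y, n_estimators=N_ESTIMATORS, seed=0):
    """Steps 1 and 2: draw bootstrap samples and fit one learner on each."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        Xb, yb = bootstrap_sample(X, y, rng)
        models.append(NearestCentroid().fit(Xb, yb))
    return models


def hard_vote(models, row):
    """Step 3: majority voting over the learners' predicted classes."""
    votes = Counter(m.predict_one(row) for m in models)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes in two dimensions.
X_toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.1, 0.0],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]]
y_toy = ["A"] * 4 + ["B"] * 4
models = fit_bagging(X_toy, y_toy)
prediction = hard_vote(models, [0.1, 0.1])
```

With tied vote counts, `Counter.most_common` returns the class that was counted first; a real implementation may want an explicit tie-breaking rule.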
3. Experimental Setup
4. Performance Evaluation
Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Dataset | Total Features | Total Samples | Number of Classes |
---|---|---|---|
DS 1 | 7128 | 72 | 2 |
DS 2 | 16,382 | 130 | 5 |
DS 3 | 16,383 | 801 | 5 |
DS 4 | 12,533 | 174 | 11 |
DS 5 | 4656 | 81 | 4 |
Dataset | Total Samples (after SMOTE Operation) | Sample Number per Class |
---|---|---|
DS 1 | 94 | 47 |
DS 2 | 230 | 46 |
DS 3 | 1500 | 300 |
DS 4 | 297 | 27 |
DS 5 | 124 | 31 |
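After balancing, every class holds the same number of samples, so each total in the table above equals the number of classes times the per-class count (for example, 11 × 27 = 297 for DS 4). The core SMOTE step creates a synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows that step only; the function name `smote_oversample`, the default `k`, and the seed handling are assumptions for illustration, and the preprocessing pipeline actually used may differ.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from one minority class (SMOTE-style).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nn)])
    return synthetic


# Toy minority class (hypothetical values) with three samples.
minority_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_oversample(minority_toy, n_new=4, k=2)
```

To balance a dataset, `n_new` for each class would be the largest class size minus that class's own size.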
Dataset | Number of Genes (All Features) | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
DS 1 | 7128 | 100% | 100% | 100% | 100% |
DS 2 | 16,382 | 96.36% | 97.10% | 95.65% | 95.87% |
DS 3 | 16,383 | 99.66% | 99.67% | 99.67% | 99.67% |
DS 4 | 12,533 | 95.15% | 97.08% | 96.67% | 96.69% |
DS 5 | 4656 | 87.50% | 92.00% | 88.00% | 88.22% |
Dataset | Number of Genes (Selected Features) | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
DS 1 | 100 | 100% | 100% | 100% | 100% |
DS 2 | 500 | 100% | 100% | 100% | 100% |
DS 3 | 500 | 100% | 100% | 100% | 100% |
DS 4 | 1500 | 98.48% | 98.75% | 98.33% | 98.38% |
DS 5 | 2000 | 95.83% | 96.57% | 96.00% | 95.97% |
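The four metrics reported in the tables above can be computed from true and predicted labels as sketched below. This section does not state how precision, recall, and F1-score are averaged over classes; the sketch assumes macro averaging (the unweighted mean of per-class values), and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n_classes = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Toy labels (hypothetical): one of four predictions is wrong.
report = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Under macro averaging, accuracy and recall need not coincide when classes differ in size in the test split.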
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tabassum, N.; Kamal, M.A.S.; Akhand, M.A.H.; Yamada, K. Cancer Classification from Gene Expression Using Ensemble Learning with an Influential Feature Selection Technique. BioMedInformatics 2024, 4, 1275-1288. https://doi.org/10.3390/biomedinformatics4020070