Feature Selection Techniques For Microarray Dataset: A Review
Corresponding Author:
Avinash Nagaraja Rao
Department of Computer Science and Engineering, Rabindranath Tagore University
Bhopal, India
Email: avi003@gmail.com
1. INTRODUCTION
Features are central to the classification problems addressed by machine learning (ML)
techniques, and selecting the most relevant features improves the accuracy of the downstream optimization algorithm
[1]. Microarray datasets typically contain very few samples and very many features: conventionally,
most microarray datasets include more than 60,000 characteristics or attributes but fewer than 100 samples [2].
For this reason, the important or relevant features are identified using feature selection (FS)
techniques, namely filter, wrapper, and embedded methods, which substantially improves
classification accuracy and reduces computational time [3]. The primary intentions of this review are:
i) to survey the FS algorithms that achieve the best accuracy in microarray dataset analysis; ii) to introduce the
various ML techniques applied to microarray datasets; iii) to examine publication trends in FS for
microarray data; iv) to consolidate the research carried out on the different FS techniques by
various authors and experts; and v) to outline the future scope of research on real-time microarray datasets.
Optimal results are obtained on microarray datasets by applying appropriate FS techniques. The
conventional FS approaches, namely the filter, wrapper, embedded, and hybrid methods, are
favoured for subset selection. Figure 1 depicts the standard procedure of an FS
process. The first step is to use a reliable search strategy to generate a candidate subset of the microarray dataset. The second
step evaluates the candidate subsets and compares the best subset with its predecessor.
If the newly generated subset scores better than the previous one, it replaces it; otherwise the previous subset is kept.
The procedure repeats until the termination condition is met. Finally, the best-scoring subset is selected
and passed to the classification stage. Selecting a feature subset is achieved using Algorithm 1 [4].
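Algorithm 1 itself is not reproduced here; the loop it describes — generate a candidate subset, evaluate it, keep it if it beats the best so far, and stop when no candidate improves — can be sketched minimally as follows. The greedy forward search and the toy scoring function are illustrative assumptions, not the exact algorithm from [4]; in practice the score would be a classifier's validation accuracy.

```python
def forward_selection(features, score):
    """Greedy sequential forward selection: one possible search strategy
    for the generate/evaluate/compare loop described above."""
    selected, best_score = [], float("-inf")
    improved = True
    while improved:                          # termination condition
        improved = False
        for f in (f for f in features if f not in selected):
            candidate = selected + [f]       # generate a candidate subset
            s = score(candidate)             # evaluate the candidate
            if s > best_score:               # compare with the previous best
                selected, best_score = candidate, s
                improved = True
    return selected, best_score

# Toy score: pretend genes "g1" and "g3" are the informative ones,
# and every extra gene costs a small penalty.
informative = {"g1": 0.4, "g3": 0.35}
score = lambda subset: sum(informative.get(f, -0.05) for f in subset)

best, s = forward_selection(["g1", "g2", "g3", "g4"], score)
```

The search stops as soon as a full pass adds no gene that raises the score, returning `["g1", "g3"]` for this toy scorer.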
Figure 2 shows the complete workflow of the filter, wrapper, and embedded models, which find the best
subset among all the features available in the dataset. Both the wrapper and embedded methods include a stopping
condition to validate the best subset [5]. The filter method does not involve a particular learning algorithm or
validation of the best subset, whereas the wrapper method selects features using the learning algorithm and
validates the optimal feature subset [6]. The embedded method combines the filter and wrapper approaches.
Based on this study, the inferences and future scope of each method are summarized in Table 1.
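As a concrete illustration of the filter model, which ranks features without involving any learning algorithm, the sketch below scores synthetic "genes" by their absolute Pearson correlation with the class label. The data and the correlation criterion are illustrative assumptions, not taken from any specific paper reviewed here.

```python
import numpy as np

# Synthetic stand-in for microarray data: 100 samples, 5 candidate "genes",
# with gene 2 made informative about the binary class label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)   # binary class labels
X = rng.normal(size=(100, 5))
X[:, 2] += 2.0 * y                               # shift gene 2 by class

def filter_rank(X, y):
    """Return feature indices sorted by |corr(feature, label)|, best first."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

order, scores = filter_rank(X, y)
```

Because the ranking depends only on a statistic of each feature against the label, it is fast and classifier-independent, which is exactly the trade-off the filter model makes against wrapper-style validation.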
2. LITERATURE REVIEW
This section dives into two key aspects of analyzing microarray data: FS techniques and the overall
review technology roadmap. We'll explore the fundamental methods for selecting the most informative genes
from these VAST datasets, followed by a roadmap outlining the steps involved in effectively reviewing and
analyzing this technology.
optimization, harmony [11], differential evolution, whale optimization, artificial bee colony, and bacterial
colony optimization [12].
2.3. Embedded method
The embedded FS technique is a classifier-dependent FS method [13]. The learning algorithm
plays a vital role in the embedded method, and researchers commonly prefer it for its
low computational cost. Irrelevant features are removed using widely used techniques such as the weight vector
of a support vector machine (SVM), decision trees, and weighted naive Bayes (NB). Representative embedded
methods include the first-order inductive learner (FOIL) feature subset selection algorithm, probably
approximately correct (PAC) Bayes, the kernel-penalized support vector machine (KP-SVM), and the least absolute
shrinkage and selection operator (LASSO).
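To make the LASSO case concrete: the L1 penalty drives the coefficients of irrelevant features to exactly zero during training, so feature selection falls out of the fitting itself. The minimal coordinate-descent sketch below, on synthetic standardized data, is an illustrative toy implementation, not the KP-SVM or any specific method from the cited works.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrinkage operator at the heart of the L1 penalty."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Tiny coordinate-descent LASSO; zeroed coefficients = dropped features."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual without j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
X = (X - X.mean(0)) / X.std(0)                    # standardize columns
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=80)

b = lasso_cd(X, y, lam=0.1)
kept = [j for j in range(6) if abs(b[j]) > 1e-6]  # surviving features
```

Only the two truly informative columns survive the penalty; selection and model fitting happened in one pass, which is the low-cost property the embedded method is preferred for.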
The literature reveals a variety of approaches dealing with critical research
questions (CRQ); this section details the review methodology employed throughout this paper.
In line with the intentions framed for this study, the CRQ were formulated, and the roadmap of the review
methodology is presented in Table 2. In the initial stage, we framed the CRQ around the FS techniques used
on microarray datasets. In the second stage, we selected 41 recent manuscripts related to the topic.
In the next stage, the FS techniques used in these articles were analyzed. We then focused on the
classification accuracy achieved with various ML techniques. The final stage of the review methodology is to
provide future directions to researchers based on these insights. The key questions that shape
this study are listed in Table 2.
2.10. Challenges
The following observations are the challenges faced when working with these features and datasets: i) the
computational time is high due to the large number of features in the gene expression data, together with noise; and ii)
an unbalanced dataset affects the training and test splits, so accuracy becomes another issue.
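One common mitigation for the second challenge is a stratified train/test split, so that each class contributes the same proportion of samples to both splits and rare classes are not lost. The sketch below is a minimal illustration on toy labels, not a procedure taken from the reviewed papers.

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Split sample indices so every class appears in the test set
    in (roughly) the same proportion as in the full dataset."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * test_frac))  # per-class test share
        test += idxs[:k]
        train += idxs[k:]
    return train, test

labels = ["tumor"] * 90 + ["normal"] * 10         # unbalanced toy dataset
train, test = stratified_split(labels)
```

With a plain random split, the 10 "normal" samples could easily be over- or under-represented in the test set; stratification guarantees 3 of them land there.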
Table 4 (see the Appendix) details the review methodology: different journals were referred to, the
different methods proposed for FS were examined, the accuracy achieved by each method was measured, and the
feature enhancements possible for each were discussed.
3.6. Outliers
One of the most important yet under-discussed aspects of microarray data is the identification of
outliers. Outliers are contaminated samples in the database that arise when instruments or humans make
mistakes during data collection or analysis [34]. Outliers hinder the learning process because they
prevent the useful genes from being chosen.
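A simple screening step can flag such contaminated samples before FS runs. The sketch below marks any sample whose expression value sits more than a z-score threshold from the per-gene mean; the threshold and the z-score criterion are illustrative choices, not the multicriterion approach of [34].

```python
import numpy as np

def flag_outliers(X, z_thresh=3.0):
    """Return indices of samples with any |z-score| above the threshold."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)      # per-gene standardization
    return np.where(np.any(np.abs(z) > z_thresh, axis=1))[0]

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))                      # 50 samples, 4 "genes"
X[13, 2] = 25.0                                   # inject one corrupted value
outliers = flag_outliers(X)
```

Removing (or down-weighting) the flagged samples before subset search keeps a single corrupted measurement from distorting the ranking of otherwise useful genes.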
APPENDIX
REFERENCES
[1] M. A. Hambali, T. O. Oladele, and K. S. Adewole, “Microarray cancer feature selection: review, challenges and research directions,”
International Journal of Cognitive Computing in Engineering, vol. 1, pp. 78–97, 2020, doi: 10.1016/j.ijcce.2020.11.001.
[2] V. Bolón-Canedo and B. Remeseiro, “Feature selection in image analysis: a survey,” Artificial Intelligence Review, vol. 53, no. 4,
pp. 2905–2931, 2020, doi: 10.1007/s10462-019-09750-3.
[3] R. C. Chen, C. Dewi, S. W. Huang, and R. E. Caraka, “Selecting critical features for data classification based on machine learning
methods,” Journal of Big Data, vol. 7, no. 1, pp. 1–26, 2020, doi: 10.1186/s40537-020-00327-4.
[4] B. Remeseiro and V. Bolon-Canedo, “A review of feature selection methods in medical applications,” Computers in Biology and
Medicine, vol. 112, pp. 1–9, 2019, doi: 10.1016/j.compbiomed.2019.103375.
[5] S. Shadravan, H. R. Naji, and V. K. Bardsiri, “The sailfish optimizer: a novel nature-inspired metaheuristic algorithm for solving
constrained engineering optimization problems,” Engineering Applications of Artificial Intelligence, vol. 80, pp. 20–34, 2019, doi:
10.1016/j.engappai.2019.01.001.
[6] K. Tadist, S. Najah, N. S. Nikolov, F. Mrabti, and A. Zahi, “Feature selection methods and genomic big data: a systematic review,”
Journal of Big Data, vol. 6, no. 1, pp. 1–24, 2019, doi: 10.1186/s40537-019-0241-0.
[7] T. Saw and P. Hnin, “Swarm intelligence based feature selection for high dimensional classification: a literature survey,”
International Journal of Computer (IJC), vol. 33, no. 1, pp. 69–83, 2019.
[8] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, no. 19,
pp. 2507–2517, 2007, doi: 10.1093/bioinformatics/btm344.
[9] A. Mangal and E. A. Holm, “A comparative study of feature selection methods for stress hotspot classification in materials,”
Integrating Materials and Manufacturing Innovation, vol. 7, no. 3, pp. 87–95, 2018, doi: 10.1007/s40192-018-0109-8.
[10] R. R. Rani and D. Ramyachitra, “Microarray cancer gene feature selection using spider monkey optimization algorithm and cancer
classification using SVM,” Procedia Computer Science, vol. 143, pp. 108–116, 2018, doi: 10.1016/j.procs.2018.10.358.
[11] Y. Prasad, K. K. Biswas, and M. Hanmandlu, “A recursive PSO scheme for gene selection in microarray data,” Applied Soft
Computing Journal, vol. 71, pp. 213–225, 2018, doi: 10.1016/j.asoc.2018.06.019.
[12] B. Sahu, S. Dehuri, and A. K. Jagadev, “Feature selection model based on clustering and ranking in pipeline for microarray data,”
Informatics in Medicine Unlocked, vol. 9, pp. 107–122, 2017, doi: 10.1016/j.imu.2017.07.004.
[13] M. K. Ebrahimpour, H. Nezamabadi-pour, and M. Eftekhari, “CCFS: a cooperating coevolution technique for large scale feature selection
on microarray datasets,” Computational Biology and Chemistry, vol. 73, pp. 171–178, 2018, doi: 10.1016/j.compbiolchem.2018.02.006.
[14] C. Arunkumar and S. Ramakrishnan, “Attribute selection using fuzzy rough set based customized similarity measure for lung cancer
microarray gene expression data,” Future Computing and Informatics Journal, no. 3, pp. 131–142, 2018.
[15] H. Dong, T. Li, R. Ding, and J. Sun, “A novel hybrid genetic algorithm with granular information for feature selection and
optimization,” Applied Soft Computing Journal, vol. 65, pp. 33–46, 2018, doi: 10.1016/j.asoc.2017.12.048.
[16] S. Maldonado, R. Weber, and F. Famili, “Feature selection for high-dimensional class-imbalanced data sets using support vector
machines,” Information Sciences, vol. 286, pp. 228–246, 2014, doi: 10.1016/j.ins.2014.07.015.
[17] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, and J. H. Moore, “Relief-based feature selection: introduction and review,”
Journal of Biomedical Informatics, vol. 85, pp. 189–203, 2018, doi: 10.1016/j.jbi.2018.07.014.
[18] E. H. Houssein, M. E. Hosney, M. Elhoseny, D. Oliva, W. M. Mohamed, and M. Hassaballah, “Hybrid Harris hawks optimization
with cuckoo search for drug design and discovery in chemoinformatics,” Scientific Reports, vol. 10, no. 1, pp. 1–22, 2020, doi:
10.1038/s41598-020-71502-z.
[19] A. E. Hegazy, M. A. Makhlouf, and G. S. El-Tawel, “Improved salp swarm algorithm for feature selection,” Journal of King Saud
University - Computer and Information Sciences, vol. 32, no. 3, pp. 335–344, 2020, doi: 10.1016/j.jksuci.2018.06.003.
[20] Y. Gao, Y. Zhou, and Q. Luo, “An efficient binary equilibrium optimizer algorithm for feature selection,” IEEE Access, vol. 8, pp.
140936–140963, 2020, doi: 10.1109/ACCESS.2020.3013617.
[21] A. E. Hegazy, M. A. Makhlouf, and G. S. El-Tawel, “Feature selection using chaotic salp swarm algorithm for data classification,”
Arabian Journal for Science and Engineering, vol. 44, no. 4, pp. 3801–3816, 2019, doi: 10.1007/s13369-018-3680-6.
[22] R. Storn and K. Price, “Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces,”
Journal of Global Optimization, vol. 11, pp. 341–359, 1997.
[23] S. Mirjalili and A. Lewis, “The whale optimization algorithm,” Advances in Engineering Software, vol. 95, pp. 51–67, 2016, doi:
10.1016/j.advengsoft.2016.01.008.
[24] W. Gao, S. Liu, and L. Huang, “A global best artificial bee colony algorithm for global optimization,” Journal of Computational
and Applied Mathematics, vol. 236, no. 11, pp. 2741–2753, 2012, doi: 10.1016/j.cam.2012.01.013.
[25] S. H. Bouazza, K. Auhmani, A. Zeroual, and N. Hamdi, “Selecting significant marker genes from microarray data by filter approach
for cancer diagnosis,” Procedia Computer Science, vol. 127, pp. 300–309, 2018, doi: 10.1016/j.procs.2018.01.126.
[26] A. K. Shukla, P. Singh, and M. Vardhan, “A hybrid gene selection method for microarray recognition,” Biocybernetics and
Biomedical Engineering, vol. 38, no. 4, pp. 975–991, 2018, doi: 10.1016/j.bbe.2018.08.004.
[27] M. J. Rani and D. Devaraj, “Two-stage hybrid gene selection using mutual information and genetic algorithm for cancer data
classification,” Journal of Medical Systems, vol. 43, no. 8, 2019, doi: 10.1007/s10916-019-1372-8.
[28] S. Maldonado and J. López, “Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM
classification,” Applied Soft Computing Journal, vol. 67, pp. 94–105, 2018, doi: 10.1016/j.asoc.2018.02.051.
[29] J. R. Anaraki and H. Usefi, “A comparative study of feature selection methods on genomic datasets,” in Proceedings - IEEE
Symposium on Computer-Based Medical Systems, 2019, pp. 471–476, doi: 10.1109/CBMS.2019.00097.
[30] R. Aziz, C. K. Verma, and N. Srivastava, “Dimension reduction methods for microarray data: a review,” AIMS Bioengineering, vol.
4, no. 1, pp. 179–197, 2017, doi: 10.3934/bioeng.2017.1.179.
[31] K. Balakrishnan, R. Dhanalakshmi, and U. M. Khaire, “Improved salp swarm algorithm based on the levy flight for feature
selection,” Journal of Supercomputing, vol. 77, no. 11, pp. 12399–12419, 2021, doi: 10.1007/s11227-021-03773-w.
[32] M. Liu, X. Yao, and Y. Li, “Hybrid whale optimization algorithm enhanced with Lévy flight and differential evolution for job shop
scheduling problems,” Applied Soft Computing Journal, vol. 87, 2020, doi: 10.1016/j.asoc.2019.105954.
[33] B. Seijo-Pardo, I. Porto-Díaz, V. Bolón-Canedo, and A. Alonso-Betanzos, “Ensemble feature selection: homogeneous and
heterogeneous approaches,” Knowledge-Based Systems, vol. 118, pp. 124–139, 2017, doi: 10.1016/j.knosys.2016.11.017.
[34] F. Yang and K. Z. Mao, “Robust feature selection for microarray data based on multicriterion fusion,” IEEE/ACM Transactions on
Computational Biology and Bioinformatics, vol. 8, no. 4, pp. 1080–1092, 2011, doi: 10.1109/TCBB.2010.103.
BIOGRAPHIES OF AUTHORS
Dr. Sitesh Kumar Sinha received his Ph.D. degree from BRAB MIT,
Muzaffarpur, Bihar. Currently, he is working as registrar (administration) at Dr. C. V. Raman
University, Vaishali, Bihar. During his Ph.D. work he completed a government-funded project
on computer networks. He has published more than 40 research papers in various
international and national journals. His main research work focuses on network security,
image processing, and software engineering. He can be contacted at email:
siteshkumarsinha@gmail.com.