Literature Review On Feature Selection Methods For High-Dimensional Data
S. Appavu alias Balamurugan
E. Jebamalar Leavline
Anna University, BIT Campus, Tiruchirappalli, India.
ABSTRACT
Feature selection plays a significant role in improving the
performance of machine learning algorithms by reducing the
time to build the learning model and increasing the accuracy
of the learning process. Therefore, researchers pay close
attention to feature selection to enhance the performance of
machine learning algorithms. Identifying a suitable feature
selection method is essential for a given machine learning
task with high-dimensional data. Hence, a study of the various
feature selection methods is required for the research
community, especially those dedicated to developing suitable
feature selection methods for enhancing the performance of
machine learning tasks on high-dimensional data. To fulfill
this objective, this paper presents a complete literature review
of the various feature selection methods for high-dimensional
data.
General Terms
Literature review on feature selection methods, study on
feature selection, wrapper-based feature selection, embedded-based
feature selection, hybrid feature selection, filter-based
feature selection, feature subset-based feature selection,
feature ranking-based feature selection, attribute selection,
dimensionality reduction, variable selection, survey on feature
selection, feature selection for high-dimensional data,
introduction to variable and feature selection, feature selection
for classification.
Keywords
Introduction to variable and feature selection, information
gain-based feature selection, gain ratio-based feature
selection, symmetric uncertainty-based feature selection,
subset-based feature selection, ranking-based feature
selection, wrapper-based feature selection, embedded-based
feature selection, filter-based feature selection, hybrid feature
selection, selecting features from high-dimensional data.
1. INTRODUCTION
In the digital era, handling massive data is a challenging task
for researchers, since data are accumulated through various
data acquisition techniques, methods, and devices. These
accumulated massive raw data degrade the performance of
machine learning algorithms: they cause overfitting, increase
the time required to develop the machine learning models, and
degrade their accuracy, since the raw data are noisy in nature
and have a large number of features, known as
high-dimensional data. In general, high-dimensional data
contain irrelevant and redundant features. The irrelevant
features do not contribute to the learning process, and the
redundant features carry the same information; hence, they
mislead the learning process. Therefore, these issues can be
tackled by feature selection.
2. FEATURE SELECTION
Feature selection is a process of removing the irrelevant and
redundant features from a dataset in order to improve the
performance of the machine learning algorithms in terms of
accuracy and time to build the model. The process of feature
selection is classified into two categories namely feature
subset selection and feature ranking methods based on how
the features are combined for evaluation. The feature subset
selection approach generates possible combinations of feature
subsets using a searching strategy such as greedy forward
selection, greedy backward elimination, etc., and evaluates
each feature subset with a feature selection metric such as
correlation, consistency, etc. In this method, the space and
computational complexity involved are high due to subset
generation and evaluation [2].
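As an illustration (not from the paper), a greedy forward selection loop can be sketched as follows. The merit function here is a hypothetical correlation-based measure in the spirit of CFS [3]: relevance (feature-class correlation) divided by a redundancy penalty (average inter-feature correlation). Both function names and the stopping rule are assumptions for this sketch.

```python
import numpy as np

def cfs_merit(Xs, y):
    """CFS-style merit (assumed form): k*rcf / sqrt(k + k*(k-1)*rff)."""
    k = Xs.shape[1]
    # average absolute feature-class correlation (relevance)
    rcf = np.mean([abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(k)])
    # average absolute feature-feature correlation (redundancy)
    rff = 0.0 if k == 1 else np.mean(
        [abs(np.corrcoef(Xs[:, i], Xs[:, j])[0, 1])
         for i in range(k) for j in range(i + 1, k)])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

def forward_select(X, y, score, k):
    """Greedy forward selection: grow the subset one feature at a time,
    stopping when no candidate improves the current subset's score."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        if selected and score(X[:, selected + [best]], y) <= score(X[:, selected], y):
            break  # adding any feature only lowers the merit
        selected.append(best)
        remaining.remove(best)
    return selected

# toy data: feature 0 tracks the target, features 1-2 are noise
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     rng.normal(size=200),
                     rng.normal(size=200)])
print(forward_select(X, y, cfs_merit, k=2))  # the informative feature is picked first
```

Note that every candidate subset is re-scored at each step, which is exactly the source of the cost the paragraph above describes.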
In the feature ranking method, each feature is ranked by a
selection metric such as information gain, symmetric
uncertainty, gain ratio, etc., and the top-ranked features are
selected as relevant features using a pre-defined threshold
value. This approach is computationally cheaper and its space
complexity is lower compared to the subset approach.
However, it does not deal with redundant features.
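A minimal sketch of this ranking scheme, using information gain over discrete features (function names and the threshold convention are assumptions, not taken from the paper):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        idx = [i for i, x in enumerate(feature) if x == v]
        cond += len(idx) / n * entropy([labels[i] for i in idx])
    return entropy(labels) - cond

def rank_features(X, y, threshold=0.0):
    """Score every feature independently, keep those above the threshold,
    sorted by decreasing gain. X is a list of rows."""
    gains = [(j, information_gain([row[j] for row in X], y))
             for j in range(len(X[0]))]
    return [j for j, g in sorted(gains, key=lambda t: -t[1]) if g > threshold]

# feature 0 determines the class; feature 1 is uninformative
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
print(rank_features(X, y))  # [0]
```

Each feature is scored independently in a single pass, which is why the method is cheap; note that two identical (redundant) copies of feature 0 would both receive gain 1.0 and both be kept, illustrating the limitation stated above.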
Further, the process of feature selection is classified into four
categories namely wrapper, embedded, filter, and hybrid
methods, based on how the learning algorithm is involved in
the selection process.
3.2.1 Wrapper-based methods
3.2.2 Embedded-based methods
3.2.3 Filter-based methods
3.2.4 Hybrid methods
4. SUMMARY
This section summarizes the feature selection methods that are
categorized based on how the features are combined in the
selection process, namely feature subset-based and feature
ranking-based, and based on how the supervised learning
algorithm is used, namely wrapper, embedded, hybrid, and
filter. The subset-based methods generate the feature subsets
using a searching strategy for evaluation. The exhaustive or
complete search leads to high computational complexity,
since up to 2^N possible combinations of subsets must be
generated from the N features and evaluated. This is a
brute-force method, so it is not suitable for high-dimensional
spaces. Heuristic searches such as simulated annealing (SA),
tabu search (TS), ant colony optimization (ACO), genetic
algorithms (GA), and particle swarm optimization (PSO) are
employed to reduce the number of feature subsets generated
for evaluation using a heuristic function. Nevertheless, the
subset-based feature selection methods using heuristic search
still incur considerable computational complexity, because
they need prior knowledge and a classification model must be
developed to evaluate each generated subset.
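The 2^N blow-up is easy to make concrete (a small illustration, not from the paper): enumerating every non-empty subset of even a modest feature set is already large, and each enumerated subset would additionally need a model built for its evaluation.

```python
from itertools import combinations

def all_subsets(n):
    """Exhaustive search space: every non-empty subset of n features,
    i.e. 2^n - 1 candidates to evaluate."""
    for k in range(1, n + 1):
        yield from combinations(range(n), k)

print(sum(1 for _ in all_subsets(10)))  # 1023 subsets for just 10 features
print(2 ** 100 - 1)  # at 100 features the count is already astronomical
```

This is why complete search is dismissed for high-dimensional data, and why the heuristic searches (SA, TS, ACO, GA, PSO) only sample this space rather than enumerate it.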
5. CONCLUSION
This paper analyzed several feature selection methods
proposed by various researchers. From the earlier research
works, it is observed that the feature ranking-based methods
are better than the subset-based methods in terms of memory
space and computational complexity, although the
ranking-based methods do not reduce redundancy. Further, the
wrapper, embedded, and hybrid methods are computationally
less efficient than the filter method.
6. REFERENCES
[1] Saeys, Y, Inza, I & Larrañaga, P 2007, A review of
feature selection techniques in bioinformatics,
Bioinformatics, vol. 23, no. 19, pp. 2507-2517.
[2] Bolón-Canedo, V, Sánchez-Maroño, N & Alonso-Betanzos,
A 2013, A review of feature selection methods on
synthetic data, Knowledge and Information Systems,
vol. 34, no. 3, pp. 483-519.
[3] Hall, MA 1999, Correlation-based feature selection for
machine learning, Ph.D. thesis, The University of
Waikato, New Zealand.
[4] Liu, H & Setiono, R 1996, A probabilistic approach to
feature selection - a filter solution, Proceedings of the
Thirteenth International Conference on Machine
Learning, Italy, pp. 319-327.
[5] Lisnianski, A, Frenkel, I & Ding, Y 2010, Multi-state
system reliability analysis and optimization for
engineers and industrial managers, Springer, New York.
[6] Lin, SW, Tseng, TY, Chou, SY & Chen, SC 2008, A
simulated-annealing-based approach for simultaneous
parameter optimization and feature selection of
back-propagation networks, Expert Systems with
Applications, vol. 34, no. 2, pp. 1491-1499.
[7] Meiri, R & Zahavi, J 2006, Using simulated annealing
to optimize the feature selection problem in marketing
applications, European Journal of Operational Research,
vol. 171, no. 3, pp. 842-858.
[8] Zhang, H & Sun, G 2002, Feature selection using tabu
search method, Pattern Recognition, vol. 35, no. 3,
pp. 701-711.
[9] Tahir, MA, Bouridane, A & Kurugollu, F 2007,
Simultaneous feature selection and feature weighting
using Hybrid Tabu Search/K-nearest neighbor classifier,
Pattern Recognition Letters, vol. 28, no. 4, pp. 438-446.
[10] Aghdam, MH, Ghasem-Aghaee, N & Basiri, ME 2009,
Text feature selection using ant colony optimization,
Expert Systems with Applications, vol. 36, no. 3,
pp. 6843-6853.
[11] Kanan, HR & Faez, K 2008, An improved feature
selection method based on ant colony optimization
(ACO) evaluated on face recognition system, Applied
Mathematics and Computation, vol. 205, no. 2,
pp. 716-725.
[12] Sivagaminathan, RK & Ramakrishnan, S 2007, A
hybrid approach for feature subset selection using neural
networks and ant colony optimization, Expert Systems
with Applications, vol. 33, no. 1, pp. 49-60.
[13] Sreeja, NK & Sankar, A 2015, Pattern Matching based
Classification using Ant Colony Optimization based
Feature Selection, Applied Soft Computing, vol. 31,
pp. 91-102.
[23] Lin, SW, Ying, KC, Chen, SC & Lee, ZJ 2008, Particle
swarm optimization for parameter determination and
feature selection of support vector machines, Expert
Systems with Applications, vol. 35, no. 4, pp. 1817-1824.
IJCATM : www.ijcaonline.org