Abstract
Variable importance measures and feature selection methods for classification and regression in data mining and Big Data make it possible to remove the noise introduced by irrelevant or redundant variables, to reduce the computational cost of model construction, and to make the resulting models easier to understand. This paper proposes a method for measuring the importance of the input variables in a classification/regression problem, taking as input the solutions evaluated by a wrapper together with the performance information associated with each of these solutions (quality of classification expressed, for example, as accuracy, precision, recall, or F-measure). The proposed method quantifies the effect on classification/regression performance produced by the presence or absence of each input variable in the subsets evaluated by the wrapper. This measure has the advantage of being specific to each classifier, which makes it possible to differentiate the effects that each input variable can generate depending on the model built. The proposed method was evaluated using the results of three wrappers - one based on genetic algorithms (GA), another on particle swarm optimization (PSO), and a new proposal based on covering arrays (CA) - and compared with two filters and with the variable importance measure of Random Forest. The experiments were performed on three classifiers (Naive Bayes, Random Forest, and Multi-Layer Perceptron) and seven data sets from the UCI repository. Comparisons were made using the Friedman Aligned Ranks test, and the results indicate that the proposed measure stands out for concentrating higher classification quality in the top-ranked input variables, approximating more closely the variables found by the feature selection methods.
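The core idea can be illustrated with a minimal sketch: given the binary subsets a wrapper has already evaluated and the performance score of each, estimate each variable's importance as the difference between the mean performance of the subsets that include it and the mean performance of those that exclude it. This is a hedged interpretation of "the effect of presence or absence", not the authors' exact formulation; the function name and the fallback for always/never-selected variables are assumptions for illustration.

```python
import numpy as np

def wrapper_variable_importance(subsets, scores):
    """Estimate per-variable importance from wrapper results.

    subsets: list of binary masks (1 = variable included) for each
             feature subset evaluated by the wrapper.
    scores:  classification/regression performance of each subset
             (e.g. accuracy or F-measure).
    Importance of variable j = mean score of subsets containing j
    minus mean score of subsets lacking j.
    """
    X = np.asarray(subsets, dtype=bool)   # rows: subsets, cols: variables
    s = np.asarray(scores, dtype=float)
    importance = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        included, excluded = s[X[:, j]], s[~X[:, j]]
        if included.size == 0 or excluded.size == 0:
            # Variable always (or never) selected: no contrast available.
            importance[j] = 0.0
        else:
            importance[j] = included.mean() - excluded.mean()
    return importance

# Toy example: four subsets over three variables and their accuracies.
subsets = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
scores = [0.90, 0.85, 0.70, 0.88]
print(wrapper_variable_importance(subsets, scores))
```

A positive value suggests the classifier tends to perform better when the variable is present; ranking variables by this value yields the ordering compared against the filters and the Random Forest importance in the experiments.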
© 2018 Springer Nature Switzerland AG
Cite this paper
Dorado, H., Cobos, C., Torres-Jimenez, J., Jimenez, D., Mendoza, M. (2018). A Proposal to Estimate the Variable Importance Measures in Predictive Models Using Results from a Wrapper. In: Groza, A., Prasath, R. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2018. Lecture Notes in Computer Science(), vol 11308. Springer, Cham. https://doi.org/10.1007/978-3-030-05918-7_33
Print ISBN: 978-3-030-05917-0
Online ISBN: 978-3-030-05918-7