Abstract
The volume of available data has grown rapidly in recent years, and as a consequence datasets commonly contain many irrelevant features. Those features can hinder human understanding and degrade machine learning models. This research proposes a novel feature ranking method that uses the trees of a Random Forest to transform a dataset into a complex network, to which centrality measures are applied to rank the features. Each tree is represented as a graph in which every feature used by the tree is a vertex, and each parent \(\rightarrow\) child link in the tree becomes a weighted edge between the corresponding vertices. The union of the graphs of all individual trees forms the complex network, and three centrality measures are then applied to rank the features. To evaluate the method, experiments were performed on eighty-five supervised classification datasets with varying levels of feature noise. Results show that centrality measures on non-oriented complex networks are comparable to, and may be correlated with, the Random Forest's variable importance ranking. Vertex strength and eigenvector centrality outperformed the Random Forest on datasets with 40% noise, with results that were not statistically different at a 95% confidence level.
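The construction described above (trees as graphs of parent-to-child feature links, merged into one network that is then ranked by centrality) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's `RandomForestClassifier` and `networkx`, and the function name `forest_to_network` and the choice of dataset are illustrative.

```python
# Sketch of the abstract's idea: merge every tree's parent -> child feature
# links into one weighted, non-oriented network, then rank features by
# centrality (here vertex strength, i.e. weighted degree).
from collections import defaultdict

import networkx as nx
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def forest_to_network(forest):
    """Union of all trees' graphs: features are vertices, edge weights count
    how often two features appear as a parent/child splitting pair."""
    weights = defaultdict(int)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        for parent in range(tree.node_count):
            f_parent = tree.feature[parent]
            if f_parent < 0:  # leaf node: no splitting feature
                continue
            for child in (tree.children_left[parent], tree.children_right[parent]):
                f_child = tree.feature[child]
                if f_child >= 0 and f_child != f_parent:
                    key = tuple(sorted((f_parent, f_child)))
                    weights[key] += 1
    g = nx.Graph()
    for (u, v), w in weights.items():
        g.add_edge(u, v, weight=w)
    return g


X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
net = forest_to_network(rf)

# Vertex strength = weighted degree; a higher strength suggests a feature
# that splits the data more often alongside other features.
strength = dict(net.degree(weight="weight"))
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

Other centrality measures (e.g. `nx.eigenvector_centrality(net, weight="weight")`) can be substituted for vertex strength to obtain the alternative rankings the paper compares.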
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Cantão, A.H., Macedo, A.A., Zhao, L., Baranauskas, J.A. (2022). Feature Ranking from Random Forest Through Complex Network’s Centrality Measures. In: Chiusano, S., Cerquitelli, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2022. Lecture Notes in Computer Science, vol 13389. Springer, Cham. https://doi.org/10.1007/978-3-031-15740-0_24
DOI: https://doi.org/10.1007/978-3-031-15740-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15739-4
Online ISBN: 978-3-031-15740-0