
Feature Ranking from Random Forest Through Complex Network’s Centrality Measures

A Robust Ranking Method Without Using Out-of-Bag Examples

  • Conference paper
  • First Online:
Advances in Databases and Information Systems (ADBIS 2022)

Abstract

The volume of available data has increased rapidly in recent years and, as a consequence, datasets commonly end up with many irrelevant features. This excess may hinder human understanding and even lead to poor machine learning models. This research proposes a novel feature ranking method that employs the trees of a Random Forest to transform a dataset into a complex network, to which centrality measures are applied to rank the features. Each tree is represented as a graph in which every feature of the tree is a vertex, and each link between tree nodes (parent → child) is represented by a weighted edge between the two corresponding vertices. The union of the graphs from all individual trees yields the complex network, and three centrality measures are then applied to rank the features in it. Experiments were performed on eighty-five supervised classification datasets, with varying feature noise levels, to evaluate the novel method. Results show that centrality measures in non-oriented complex networks are comparable, and may be correlated, to the Random Forest's variable importance ranking algorithm. Vertex strength and eigenvector centrality outperformed the Random Forest on the datasets with 40% noise, although the difference was not statistically significant at the 95% confidence level.
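
As a minimal sketch of this pipeline, assuming scikit-learn and networkx: the helper forest_to_network below and all variable names are my own reading of the abstract's description, not the authors' implementation, and only the two centrality measures named above (vertex strength and eigenvector centrality) are computed.

import networkx as nx
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def forest_to_network(forest, feature_names):
    # Union of per-tree graphs: every feature used by a tree is a vertex,
    # and each parent -> child pair of split nodes adds weight 1 to the
    # edge between the two corresponding features (non-oriented network).
    G = nx.Graph()
    G.add_nodes_from(feature_names)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        for parent in range(tree.node_count):
            if tree.children_left[parent] == -1:  # leaf: no split feature
                continue
            f_parent = feature_names[tree.feature[parent]]
            for child in (tree.children_left[parent], tree.children_right[parent]):
                if tree.children_left[child] == -1:  # child is a leaf
                    continue
                f_child = feature_names[tree.feature[child]]
                if G.has_edge(f_parent, f_child):
                    G[f_parent][f_child]["weight"] += 1
                else:
                    G.add_edge(f_parent, f_child, weight=1)
    return G

X, y = load_iris(return_X_y=True)
names = ["f%d" % i for i in range(X.shape[1])]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
G = forest_to_network(rf, names)

# Vertex strength is the weighted degree; each measure yields a feature ranking.
strength = dict(G.degree(weight="weight"))
eigen = nx.eigenvector_centrality_numpy(G, weight="weight")
print(sorted(names, key=strength.get, reverse=True))
print(sorted(names, key=eigen.get, reverse=True))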

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.


References

  1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47

  2. Baranauskas, J.A., Netto, O.P., Nozawa, S.R., Macedo, A.A.: A tree-based algorithm for attribute selection. Appl. Intell. 48(4), 821–833 (2017). https://doi.org/10.1007/s10489-017-1008-y

  3. Bertini, J.R., Zhao, L., Lopes, A.A.: An incremental learning algorithm based on the k-associated graph for non-stationary data classification. Inf. Sci. 246(Supplement C), 52–68 (2013). https://doi.org/10.1016/j.ins.2013.05.016

  4. Bonacich, P.: Power and centrality: a family of measures. Am. J. Sociol. 92(5), 1170–1182 (1987). https://doi.org/10.1086/228631

  5. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996). https://doi.org/10.1007/BF00058655

  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  7. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove (1984). https://doi.org/10.1201/9781315139470

  8. Cacciatore, S., Tenori, L., Luchinat, C., Bennett, P.R., MacIntyre, D.A.: KODAMA: an R package for knowledge discovery and data mining. Bioinformatics 33(4), 621–623 (2016). https://doi.org/10.1093/bioinformatics/btw705

  9. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40(1), 16–28 (2014). https://doi.org/10.1016/j.compeleceng.2013.11.024

  10. Costa, L.F., Rodrigues, F.A., Travieso, G., Boas, P.R.V.: Characterization of complex networks: a survey of measurements. Adv. Phys. 56(1), 167–242 (2007). https://doi.org/10.1080/00018730601170527

  11. Cupertino, T., Carneiro, M., Zheng, Q., Zhang, J., Zhao, L.: A scheme for high level data classification using random walk and network measures. Expert Syst. Appl. 9 (2017)

  12. Dunn, O.J.: Multiple comparisons among means. J. Am. Stat. Assoc. 56(293), 52–64 (1961). https://doi.org/10.1080/01621459.1961.10482090

  13. Erdős, P., Rényi, A.: On the evolution of random graphs. In: Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pp. 17–61 (1960)

  14. Ferreira, L.N., Zhao, L.: Time series clustering via community detection in networks. Inf. Sci. 326, 227–242 (2016). https://doi.org/10.1016/j.ins.2015.07.046

  15. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 11(1), 86–92 (1940). https://doi.org/10.1214/aoms/1177731944

  16. Hashemi, A., Dowlatshahi, M.B., Nezamabadi-pour, H.: MGFS: a multi-label graph-based feature selection algorithm via PageRank centrality. Expert Syst. Appl. 142 (2020). https://doi.org/10.1016/j.eswa.2019.113024

  17. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. Springer, New York (2009). http://www-stat.stanford.edu/~tibs/ElemStatLearn/

  18. Khoshgoftaar, T.M., Hulse, J.V.: Empirical case studies in attribute noise detection. IEEE Trans. Syst. Man Cybern. Part C (Applications and Reviews) 39(4), 379–388 (2009). https://doi.org/10.1109/TSMCC.2009.2013815

  19. Leisch, F., Dimitriadou, E.: mlbench: Machine Learning Benchmark Problems. R package version 2.1-1 (2010)

  20. Li, J., et al.: Feature selection: a data perspective. ACM Comput. Surv. 50(6) (2017). https://doi.org/10.1145/3136625

  21. Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems, vol. 26, pp. 431–439. Curran Associates, Inc. (2013). https://doi.org/10.5555/2999611.2999660

  22. Ma, Y., Guo, L., Cukic, B.: A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering, pp. 237–265. IGI Global (2006). https://doi.org/10.4018/978-1-59140-941-1.ch010

  23. Mairal, J., Yu, B.: Supervised feature selection in graphs with path coding penalties and network flows. J. Mach. Learn. Res. 14(39), 2449–2485 (2013). http://jmlr.org/papers/v14/mairal13a

  24. Metz, J., et al.: Redes complexas: conceitos e aplicações [Complex networks: concepts and applications]. Tech. Rep. 290, ICMC-USP (January 2007). http://repositorio.icmc.usp.br//handle/RIICMC/6720

  25. Miao, J., Niu, L.: A survey on feature selection. Procedia Comput. Sci. 91, 919–926 (2016)

  26. Moradi, P., Rostami, M.: A graph theoretic approach for unsupervised feature selection. Eng. Appl. Artif. Intell. 44, 33–45 (2015). https://doi.org/10.1016/j.engappai.2015.05.005

  27. Neto, F.A., Zhao, L.: Random walk in feature-sample networks for semi-supervised classification. In: Brazilian Conference on Intelligent Systems (BRACIS), pp. 1–6 (2016). https://doi.org/10.1109/BRACIS.2016.41

  28. Ni, B., Yan, S., Kassim, A.: Learning a propagable graph for semisupervised learning: classification and regression. IEEE Trans. Knowl. Data Eng. 24(1), 114–126 (2012). https://doi.org/10.1109/TKDE.2010.209

  29. Oshiro, T.M., Perez, P.S., Baranauskas, J.A.: How many trees in a random forest? In: Perner, P. (ed.) MLDM 2012. LNCS (LNAI), vol. 7376, pp. 154–168. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31537-4_13

  30. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  31. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993). https://doi.org/10.1007/BF00993309

  32. Rathkopf, C.: Network representation and complex systems. Synthese 195(1), 55–78 (2015). https://doi.org/10.1007/s11229-015-0726-0

  33. Silva, T.C., Zhao, L.: Machine Learning in Complex Networks. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-17290-3

  34. Venkatesh, B., Anuradha, J.: A review of feature selection and its methods. Cybern. Inf. Technol. 19(1), 3–26 (2019). https://doi.org/10.2478/cait-2019-0001

  35. Zhang, Z., Hancock, E.R.: A graph-based approach to feature selection. In: Graph-Based Representations in Pattern Recognition, pp. 205–214. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20844-7_21

  36. Zhao, Y., Zhang, Y.: Comparison of decision tree methods for finding active objects. Adv. Space Res. 41(12), 1955–1959 (2008). https://doi.org/10.1016/j.asr.2007.07.020

  37. Zheng, W., Zhu, X., Zhu, Y., Hu, R., Lei, C.: Dynamic graph learning for spectral feature selection. Multimed. Tools Appl. 77(22), 29739–29755 (2017). https://doi.org/10.1007/s11042-017-5272-y

  38. Zhu, Y., Zhong, Z., Cao, W., Cheng, D.: Graph feature selection for dementia diagnosis. Neurocomputing 195, 19–22 (2016). https://doi.org/10.1016/j.neucom.2015.09.126

Author information

Corresponding authors

Correspondence to Adriano Henrique Cantão or José Augusto Baranauskas.

A Artificial Dataset Generators

Datasets were generated by the packages Scikit-learn [30], MLBench [19], and KODAMA [8].

Table 1. Description of the artificial dataset generators.
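
As an illustration of such a generator, Scikit-learn's make_classification mixes informative, redundant, and purely random features; the parameter values below are hypothetical and do not reproduce the settings described in Table 1.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,      # number of examples
    n_features=20,       # total number of features
    n_informative=8,     # truly relevant features
    n_redundant=4,       # linear combinations of the informative ones
    n_repeated=0,        # duplicated features
    random_state=42,     # reproducibility
)
# The remaining 8 features (20 - 8 - 4 - 0) are drawn at random, which is
# one way the level of feature noise can be controlled in experiments.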

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Cantão, A.H., Macedo, A.A., Zhao, L., Baranauskas, J.A. (2022). Feature Ranking from Random Forest Through Complex Network’s Centrality Measures. In: Chiusano, S., Cerquitelli, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2022. Lecture Notes in Computer Science, vol 13389. Springer, Cham. https://doi.org/10.1007/978-3-031-15740-0_24

  • DOI: https://doi.org/10.1007/978-3-031-15740-0_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15739-4

  • Online ISBN: 978-3-031-15740-0

  • eBook Packages: Computer Science, Computer Science (R0)
