
Feature Ranking from Random Forest Through Complex Network’s Centrality Measures

A Robust Ranking Method Without Using Out-of-Bag Examples

  • Conference paper
  • First Online:
Advances in Databases and Information Systems (ADBIS 2022)

Abstract

The volume of available data has increased rapidly in recent years and, as a consequence, datasets commonly end up with many irrelevant features. This excess may hinder human understanding and even lead to poor machine learning models. This research proposes a novel feature ranking method that employs the trees of a Random Forest to transform a dataset into a complex network, to which centrality measures are applied to rank the features. Each tree is represented as a graph in which every feature of the tree is a vertex, and each link between tree nodes (parent → child) is represented by a weighted edge between the two corresponding vertices. The union of the graphs from all individual trees yields the complex network, and three centrality measures are then applied to rank the features in it. Experiments were performed on eighty-five supervised classification datasets, with varying feature noise levels, to evaluate the novel method. Results show that centrality measures in non-oriented complex networks are comparable, and may be correlated, to the Random Forest's variable importance ranking algorithm. Vertex strength and eigenvector centrality outperformed the Random Forest on the datasets with 40% noise, although the difference was not statistically significant at the 95% confidence level.
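
As a minimal sketch of this pipeline, assuming scikit-learn and networkx: the helper forest_to_network below and all variable names are my own reading of the abstract's description, not the authors' implementation, and only the two centrality measures named above (vertex strength and eigenvector centrality) are computed.

import networkx as nx
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def forest_to_network(forest, feature_names):
    # Union of per-tree graphs: every feature used by a tree is a vertex,
    # and each parent -> child pair of split nodes adds weight 1 to the
    # edge between the two corresponding features (non-oriented network).
    G = nx.Graph()
    G.add_nodes_from(feature_names)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        for parent in range(tree.node_count):
            if tree.children_left[parent] == -1:  # leaf: no split feature
                continue
            f_parent = feature_names[tree.feature[parent]]
            for child in (tree.children_left[parent], tree.children_right[parent]):
                if tree.children_left[child] == -1:  # child is a leaf
                    continue
                f_child = feature_names[tree.feature[child]]
                if G.has_edge(f_parent, f_child):
                    G[f_parent][f_child]["weight"] += 1
                else:
                    G.add_edge(f_parent, f_child, weight=1)
    return G

X, y = load_iris(return_X_y=True)
names = ["f%d" % i for i in range(X.shape[1])]
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
G = forest_to_network(rf, names)

# Vertex strength is the weighted degree; each measure yields a feature ranking.
strength = dict(G.degree(weight="weight"))
eigen = nx.eigenvector_centrality_numpy(G, weight="weight")
print(sorted(names, key=strength.get, reverse=True))
print(sorted(names, key=eigen.get, reverse=True))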

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.


References

  1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47

  2. Baranauskas, J.A., Netto, O.P., Nozawa, S.R., Macedo, A.A.: A tree-based algorithm for attribute selection. Appl. Intell. 48(4), 821–833 (2017). https://doi.org/10.1007/s10489-017-1008-y

  3. Bertini, J.R., Zhao, L., Lopes, A.A.: An incremental learning algorithm based on the k-associated graph for non-stationary data classification. Inf. Sci. 246(Supplement C), 52–68 (2013). https://doi.org/10.1016/j.ins.2013.05.016

  4. Bonacich, P.: Power and centrality: a family of measures. Am. J. Sociol. 92(5), 1170–1182 (1987). https://doi.org/10.1086/228631

  5. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996). https://doi.org/10.1007/BF00058655

  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324

  7. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove (1984). https://doi.org/10.1201/9781315139470

  8. Cacciatore, S., Tenori, L., Luchinat, C., Bennett, P.R., MacIntyre, D.A.: KODAMA: an R package for knowledge discovery and data mining. Bioinformatics 33(4), 621–623 (2016). https://doi.org/10.1093/bioinformatics/btw705

  9. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40(1), 16–28 (2014). https://doi.org/10.1016/j.compeleceng.2013.11.024

  10. Costa, L.F., Rodrigues, F.A., Travieso, G., Boas, P.R.V.: Characterization of complex networks: a survey of measurements. Adv. Phys. 56(1), 167–242 (2007). https://doi.org/10.1080/00018730601170527

  11. Cupertino, T., Carneiro, M., Zheng, Q., Zhang, J., Zhao, L.: A scheme for high level data classification using random walk and network measures. Expert Syst. Appl. 9 (2017)

  12. Dunn, O.J.: Multiple comparisons among means. J. Am. Stat. Assoc. 56(293), 52–64 (1961). https://doi.org/10.1080/01621459.1961.10482090

  13. Erdős, P., Rényi, A.: On the evolution of random graphs. In: Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pp. 17–61 (1960)

  14. Ferreira, L.N., Zhao, L.: Time series clustering via community detection in networks. Inf. Sci. 326, 227–242 (2016). https://doi.org/10.1016/j.ins.2015.07.046

  15. Friedman, M.: A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 11(1), 86–92 (1940). https://doi.org/10.1214/aoms/1177731944

  16. Hashemi, A., Dowlatshahi, M.B., Nezamabadi-pour, H.: MGFS: a multi-label graph-based feature selection algorithm via PageRank centrality. Expert Syst. Appl. 142 (2020). https://doi.org/10.1016/j.eswa.2019.113024

  17. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. Springer, New York (2009). http://www-stat.stanford.edu/~tibs/ElemStatLearn/

  18. Khoshgoftaar, T.M., Hulse, J.V.: Empirical case studies in attribute noise detection. IEEE Trans. Syst. Man Cybern. Part C (Applications and Reviews) 39(4), 379–388 (2009). https://doi.org/10.1109/TSMCC.2009.2013815

  19. Leisch, F., Dimitriadou, E.: mlbench: Machine Learning Benchmark Problems. R package version 2.1-1 (2010)

  20. Li, J., et al.: Feature selection: a data perspective. ACM Comput. Surv. 50(6) (2017). https://doi.org/10.1145/3136625

  21. Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Advances in Neural Information Processing Systems, vol. 26, pp. 431–439. Curran Associates, Inc. (2013). https://doi.org/10.5555/2999611.2999660

  22. Ma, Y., Guo, L., Cukic, B.: A statistical framework for the prediction of fault-proneness. In: Advances in Machine Learning Applications in Software Engineering, pp. 237–265. IGI Global (2006). https://doi.org/10.4018/978-1-59140-941-1.ch010

  23. Mairal, J., Yu, B.: Supervised feature selection in graphs with path coding penalties and network flows. J. Mach. Learn. Res. 14(39), 2449–2485 (2013). http://jmlr.org/papers/v14/mairal13a

  24. Metz, J., et al.: Redes complexas: conceitos e aplicações [Complex networks: concepts and applications]. Tech. Rep. 290, ICMC-USP (January 2007). http://repositorio.icmc.usp.br//handle/RIICMC/6720

  25. Miao, J., Niu, L.: A survey on feature selection. Procedia Comput. Sci. 91, 919–926 (2016)

  26. Moradi, P., Rostami, M.: A graph theoretic approach for unsupervised feature selection. Eng. Appl. Artif. Intell. 44, 33–45 (2015). https://doi.org/10.1016/j.engappai.2015.05.005

  27. Neto, F.A., Zhao, L.: Random walk in feature-sample networks for semi-supervised classification. In: Brazilian Conference on Intelligent Systems (BRACIS), pp. 1–6 (2016). https://doi.org/10.1109/BRACIS.2016.41

  28. Ni, B., Yan, S., Kassim, A.: Learning a propagable graph for semisupervised learning: classification and regression. IEEE Trans. Knowl. Data Eng. 24(1), 114–126 (2012). https://doi.org/10.1109/TKDE.2010.209

  29. Oshiro, T.M., Perez, P.S., Baranauskas, J.A.: How many trees in a random forest? In: Perner, P. (ed.) MLDM 2012. LNCS (LNAI), vol. 7376, pp. 154–168. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31537-4_13

  30. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  31. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993). https://doi.org/10.1007/BF00993309

  32. Rathkopf, C.: Network representation and complex systems. Synthese 195(1), 55–78 (2015). https://doi.org/10.1007/s11229-015-0726-0

  33. Silva, T.C., Zhao, L.: Machine Learning in Complex Networks. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-17290-3

  34. Venkatesh, B., Anuradha, J.: A review of feature selection and its methods. Cybern. Inf. Technol. 19(1), 3–26 (2019). https://doi.org/10.2478/cait-2019-0001

  35. Zhang, Z., Hancock, E.R.: A graph-based approach to feature selection. In: Graph-Based Representations in Pattern Recognition, pp. 205–214. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20844-7_21

  36. Zhao, Y., Zhang, Y.: Comparison of decision tree methods for finding active objects. Adv. Space Res. 41(12), 1955–1959 (2008). https://doi.org/10.1016/j.asr.2007.07.020

  37. Zheng, W., Zhu, X., Zhu, Y., Hu, R., Lei, C.: Dynamic graph learning for spectral feature selection. Multimed. Tools Appl. 77(22), 29739–29755 (2017). https://doi.org/10.1007/s11042-017-5272-y

  38. Zhu, Y., Zhong, Z., Cao, W., Cheng, D.: Graph feature selection for dementia diagnosis. Neurocomputing 195, 19–22 (2016). https://doi.org/10.1016/j.neucom.2015.09.126

Author information

Corresponding authors

Correspondence to Adriano Henrique Cantão or José Augusto Baranauskas.

A Artificial Dataset Generators

Datasets were generated by the packages Scikit-learn [30], MLBench [19], and KODAMA [8].

Table 1. Description of the artificial dataset generators.
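
As an illustration of such a generator, Scikit-learn's make_classification mixes informative, redundant, and purely random features; the parameter values below are hypothetical and do not reproduce the settings described in Table 1.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,      # number of examples
    n_features=20,       # total number of features
    n_informative=8,     # truly relevant features
    n_redundant=4,       # linear combinations of the informative ones
    n_repeated=0,        # duplicated features
    random_state=42,     # reproducibility
)
# The remaining 8 features (20 - 8 - 4 - 0) are drawn at random, which is
# one way the level of feature noise can be controlled in experiments.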

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Cantão, A.H., Macedo, A.A., Zhao, L., Baranauskas, J.A. (2022). Feature Ranking from Random Forest Through Complex Network’s Centrality Measures. In: Chiusano, S., Cerquitelli, T., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2022. Lecture Notes in Computer Science, vol 13389. Springer, Cham. https://doi.org/10.1007/978-3-031-15740-0_24

  • DOI: https://doi.org/10.1007/978-3-031-15740-0_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15739-4

  • Online ISBN: 978-3-031-15740-0

  • eBook Packages: Computer Science, Computer Science (R0)
