Separating the wheat from the chaff: on feature selection and feature importance in regression random forests and symbolic regression

Published: 12 July 2011

Abstract

Feature selection in high-dimensional data sets is an open problem with no universally satisfactory method available. In this paper we discuss the requirements for such a method with respect to the various aspects of feature importance, and explore them using regression random forests and symbolic regression. We study 'conventional' feature selection with both methods on several test problems and a case study, compare the results, and identify the conceptual differences in the generated feature importances.
We demonstrate that random forests may overlook important variables (ones significantly related to the response) for various reasons, while symbolic regression identifies all important variables provided that models of sufficient quality are found. We explain these results by the fact that the variable importances produced by the two methods have different semantics.
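The masking behavior described above is easy to reproduce. The sketch below is not taken from the paper (whose experiments use regression random forests and DataModeler symbolic regression); it is a minimal illustration, assuming scikit-learn and synthetic data, of how permutation importance in a random forest splits credit between two near-duplicate predictors, so that each can look less important than it really is:

```python
# Minimal sketch (not the paper's setup): permutation variable importance
# in a regression random forest, using scikit-learn and synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                   # the true driver of y
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)                   # irrelevant noise
y = x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, x3])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

# Typically x1 and x2 split the credit between them (the drop in R^2 from
# permuting either one is dampened because the other near-duplicate still
# carries the signal), while the noise feature x3 scores near zero.
print(imp.importances_mean)
```

Neither x1 nor x2 is truly unimportant here, yet a simple threshold on forest importances could discard one of them; this is the kind of semantic difference in variable importance that the paper examines.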



      Published In

      GECCO '11: Proceedings of the 13th annual conference companion on Genetic and evolutionary computation
      July 2011, 1548 pages
      ISBN: 9781450306904
      DOI: 10.1145/2001858
      Publisher: Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. feature selection
      2. genetic programming
      3. random forests
      4. symbolic regression
      5. variable importance
      6. variable selection

      Qualifiers

      • Tutorial

      Acceptance Rates

      Overall Acceptance Rate 1,669 of 4,410 submissions, 38%


      Cited By

      • (2024) Learning Effective Good Variables from Physical Data. Machine Learning and Knowledge Extraction, 6(3):1597-1618. DOI: 10.3390/make6030077. Online publication date: 12-Jul-2024
      • (2024) Bankruptcy Prediction Using a GAN-based Data Augmentation Hybrid Model. Generative AI: Current Trends and Applications, pages 407-426. DOI: 10.1007/978-981-97-8460-8_19. Online publication date: 10-Dec-2024
      • (2023) Genetic Algorithm and GAN based Hybrid Model for Bankruptcy Prediction. 2023 OITS International Conference on Information Technology (OCIT), pages 473-478. DOI: 10.1109/OCIT59427.2023.10430554. Online publication date: 13-Dec-2023
      • (2023) Development of machine learning models to predict cancer-related fatigue in Dutch breast cancer survivors up to 15 years after diagnosis. Journal of Cancer Survivorship. DOI: 10.1007/s11764-023-01491-1. Online publication date: 7-Dec-2023
      • (2022) Bacteremia detection from complete blood count and differential leukocyte count with machine learning: complementary and competitive with C-reactive protein and procalcitonin tests. BMC Infectious Diseases, 22(1). DOI: 10.1186/s12879-022-07223-7. Online publication date: 26-Mar-2022
      • (2022) Spatial and temporal dynamics of Mexican spotted owl habitat in the southwestern US. Landscape Ecology, 38(1):23-37. DOI: 10.1007/s10980-022-01418-8. Online publication date: 21-Nov-2022
      • (2022) Ensemble Models Using Symbolic Regression and Genetic Programming for Uncertainty Estimation in ESG and Alternative Investments. Big Data in Finance, pages 69-91. DOI: 10.1007/978-3-031-12240-8_5. Online publication date: 4-Oct-2022
      • (2022) Distilling Financial Models by Symbolic Regression. Machine Learning, Optimization, and Data Science, pages 502-517. DOI: 10.1007/978-3-030-95470-3_38. Online publication date: 2-Feb-2022
      • (2022) Battery Materials Discovery and Smart Grid Management using Machine Learning. Batteries & Supercaps, 5(11). DOI: 10.1002/batt.202200309. Online publication date: 15-Sep-2022
      • (2021) Predicting the Non-Linear Conveying Behavior in Single-Screw Extrusion: A Comparison of Various Data-Based Modeling Approaches used with CFD Simulations. International Polymer Processing, 36(5):529-544. DOI: 10.1515/ipp-2020-4094. Online publication date: 16-Nov-2021
