Research Article · DOI: 10.1145/3638529.3654075

Bias-Variance Decomposition: An Effective Tool to Improve Generalization of Genetic Programming-based Evolutionary Feature Construction for Regression

Published: 14 July 2024

Abstract

Evolutionary feature construction is a technique that has been widely studied in automated machine learning. A key challenge in feature construction is its tendency to overfit the training data. Instead of the traditional approach of controlling overfitting by reducing model complexity, this paper proposes controlling overfitting based on bias-variance decomposition. Specifically, the paper reduces a model's variance, i.e., the variance of its predictions on data with injected noise, to improve generalization within a multi-objective optimization framework. Experiments on 42 datasets demonstrate that the proposed method effectively controls overfitting and outperforms six model-complexity measures for overfitting control. Further analysis reveals that controlling overfitting in adherence to bias-variance decomposition outperforms several plausible variants, highlighting the importance of grounding overfitting control in solid machine learning theory.
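The variance term targeted here comes from the classical decomposition of expected squared error, E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ², where σ² is the irreducible noise. As a rough illustration of how prediction variance under input-noise injection could be estimated and paired with training error as a second objective, consider the following minimal sketch. This is not the authors' implementation: the function names, the sklearn-style predict interface, the Gaussian noise scale, and the number of noise repeats are all illustrative assumptions.

```python
import numpy as np

def variance_under_noise(model, X, noise_scale=0.1, n_repeats=10, rng=None):
    """Estimate the variance of a model's predictions when Gaussian noise
    is injected into the inputs. Illustrative sketch only; the paper's
    exact noise scheme and estimator may differ."""
    rng = np.random.default_rng(rng)
    # Predict on several independently perturbed copies of the inputs.
    preds = np.stack([
        model.predict(X + rng.normal(0.0, noise_scale, size=X.shape))
        for _ in range(n_repeats)
    ])  # shape: (n_repeats, n_samples)
    # Variance across noise realizations, averaged over samples.
    return preds.var(axis=0).mean()

def objectives(model, X, y):
    """Two objectives for a multi-objective EA such as NSGA-II:
    minimize training error and minimize prediction variance."""
    mse = np.mean((y - model.predict(X)) ** 2)
    return mse, variance_under_noise(model, X)
```

In a multi-objective framework such as NSGA-II, minimizing these two objectives jointly puts selection pressure on constructed features whose predictions are both accurate and stable under perturbation, which is the sense in which variance reduction serves as an overfitting control.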

Supplemental Material

Supplementary material is available as a PDF file.



    Published In

    GECCO '24: Proceedings of the Genetic and Evolutionary Computation Conference
    July 2024, 1657 pages
    ISBN: 9798400704949
    DOI: 10.1145/3638529

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. genetic programming
    2. bias-variance decomposition
    3. automated machine learning
    4. evolutionary feature construction


    Conference

    GECCO '24: Genetic and Evolutionary Computation Conference
    July 14-18, 2024
    Melbourne, VIC, Australia

    Acceptance Rates

    Overall acceptance rate: 1,669 of 4,410 submissions (38%)
