Evolutionary Symbolic Regression from a Probabilistic Perspective

  • Original Research
  • SN Computer Science

Abstract

We examine the genetic evolution-based algorithm for symbolic regression from a probabilistic dynamical perspective. This approach permits us to follow the evolution of the search candidate functions from generation to generation as they improve their fitness and finally converge to the best function that matches a given data set. In particular, we use this statistical framework to explore the optimal external parameters that govern a special mutation operator, which can systematically improve the numerical value of constants contained in each candidate formula of the search space. We then apply symbolic regression to the chaotic logistic map and the Lorenz system.

Acknowledgements

We would like to thank Profs. B.K. Clark, X. Fang, Z.L. Li, Y.J. Li and G.H. Rutherford, and G. Jacob, Z. Smozhanyk, and T. Walsh for many helpful discussions and suggestions. This work has been supported by the NSF. C.G. would like to thank ILP for the nice hospitality during his visit to Illinois State University and acknowledges the China Scholarship Council program for his PhD research. We also acknowledge access to the HPC cluster provided by Illinois State University.

Author information

Corresponding author

Correspondence to Qichang Su.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

(A) \(P_{m}(t)\) for a model system of M classes

To get some first insight into the time scales of the temporal evolution of the proportions \(P_{m}(t)\) of each class under the tournament selection, we present in this appendix an oversimplified system of M classes, where the corresponding initial fitness densities \(\rho _{m}(f,t=0)\) are so narrow that they do not overlap with each other. This permits us to derive a universal iteration scheme, where the proportions \(P_{m}(t)\) can be computed directly from the set of \(P_{m}(t=0)\) without specifying the shape of \(\rho _{m}(f,t=0)\).

In general, the new set of proportions \(P_{m}(t+1)\) after the application of all \(N_\mathrm{pop}\) tournaments can be obtained via the expression

$$\begin{aligned} P_{m}(t+1)&=\int \limits _{0}^{\infty }\text {d}fP_{m}(t)\rho _{m}(f,t)n_{T}\nonumber \\&\quad \left[ 1-\sum _{m'=1}^{M}P_{m'}(t)\int \limits _{0}^{f}\text {d}f'\rho _{m'}(f',t)\right] ^{n_{T}-1} \end{aligned}$$
(14)
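Eq. (14) can be evaluated numerically once a shape for the densities is chosen. The following is a minimal Python sketch, not the authors' implementation: it assumes uniform densities \(\rho _{m}\) on user-supplied intervals (an illustrative choice, since the text does not fix a shape) and integrates with the midpoint rule. The names `uniform_cdf`, `next_proportions` and `n_grid` are hypothetical. The bracket in Eq. (14) is the probability that the other \(n_{T}-1\) tournament participants have a fitness above f, so the candidate with the lowest fitness value wins.

```python
def uniform_cdf(f, lo, hi):
    """Cumulative distribution of a uniform density on [lo, hi]."""
    if f <= lo:
        return 0.0
    if f >= hi:
        return 1.0
    return (f - lo) / (hi - lo)


def next_proportions(P, intervals, n_T, n_grid=5000):
    """Evaluate the right-hand side of Eq. (14) with the midpoint rule.

    P         -- current proportions P_m(t), assumed to sum to 1
    intervals -- (lo, hi) support of each density rho_m (uniform shape assumed)
    n_T       -- tournament size
    n_grid    -- number of integration cells per class (hypothetical parameter)
    """
    P_next = []
    for P_m, (lo, hi) in zip(P, intervals):
        h = (hi - lo) / n_grid
        rho_m = 1.0 / (hi - lo)  # uniform density inside its support
        total = 0.0
        for k in range(n_grid):
            f = lo + (k + 0.5) * h
            # Fraction of the whole population with fitness below f
            below = sum(P_j * uniform_cdf(f, a, b)
                        for P_j, (a, b) in zip(P, intervals))
            total += P_m * rho_m * n_T * (1.0 - below) ** (n_T - 1)
        P_next.append(total * h)
    return P_next


# Three classes whose densities overlap pairwise
P = [0.5, 0.3, 0.2]
overlapping = [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]
print(next_proportions(P, overlapping, n_T=3))
```

Because the integrand summed over all classes is a total derivative of \(-[1-F(f)]^{n_{T}}\), with F the cumulative fitness distribution of the whole population, the returned proportions should sum to 1 up to the integration error, whether or not the densities overlap.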

The assumption of non-overlapping densities means that we can assign each class a unique mean fitness value, defined as \(f_{m}\equiv \int _{0}^{\infty }\text {d}f'\ f'\rho _{m}(f',t)\). This permits us to order the class labels such that the associated mean fitness increases with the label m, i.e., \(f_{m}<f_{m+1}\). If we further assume that each \(\rho _{m}(f,t)\) is non-zero only in the interval \([f_{m}-\varDelta f/2, f_{m}+\varDelta f/2]\), then the integration range of the first integral \(\int _{0}^{\infty }\text {d}f\) of Eq. (14) can be restricted to \(\int _{f_{m}-\varDelta f/2}^{f_{m}+\varDelta f/2}\text {d}f\). This means that the upper limit f of the second integral \(\int _{0}^{f}\text {d}f'\) is at most \(f_{m}+\varDelta f/2\). As a result, some of the integrals in \(S(f,t)=n_{T}[1-\sum _{m'=1}^{M}P_{m'}(t)\int _{0}^{f}\text {d}f'\rho _{m'} (f',t)]^{n_{T}-1}\) can be evaluated in closed form, which simplifies the expression significantly. The densities \(\rho _{m'}(f',t)\) of the classes with a lower fitness, \(m'<m\), are integrated over their entire extent, and we can use \(\int _{f_{m'}-\varDelta f/2}^{f_{m'}+\varDelta f/2}\text {d}f'\rho _{m'}(f',t)=1\). The densities of the classes with a higher fitness, \(m'>m\), lie entirely above f and do not contribute. As a result, we obtain \(\sum _{m'=1}^{M}P_{m'}(t)\int _{0}^{f}\text {d}f'\rho _{m'}(f',t) =\sum _{m'=1}^{m-1}P_{m'}(t)+P_{m}(t)\int _{0}^{f}\text {d}f'\rho _{m}(f',t)\). This permits us to represent the entire integrand \(P_{m}(t)\rho _{m}(f,t)S(f,t)\) as a total derivative and we obtain

$$\begin{aligned} P_{m}(t+1)&=\int \limits _{f_{m}-\varDelta f/2}^{f_{m}+\varDelta f/2}\text {d}f\,P_{m}(t) \rho _{m}(f,t)n_{T}\nonumber \\&\quad \left[ 1-\sum _{m'=1}^{m-1}P_{m'}(t)-\int \limits _{0}^{f}\text {d}f'P_{m}(t) \rho _{m}(f',t)\right] ^{n_{T}-1} \nonumber \\&=-\int \limits _{f_{m}-\varDelta f/2}^{f_{m}+\varDelta f/2}\text {d}f\,\frac{\text {d}}{\text {d}f}\nonumber \\&\quad \left[ 1-\sum \limits _{m'=1}^{m-1}P_{m'}(t)-\int \limits _{0}^{f}\text {d}f'P_{m} (t)\rho _{m}(f',t)\right] ^{n_{T}}\nonumber \\&=-\left[ 1-\sum \limits _{m'=1}^{m-1}P_{m'}(t)-\int \limits _{0}^{f}\text {d}f'P_{m}(t) \rho _{m}(f',t)\right] ^{n_{T}}\nonumber \\&\qquad \Bigg |_{f=f_{m} -\varDelta f/2}^{f=f_{m}+\varDelta f/2}\nonumber \\&=\left[ 1-\sum _{m'=1}^{m-1}P_{m'}(t)\right] ^{n_{T}} -\left[ 1-\sum _{m'=1}^{m}P_{m'}(t)\right] ^{n_{T}} \end{aligned}$$
(15)
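The closed form in Eq. (15) can be checked against a direct simulation of the tournaments. The sketch below is an illustration under stated assumptions, not the authors' code: since the densities do not overlap, comparing fitness values is equivalent to comparing class labels, so each tournament draws \(n_{T}\) class labels with probabilities \(P_{m}(t)\) (sampling with replacement from an effectively infinite population) and the lowest label wins. The function names `eq15` and `simulate_tournaments` are hypothetical.

```python
import random


def eq15(P, n_T):
    """Right-hand side of Eq. (15) for every class m = 1, ..., M."""
    out, cum = [], 0.0
    for P_m in P:
        before = max(0.0, 1.0 - cum) ** n_T  # [1 - sum_{m' <= m-1} P_m']^{n_T}
        cum += P_m
        after = max(0.0, 1.0 - cum) ** n_T   # [1 - sum_{m' <= m} P_m']^{n_T}
        out.append(before - after)
    return out


def simulate_tournaments(P, n_T, n_pop, seed=0):
    """Run n_pop tournaments of size n_T; the lowest class label wins."""
    rng = random.Random(seed)
    labels = list(range(len(P)))
    # Draw all participants at once, then split them into groups of n_T
    draws = rng.choices(labels, weights=P, k=n_pop * n_T)
    counts = [0] * len(P)
    for i in range(0, len(draws), n_T):
        counts[min(draws[i:i + n_T])] += 1
    return [c / n_pop for c in counts]


P = [0.2, 0.5, 0.3]
print("Eq. (15):  ", eq15(P, n_T=3))
print("simulation:", simulate_tournaments(P, n_T=3, n_pop=200_000))
```

The two printed lists should agree up to statistical fluctuations, which shrink with the square root of the number of simulated tournaments.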

If we expand this set of equations, we obtain the sequences of mutually coupled iterations

$$\begin{aligned} P_{1}(t+1)= & {} 1-[1-P_{1}(t)]^{n_{T}} \nonumber \\ P_{2}(t+1)= & {} [1-P_{1}(t)]^{n_{T}} -[1-P_{1}(t)-P_{2}(t)]^{n_{T}}\nonumber \\ P_{3}(t+1)= & {} [1-P_{1}(t)-P_{2}(t)]^{n_{T}} -[1-P_{1}(t)-P_{2}(t)-P_{3}(t)]^{n_{T}}\nonumber \\&\ldots \nonumber \\ P_{M}(t+1)= & {} [1-P_{1}(t)-P_{2}(t)-P_{3}(t) -\cdots -P_{M-1}(t)]^{n_{T}} \end{aligned}$$
(16)

This means that we have obtained an iteration scheme that computes the class proportions \(P_{m}(t+1)\) of the next generation solely from those proportions \(P_{m'}(t)\) of the current generation that belong to classes with a lesser or equal fitness, \(m'\le m\). One can easily convince oneself that the norm is preserved by this set of maps, i.e., \(\sum _{m=1}^{M}P_{m}(t+1)=\sum _{m=1}^{M} P_{m}(t)=1\), because the right-hand sides of Eq. (16) form a telescoping sum.
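The coupled maps of Eq. (16) translate into a short loop over the classes. The following Python sketch is one possible way to write it; the helper names `step` and `evolve` are hypothetical. It keeps the general form of Eq. (15) for every class, including the last one, whose second term vanishes only if the input proportions are normalized.

```python
def step(P, n_T):
    """One generation of the coupled maps in Eq. (16)."""
    P_next = []
    remaining = 1.0  # 1 - sum_{m' < m} P_m'(t)
    for P_m in P:
        shrunk = max(0.0, remaining - P_m)  # 1 - sum_{m' <= m} P_m'(t)
        P_next.append(remaining ** n_T - shrunk ** n_T)
        remaining = shrunk
    return P_next


def evolve(P0, n_T, generations):
    """Return the list [P(0), P(1), ..., P(generations)]."""
    history = [list(P0)]
    for _ in range(generations):
        history.append(step(history[-1], n_T))
    return history


history = evolve([0.1] * 10, n_T=2, generations=5)
for t, P in enumerate(history):
    # The norm should stay at 1 in every generation
    print(t, round(sum(P), 12), [round(p, 4) for p in P[:3]])
```

Each update of class m reads only the running sum over the classes up to m, which mirrors the one-directional coupling of the maps.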

Fig. 5

Evolution of the proportions \(P_{m}(t)\) for \(M=10\) classes during the first five generations according to the model given by Eq. (16). The initial proportions were chosen as \(P_{m}(t=0)=1/M\). Only the three proportions with the lowest fitness are labeled

As an interesting side note, we remark that despite the nonlinearity of these iterative maps, the class with the lowest fitness, \({m}=1\), obeys the simpler, decoupled scheme \(P_{1}(t+1)=1-[1-P_{1}(t)]^{n_{T}}\), which converges to \(P_{1}(t\rightarrow \infty )\rightarrow 1\) for any \(P_{1}(0)>0\). If we introduce the complementary proportion \(Q_{1}(t)\equiv 1-P_{1}(t)\), we have \(Q_{1}(t+1)=Q_{1}(t)^{n_{T}}\). Each generation raises \(Q_{1}\) to the power \(n_{T}\), so the solution is \(Q_{1}(t)=Q_{1}(0)^{n_{T}^{t}}\) and therefore \(P_{1}(t)=1-[1-P_{1}(0)]^{n_{T}^{t}}\). Hence \(P_{1}(t)\) grows monotonically and independently of the proportions \(P_{m}\) of the other classes, as one might expect. While the growth is monotonic, its time scale depends not only on \(n_{T}\), but also very sensitively on the initial value \(P_{1}(0)\). If \(P_{1}(0)\ll 1\), then for short times, as long as \(n_{T}^{t}P_{1}(0)\ll 1\), the proportion is multiplied by \(n_{T}\) in each generation, i.e., \(P_{1}(t)\approx n_{T}^{t}P_{1}(0)\), so that class 1 takes over the population after roughly \(\ln [1/P_{1}(0)]/\ln n_{T}\) generations.
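The decoupled map for the lowest-fitness class is easy to iterate directly. Applying \(Q_{1}(t+1)=Q_{1}(t)^{n_{T}}\) once per generation raises \(Q_{1}(0)\) to the power \(n_{T}\) each time, so after t generations the exponent is \(n_{T}^{t}\). The Python sketch below compares the iteration with this closed form and estimates how many generations class 1 needs to reach a given share. The function names and the `level` parameter are hypothetical and only serve the illustration.

```python
import math


def iterate_P1(P1_0, n_T, generations):
    """Iterate P_1(t+1) = 1 - [1 - P_1(t)]^{n_T} and return all values."""
    values = [P1_0]
    for _ in range(generations):
        values.append(1.0 - (1.0 - values[-1]) ** n_T)
    return values


def closed_form_P1(P1_0, n_T, t):
    """P_1(t) = 1 - [1 - P_1(0)]^(n_T^t), from Q_1(t+1) = Q_1(t)^{n_T}."""
    return 1.0 - (1.0 - P1_0) ** (n_T ** t)


def takeover_time(P1_0, n_T, level=0.5):
    """Smallest integer t at which the closed form reaches `level`.

    Assumes 0 < P1_0 < 1, 0 < level < 1 and n_T >= 2.
    """
    # Solve [1 - P1_0]^(n_T^t) <= 1 - level for t
    ratio = math.log(1.0 - level) / math.log(1.0 - P1_0)
    return max(0, math.ceil(math.log(ratio) / math.log(n_T)))


for P1_0 in (0.1, 0.01, 0.001):
    print(P1_0, [takeover_time(P1_0, n_T) for n_T in (2, 3, 5)])
```

The printed takeover times grow only logarithmically as \(P_{1}(0)\) decreases, and they shrink with increasing tournament size \(n_{T}\).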

On the opposite side, if m matches the total number of classes, i.e., m = M, then the iteration scheme for the class with the largest fitness \(f_{M}\) simplifies to

$$\begin{aligned} P_{M}(t+1) =\left[ 1-\sum \limits _{m'=1}^{M-1}P_{m'}(t)\right] ^{n_{T}} =[1-\{1-P_{M}(t)\}]^{n_{T}}=P_{M}(t)^{n_{T}} \end{aligned}$$
(17)

This permits us to find the complete time evolution for \(t=1,2,\ldots ,\) as \(P_{M}(t)=P_{M}(0)^{n_{T}^{t}}\), which is a universal monotonic decay that is faster than exponential, since \(\ln P_{M}(t)=n_{T}^{t}\ln P_{M}(0)\).

The time evolution of all the other proportions \(P_{m}(t)\) with \(m\ne 1\) and \(m\ne M\) can be non-monotonic. As an example, Fig. 5 shows the evolution of \(M=10\) classes with \(P_{m}(t=0)=1/M\) for the first five generations and a tournament size \(n_{T}=2\). We see that the low-fitness proportions \(P_{m}(t)\) (for \({m}=1,\ldots ,5\)) increase first and then decay, except \(P_{1}(t)\), which approaches 1 monotonically.
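The setting quoted for Fig. 5 (\(M=10\), \(P_{m}(0)=1/M\), \(n_{T}=2\), five generations) can be replayed numerically with the maps of Eq. (16). The Python sketch below does this and sorts the classes by the qualitative shape of their trajectories. The helpers `step` and `classify` are hypothetical names introduced for the illustration.

```python
def step(P, n_T):
    """One generation of the coupled maps in Eq. (16)."""
    P_next = []
    remaining = 1.0  # 1 - sum_{m' < m} P_m'(t)
    for P_m in P:
        shrunk = max(0.0, remaining - P_m)
        P_next.append(remaining ** n_T - shrunk ** n_T)
        remaining = shrunk
    return P_next


def classify(series):
    """Label a trajectory as increasing, decreasing or non-monotonic."""
    rises = any(b > a for a, b in zip(series, series[1:]))
    falls = any(b < a for a, b in zip(series, series[1:]))
    if rises and falls:
        return "non-monotonic"
    return "increasing" if rises else "decreasing"


M, n_T, generations = 10, 2, 5
history = [[1.0 / M] * M]
for _ in range(generations):
    history.append(step(history[-1], n_T))

# Class labels m = 1, ..., M as in the text (class 1 has the lowest fitness)
labels = {m + 1: classify([P[m] for P in history]) for m in range(M)}
for m, label in labels.items():
    print(m, label, [round(P[m - 1], 4) for P in history])
```

In this setting the trajectories of classes 2 to 5 come out as non-monotonic, class 1 as increasing and classes 6 to 10 as decreasing, in line with the description of Fig. 5.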

About this article

Cite this article

Gong, C., Bryan, J., Furcoiu, A. et al. Evolutionary Symbolic Regression from a Probabilistic Perspective. SN COMPUT. SCI. 3, 209 (2022). https://doi.org/10.1007/s42979-022-01094-0

  • DOI: https://doi.org/10.1007/s42979-022-01094-0
