A generalized decision tree ensemble based on the NeuralNetworks architecture: Distributed Gradient Boosting Forest (DGBF)

Abstract

Tree ensemble algorithms such as RandomForest and GradientBoosting are currently the dominant methods for modeling discrete or tabular data. However, they are unable to perform the hierarchical representation learning from raw data that NeuralNetworks achieve thanks to their multi-layered structure, which is a key feature for DeepLearning problems and for modeling unstructured data. This limitation stems from the fact that tree algorithms cannot be trained with back-propagation because of their mathematical nature. In this work, however, we demonstrate that the mathematical formulations of bagging and boosting can be combined to define a graph-structured tree-ensemble algorithm whose representation learning is naturally distributed among the trees (without using back-propagation). We call this novel approach Distributed Gradient Boosting Forest (DGBF) and we demonstrate that both RandomForest and GradientBoosting can be expressed as particular graph architectures of DGBF. Finally, we show that the distributed learning outperforms both RandomForest and GradientBoosting in 7 out of 9 datasets.

Data Availability

All the data and materials are available at https://doi.org/10.5281/zenodo.7236216.

Notes

  1. https://doi.org/10.5281/zenodo.7236216

References

  1. Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G (2022) Deep Neural Networks and Tabular Data: A Survey. IEEE Trans Neural Netw Learn Syst 1–21. https://doi.org/10.1109/TNNLS.2022.3229161

  2. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? J Mach Learn Res 15(90):3133–3181

  3. Breiman L, Friedman JH, Olshen RA, Stone CJ (1983) Classification and Regression Trees

  4. Bengio Y, Mesnil G, Dauphin Y, Rifai S (2013) Better Mixing via Deep Representations. In: Dasgupta S, McAllester D, editors. Proceedings of the 30th International Conference on Machine Learning. vol. 28 of Proceedings of Machine Learning Research. Atlanta, Georgia, USA: PMLR; p. 552–560. Available from: https://proceedings.mlr.press/v28/bengio13.html

  5. Bengio Y, Courville A, Vincent P (2013) Representation Learning: A Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. https://doi.org/10.1109/TPAMI.2013.50

  6. Kontschieder P, Fiterau M, Criminisi A, Bulo SR (2015) Deep Neural Decision Forests. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

  7. Biau G, Scornet E, Welbl J (2016) Neural Random Forests. Sankhya A 81. https://doi.org/10.1007/s13171-018-0133-y

  8. Breiman L (2001) Random Forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324

  9. Friedman JH (2000) Greedy Function Approximation: A Gradient Boosting Machine. Ann Stat 29:1189–1232

  10. Dorogush AV, Gulin A, Gusev G, Kazeev N, Prokhorenkova LO, Vorobev A (2017) Fighting biases with dynamic boosting. CoRR. arXiv:1706.09516

  11. Zhang G, Lu Y (2012) Bias-corrected random forests in regression. J Appl Stat 39(1):151–160. https://doi.org/10.1080/02664763.2011.578621

  12. Mentch L, Hooker G (2016) Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J Mach Learn Res 17(1):841–881

  13. Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc

  14. Pavlov DY, Gorodilov A, Brunk CA (2010) BagBoo: A Scalable Hybrid Bagging-the-Boosting Model. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM’10. New York, NY, USA: Association for Computing Machinery; p. 1897–1900

  15. Jafarzadeh H, Mahdianpari M, Gill E, Mohammadimanesh F, Homayouni S (2021) Bagging and Boosting Ensemble Classifiers for Classification of Multispectral, Hyperspectral and PolSAR Data: A Comparative Evaluation. Remote Sensing. 13(21). https://doi.org/10.3390/rs13214405

  16. Ghosal I, Hooker G (2021) Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and Its Variance Estimate. J Comput Graph Stat 30(2):493–502. https://doi.org/10.1080/10618600.2020.1820345

  17. Chatterjee S, Das A (2022) An ensemble algorithm integrating consensus clustering with feature weighting based ranking and probabilistic fuzzy logic-multilayer perceptron classifier for diagnosis and staging of breast cancer using heterogeneous datasets. Appl Intell. https://doi.org/10.1007/s10489-022-04157-0

  18. Rashid M, Kamruzzaman J, Imam T, Wibowo S, Gordon S (2022) A tree-based stacking ensemble technique with feature selection for network intrusion detection. Appl Intell 52(9):9768–9781. https://doi.org/10.1007/s10489-021-02968-1

  19. Feng J, Yu Y, Zhou ZH (2018) Multi-Layered Gradient Boosting Decision Trees. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18. Red Hook, NY, USA: Curran Associates Inc. p. 3555–3565

  20. Morid MA, Kawamoto K, Ault T, Dorius J, Abdelrahman S (2018) Supervised Learning Methods for Predicting Healthcare Costs: Systematic Literature Review and Empirical Evaluation. AMIA Annu Symp Proc 2017:1312–1321

  21. Yang H, Luo Y, Ren X, Wu M, He X, Peng B et al (2021) Risk Prediction of Diabetes: Big data mining with fusion of multifarious physical examination indicators. Information Fusion. https://doi.org/10.1016/j.inffus.2021.02.015

  22. Iwendi C, Bashir AK, Peshkar A, Sujatha R, Chatterjee JM, Pasupuleti S et al (2020) COVID-19 Patient Health Prediction Using Boosted Random Forest Algorithm. Frontiers in Public Health. 8. https://doi.org/10.3389/fpubh.2020.00357

  23. Hew KF, Hu X, Qiao C, Tang Y (2020) What predicts student satisfaction with MOOCs: A gradient boosting trees supervised machine learning and sentiment analysis approach. Comput Educ 145:103724. https://doi.org/10.1016/j.compedu.2019.103724

  24. Lu H, Cheng F, Ma X, Hu G (2020) Short-term prediction of building energy consumption employing an improved extreme gradient boosting model: A case study of an intake tower. Energy 203:117756. https://doi.org/10.1016/j.energy.2020.117756

  25. Karasu S, Altan A (2019) Recognition Model for Solar Radiation Time Series based on Random Forest with Feature Selection Approach. In: 2019 11th International Conference on Electrical and Electronics Engineering (ELECO) p. 8–11

  26. Lee TH, Ullah A, Wang R (2020) In: Fuleky P, editor. Bootstrap Aggregating and Random Forest. Cham: Springer International Publishing p. 389–429. Available from: https://doi.org/10.1007/978-3-030-31150-6_13

  27. Carmona P, Climent F, Momparler A (2019) Predicting failure in the U.S. banking sector: An extreme gradient boosting approach. Int Rev Econ Finance 61:304–323. https://doi.org/10.1016/j.iref.2018.03.008

  28. Delgado-Panadero Á, Hernández-Lorca B, García-Ordás MT, Benítez-Andrades JA (2022) Implementing local-explainability in Gradient Boosting Trees: Feature Contribution. Inf Sci 589:199–212. https://doi.org/10.1016/j.ins.2021.12.111

  29. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group

Acknowledgements

I want to thank Sara San Luís Rodriguez for her selfless support and her perfectionism, and also Bea Hernández Lorca, because the help and questions she planted two years ago are the seeds of today's trees.

Author information

Corresponding author

Correspondence to José Alberto Benítez-Andrades.

Ethics declarations

Competing interests

The authors declare that they have no conflicts of interest related to this work. The people involved in the experiment were informed and formally gave their consent.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

José Alberto Benítez-Andrades and María Teresa García-Ordás contributed equally to this work.

Appendices

Appendix A

Appendix B: Background on CART, RandomForest and GradientBoosting bias

B.1 CART bias

Bias type 1

During the learning process, node regions are computed iteratively until an end condition is reached. Because of its exhaustive nature, the predictions tend to be biased, producing high-variance predictions that overfit the dataset \(\{x_i\}\). This bias can be defined as

$$\begin{aligned} Bias_1(x) = E[Y \mid X=x]-E[y_i \mid x_i \in R_j] \mid _{x \in R_j} . \end{aligned}$$
(16)

Bias type 2

During the learning process, the splitting thresholds that produce the nodes are computed using only the middle points between two contiguous points from the dataset; no other value can be a threshold candidate. This produces a bias in the prediction for two reasons: first, a lack of learning capacity in underpopulated areas, and second, high variance in overpopulated areas. This can be visually understood in Fig. 6.
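
To make the threshold restriction concrete, the following minimal sketch (Python; the helper name midpoint_threshold_candidates is ours, not from the paper) enumerates the only values a CART split can choose from for a single feature: the midpoints between contiguous sorted training values. A sparse gap such as (0.3, 10.0] therefore gets a single, coarse candidate.

import numpy as np

def midpoint_threshold_candidates(x):
    # Candidate split thresholds for one feature: only the midpoints between
    # contiguous distinct training values are considered by CART.
    x_sorted = np.unique(x)                      # distinct values, sorted
    return (x_sorted[:-1] + x_sorted[1:]) / 2    # midpoints between neighbours

# Toy example: a dense region around 0 and a single far-away point at 10.
x = np.array([0.0, 0.1, 0.2, 0.3, 10.0])
print(midpoint_threshold_candidates(x))          # [0.05 0.15 0.25 5.15]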

B.2 Ensemble algorithms

Ensemble algorithms mitigate the biased predictions of CARTs by combining the predictions of many models trained separately. There are several ensemble algorithms, but the two main ones are GradientBoosting and RandomForest.

RandomForest-Bagging

One way of combining the predictions of many trees to reduce bias is to aggregate the results by taking the mean. Given a dataset, the learning process of a tree is deterministic, so to make multiple trees produce different predictions, each of them is trained on a different bootstrap subsample of the dataset. This technique is called \(bagging\) [8].

$$\begin{aligned} F(x) = \frac{1}{n_{trees}} \sum _{j=0}^{n_{trees}} h_j(x) , \end{aligned}$$
(17)

where \(h_j(x)\) is trained to minimize the loss function, \(L(y,f(x))\), over a \(bagging\) subsample \(\{x\}_j\)

$$\begin{aligned} h_j(x) = \underset{h}{\mathrm {arg\ min}}\ L(y_i, h(x_i)) / x_i \in \{x\}_j . \end{aligned}$$
(18)

This technique is based on the central limit theorem, by which we expect the variance arising from the CART biases to be reduced by averaging over enough tree predictions. All the trees learn in parallel; this kind of learning is called "horizontal learning".
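
As a concrete reading of Eqs. (17) and (18), the sketch below (Python, using scikit-learn's DecisionTreeRegressor as the base CART; the function names are ours and not part of the paper's code) fits each tree on its own bootstrap subsample and averages the predictions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagging_ensemble(X, y, n_trees=100, random_state=0):
    # Eq. (18): every tree h_j minimizes the loss over its own bootstrap subsample {x}_j.
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))    # sample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagging_ensemble(trees, X):
    # Eq. (17): F(x) = (1 / n_trees) * sum_j h_j(x)
    return np.mean([tree.predict(X) for tree in trees], axis=0)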

GradientBoosting-Boosting

In contrast, in GradientBoosting each tree does not learn over a different dataset sample to minimize the loss; instead, the algorithm follows a "stage-wise" optimization approach in which each tree is added to the ensemble to reduce the global loss of the previous trees. The final prediction is the sum of all the trees in the ensemble. This process is called \(boosting\) [9]

$$\begin{aligned} F(x) = \sum _{m=0}^M h_m(x) , \end{aligned}$$
(19)

where \(M\) is the number of trees in the ensemble. To reduce the loss function of the previous trees, \(L(y,F_{m-1}(x))\), each new tree is fitted to the gradient of the loss function of the previous predictors:

$$\begin{aligned} h_m(x) = \rho_m g_m(x) , \end{aligned}$$
(20)
$$\begin{aligned} g_m(x)&= E_y \left[ \frac{\partial L(y,F(x))}{\partial F(x)}\right] _{F=F_{m-1}} , \\ \rho_m&= \underset{\rho '}{\mathrm {arg\ min}}\ E_{y,x} \left[ L\left( y,F_{m-1}(x)-\rho ' g_m(x)\right) \right] \end{aligned}$$
(21)

In the previous formula, the value of \(E_y[\cdot]\) can only be computed knowing the density function \(P(y \mid x)\). In real-world problems we never have this density function; consequently, \(boosting\) is approximated using finite data by assuming regularity in the \(P(y \mid x)\) distribution. To do so, each tree is trained on pseudo-responses that are computed (in the case of the \(RMSE\) loss) as the difference between the target and the predictions of the current ensemble of trees (the residual errors).

$$\begin{aligned} h_m(x) \simeq \underset{h'}{\mathrm {arg\ min}}\ E \left[ L(y-F_{m-1}(x),h'(x))\right] , \end{aligned}$$
(22)

While CART tries to reduce the loss during training by splitting leaf nodes into child nodes, the \(boosting\) algorithm tries to reduce the loss of each tree by adding another tree. However, \(boosting\) generalizes better than the within-tree optimization because, at each stage, the trees optimize the loss function using the entire dataset and not only the subsample of the previous node.
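
As a concrete reading of Eqs. (19)-(22) for the \(RMSE\) loss, the sketch below (Python with scikit-learn trees; the shrinkage factor learning_rate and the function names are our own simplifications and stand in for the line search over \(\rho_m\)) fits each new tree on the residual pseudo-responses of the current ensemble and adds it to the prediction.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Stage-wise boosting for the RMSE loss: each tree is fitted on the
    # pseudo-responses y - F_{m-1}(x), i.e. the residuals of the current ensemble.
    F = np.full(len(y), y.mean())                 # constant initial predictor F_0
    trees = []
    for _ in range(n_trees):
        residuals = y - F                         # pseudo-responses (Eq. (22), RMSE case)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F = F + learning_rate * tree.predict(X)   # additive update of Eq. (19)
        trees.append(tree)
    return y.mean(), trees

def predict_gradient_boosting(base_value, trees, X, learning_rate=0.1):
    return base_value + learning_rate * sum(tree.predict(X) for tree in trees)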

B.3 Bagging and boosting bias

Both \(bagging\) and \(boosting\) are ensemble techniques that rely on different mathematical approaches. \(Bagging\) generalizes by reducing variance: it averages the predictions of multiple predictors trained to predict the same response. Meanwhile, \(boosting\) reduces the global training loss of the previous trees by training a new tree on the pseudo-responses. Even though these techniques are usually applied separately, the two approaches are not incompatible.

GradientBoosting Bias

In the finite-data approximation, we assume that training on the pseudo-responses is representative of the gradient. However, computing the pseudo-responses on the same training data from one tree to the next produces biased pseudo-responses. It is demonstrated in [10] that the bias induced by the finite-data approximation is:

$$\begin{aligned} Bias_{GB}(x) \equiv E[F_{t-1}(x')]_{x'=x} - E[F_{t-1}(x') \mid x'=x_k]_{x'=x} , \end{aligned}$$
(23)

RandomForest Bias

The reduction of the CART bias by the RandomForest ensemble is based on reducing the prediction variance by averaging the predictions of a large number of trees. However, averaging over all the trees does not ensure convergence to a global minimum of the loss function everywhere (i.e. in all areas of the feature space). This is because individual decision trees are robust enough to learn outlier regions, a capability that is lost when averaging with trees that have not been trained on those outliers.

$$\begin{aligned} Bias_{RF}(x) \equiv E \left[ h(x) \mid x \in R_j \right] -E \left[ h'(x) \mid x \in R'_j \right] , \end{aligned}$$
(24)

where \(h(x)\) and \(h'(x)\) are trees trained with different subsamples. If these subsamples have different statistics (for instance, because of a small outlier region), the ensemble is not flexible enough to learn the characteristic behavior of that region.
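
A tiny numerical illustration of this effect (Python with scikit-learn; the dataset and numbers are synthetic and invented for this example, not taken from the paper): with a single outlier point, roughly a fraction \(e^{-1}\approx 0.37\) of the bootstrap samples do not contain it, so those trees predict the bulk value at the outlier location and the bagged average is pulled well below the outlier's target.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# 200 "bulk" points with y ~ 0 plus one outlier region: a single point at x = 10 with y = 5.
X = np.concatenate([rng.normal(0.0, 1.0, 200), [10.0]]).reshape(-1, 1)
y = np.concatenate([rng.normal(0.0, 0.1, 200), [5.0]])

preds = []
for _ in range(500):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap subsample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict([[10.0]])[0])           # prediction at the outlier location

# Trees whose subsample misses the outlier predict ~0 there, so the average lands far from 5.
print(np.mean(preds))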

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Delgado-Panadero, Á., Benítez-Andrades, J.A. & García-Ordás, M.T. A generalized decision tree ensemble based on the NeuralNetworks architecture: Distributed Gradient Boosting Forest (DGBF). Appl Intell 53, 22991–23003 (2023). https://doi.org/10.1007/s10489-023-04735-w

  • DOI: https://doi.org/10.1007/s10489-023-04735-w
