DOI: 10.1145/775047.775117

SECRET: a scalable linear regression tree algorithm

Published: 23 July 2002

Abstract

Developing regression models for large datasets that are both accurate and easy to interpret is a very important data mining problem. Regression trees with linear models in the leaves satisfy both these requirements, but thus far, no truly scalable regression tree algorithm is known. This paper proposes a novel regression tree construction algorithm (SECRET) that produces trees of high quality and scales to very large datasets. At every node, SECRET uses the EM algorithm for Gaussian mixtures to find two clusters in the data and to locally transform the regression problem into a classification problem based on closeness to these clusters. Goodness-of-split measures, such as the Gini gain, can then be used to determine the split variable and the split point, much like in classification tree construction. Scalability of the algorithm can be achieved by employing scalable versions of the EM and classification tree construction algorithms. An experimental evaluation on real and artificial data shows that SECRET has accuracy comparable to other linear regression tree algorithms but takes orders of magnitude less computation time for large datasets.
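To make the split-selection idea above concrete, the sketch below shows one way a node-level split could be chosen: fit a two-component Gaussian mixture with EM, label each point by its closer cluster, and pick the threshold on a predictor that maximizes the Gini gain on those labels, as in classification tree construction. This is a minimal illustration, not the authors' implementation: scikit-learn's GaussianMixture stands in for the scalable EM variant the paper relies on, clustering in the joint (x, y) space is an assumption about what "the data" at a node means, and the helpers gini and best_split are hypothetical names.

import numpy as np
from sklearn.mixture import GaussianMixture

def gini(labels):
    # Gini impurity of a binary 0/1 label vector.
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split(X, y, feature):
    # Cluster the node's data into two groups with EM on a Gaussian mixture,
    # then treat cluster membership as a class label and score candidate
    # thresholds on `feature` by Gini gain, as in classification trees.
    Z = np.column_stack([X, y])  # assumption: cluster in the joint (x, y) space
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)

    parent = gini(labels)
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(X[:, feature]):
        left = X[:, feature] <= threshold
        right = ~left
        if not left.any() or not right.any():
            continue  # skip degenerate splits
        weighted_child = (left.mean() * gini(labels[left]) +
                          right.mean() * gini(labels[right]))
        gain = parent - weighted_child
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

Once a split is fixed this way, a linear model would be fitted in each resulting leaf; the scalability claimed in the abstract comes from replacing the mixture fit and the exhaustive threshold scan above with scalable EM and classification tree construction algorithms.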



Published In

KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
July 2002, 719 pages
ISBN: 158113567X
DOI: 10.1145/775047
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

KDD02

Acceptance Rates

KDD '02 Paper Acceptance Rate: 44 of 307 submissions, 14%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


Cited By

  • (2024) Fast linear model trees by PILOT. Machine Learning. DOI: 10.1007/s10994-024-06590-3. Online publication date: 8-Jul-2024
  • (2023) Frequent Itemset Mining Algorithm Based on Linear Table. Journal of Database Management 34(1):1-21. DOI: 10.4018/JDM.318450. Online publication date: 24-Feb-2023
  • (2023) Machine Learning Approaches in Brillouin Distributed Fiber Optic Sensors. Sensors 23(13):6187. DOI: 10.3390/s23136187. Online publication date: 6-Jul-2023
  • (2022) The Bigger Picture. Management Science 68(1):189-210. DOI: 10.1287/mnsc.2020.3911. Online publication date: 1-Jan-2022
  • (2022) INN: An Interpretable Neural Network for AI Incubation in Manufacturing. ACM Transactions on Intelligent Systems and Technology 13(5):1-23. DOI: 10.1145/3519313. Online publication date: 21-Jun-2022
  • (2022) A hybrid approach to enhance the lifespan of WSNs in nuclear power plant monitoring system. Scientific Reports 12(1). DOI: 10.1038/s41598-022-08075-6. Online publication date: 14-Mar-2022
  • (2021) Curvature-Oriented Splitting for Multivariate Model Trees. 2021 IEEE Symposium Series on Computational Intelligence (SSCI), 01-09. DOI: 10.1109/SSCI50451.2021.9659858. Online publication date: 5-Dec-2021
  • (2021) Learning with continuous piecewise linear decision trees. Expert Systems with Applications 168:114214. DOI: 10.1016/j.eswa.2020.114214. Online publication date: Apr-2021
  • (2020) Cracking the Black Box. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3154-3162. DOI: 10.1145/3394486.3403367. Online publication date: 23-Aug-2020
  • (2020) On the Functional Equivalence of TSK Fuzzy Systems to Neural Networks, Mixture of Experts, CART, and Stacking Ensemble Regression. IEEE Transactions on Fuzzy Systems 28(10):2570-2580. DOI: 10.1109/TFUZZ.2019.2941697. Online publication date: Oct-2020
