
BOAT—optimistic decision tree construction

Published: 01 June 1999

Abstract

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms for constructing decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree.
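To make the setting concrete, here is a minimal sketch of how a decision tree classifies a record: the record is routed from the root to a leaf, testing one attribute at each internal node. The node layout, attribute names, and labels below are illustrative inventions, not taken from the paper.

```python
def predict(node, record):
    """Walk the tree from the root until a leaf is reached, then return its label."""
    while "label" not in node:
        # Internal node: test one attribute and descend left or right.
        if record[node["attr"]] <= node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

# A two-level tree over hypothetical attributes "age" and "income".
tree = {
    "attr": "age", "split": 30,
    "left": {"label": "no"},
    "right": {
        "attr": "income", "split": 50,
        "left": {"label": "no"},
        "right": {"label": "yes"},
    },
}

print(predict(tree, {"age": 25, "income": 80}))  # -> no
print(predict(tree, {"age": 40, "income": 80}))  # -> yes
```

Level-by-level construction of such a tree is what forces one scan of the training database per level, which is the cost BOAT attacks.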
We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires additional scans over subsets of the data; in practice this situation rarely arises, and it can be addressed at little added cost.
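The shape of this optimistic approach can be sketched for a single numeric split, under heavy simplifying assumptions: one attribute, and the median standing in for a real impurity-based split criterion. All function names and parameters below are illustrative; this is the outline of the idea, not the paper's algorithm.

```python
import random
import statistics

def best_split(values):
    """Stand-in split criterion (a real tree would minimize an impurity measure)."""
    return statistics.median(values)

def optimistic_split(data, sample_size=1000, n_bootstrap=30):
    # Phase 1: build from an in-memory sample, bootstrapping it to bracket
    # an interval [lo, hi] within which the true split point should fall.
    sample = random.sample(data, min(sample_size, len(data)))
    estimates = [best_split(random.choices(sample, k=len(sample)))
                 for _ in range(n_bootstrap)]
    lo, hi = min(estimates), max(estimates)

    # Phase 2: a single scan over the full data validates the guess.
    # (BOAT confines this check to candidate split points inside [lo, hi];
    # recomputing the exact split here just keeps the sketch short.)
    true_split = best_split(data)
    if lo <= true_split <= hi:
        return true_split  # optimistic subtree confirmed as-is
    # Otherwise the affected subtree would be discarded and rebuilt from
    # the relevant subset of the data -- the rare, corrective case.
    return true_split

random.seed(0)
data = [random.gauss(50, 10) for _ in range(10_000)]
print(round(optimistic_split(data), 1))
```

The design point the sketch illustrates: the expensive full-data scan is spent verifying many sample-built nodes at once rather than building one tree level, which is how several levels can be finished in two scans.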
Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete rebuild.




Published In

ACM SIGMOD Record, Volume 28, Issue 2
June 1999
599 pages
ISSN: 0163-5808
DOI: 10.1145/304181

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data
June 1999
604 pages
ISBN: 1581130848
DOI: 10.1145/304182
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Cited By

  • (2022) Classification Algorithms and Dataflow Implementation. In Implementation of Machine Learning Algorithms Using Control-Flow and Dataflow Paradigms, pp. 46-77. DOI: 10.4018/978-1-7998-8350-0.ch003. Online publication date: 11-Mar-2022.
  • (2022) Comparison of machine learning models for bluetongue risk prediction: a seroprevalence study on small ruminants. BMC Veterinary Research 18:1. DOI: 10.1186/s12917-022-03486-z. Online publication date: 9-Nov-2022.
  • (2022) Constraint Enforcement on Decision Trees: A Survey. ACM Computing Surveys 54:10s, pp. 1-36. DOI: 10.1145/3506734. Online publication date: 13-Sep-2022.
  • (2022) A State-of-the-Art Review on Machine Learning-Based Multiscale Modeling, Simulation, Homogenization and Design of Materials. Archives of Computational Methods in Engineering 30:1, pp. 191-222. DOI: 10.1007/s11831-022-09795-8. Online publication date: 5-Aug-2022.
  • (2021) Inmplode: A framework to interpret multiple related rule-based models. Expert Systems 38:6. DOI: 10.1111/exsy.12702. Online publication date: 18-May-2021.
  • (2020) Dynamic programming based fuzzy partition in fuzzy decision tree induction. Journal of Intelligent & Fuzzy Systems 39:5, pp. 6757-6772. DOI: 10.3233/JIFS-191497. Online publication date: 1-Jan-2020.
  • (2020) Optimizing Stream Data Classification Using Improved Hoeffding Bound. In Advances in Communication and Computational Technology, pp. 235-243. DOI: 10.1007/978-981-15-5341-7_19. Online publication date: 14-Aug-2020.
  • (2020) Improvement of Chromatographic Peaks Qualitative Analysis for Power Transformer Base on Decision Tree. In Genetic and Evolutionary Computing, pp. 429-436. DOI: 10.1007/978-981-15-3308-2_46. Online publication date: 13-Mar-2020.
  • (2019) Evidential Decision Tree Based on Belief Entropy. Entropy 21:9 (897). DOI: 10.3390/e21090897. Online publication date: 16-Sep-2019.
  • (2019) Application of machine learning in software engineering: an overview. PROBLEMS IN PROGRAMMING. DOI: 10.15407/pp2019.04.092. Online publication date: Dec-2019.
