Decision trees for mining data streams

Published: 01 January 2006

Abstract

In this paper we study the problem of constructing accurate decision tree models from data streams. Learning from data streams is an incremental task that requires incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. We have extended VFDT in three directions: the ability to deal with continuous data, the use of more powerful classification techniques at tree leaves, and the ability to detect and react to concept drift. The VFDTc system can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is its ability to obtain performance similar to that of a standard decision tree algorithm even for medium-sized datasets. This is relevant due to the any-time property. We also extend VFDTc with the ability to deal with concept drift by continuously monitoring differences between two class distributions of the examples: the distribution when a node was built and the distribution in a time window of the most recent examples. We study the sensitivity of VFDTc with respect to drift, noise, the order of examples, and the initial parameters in different problems, and demonstrate its utility on large and medium-sized datasets.
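The drift-monitoring idea in the abstract (comparing the class distribution seen when a node was built with the distribution in a window of recent examples) can be illustrated with a minimal Python sketch. The abstract does not say how the two distributions are compared, so everything specific here is an assumption made for illustration: the class name `NodeDriftMonitor`, the use of total variation distance, the window size, and the threshold are all hypothetical and are not taken from VFDTc.

```python
from collections import Counter, deque


class NodeDriftMonitor:
    """Compare a node's class distribution at build time with a recent window.

    Illustrative only: the distance measure and threshold are assumptions,
    not the method used by VFDTc.
    """

    def __init__(self, reference_labels, window_size=100, threshold=0.3):
        if not reference_labels:
            raise ValueError("reference_labels must not be empty")
        # Class distribution observed when the node was built.
        self.reference = self._normalize(Counter(reference_labels))
        # Sliding window that keeps only the most recent labels.
        self.window = deque(maxlen=window_size)
        # Hypothetical cut-off on the distance between the two distributions.
        self.threshold = threshold

    @staticmethod
    def _normalize(counts):
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}

    def distance(self):
        """Total variation distance between reference and window distributions."""
        if not self.window:
            return 0.0
        recent = self._normalize(Counter(self.window))
        labels = set(self.reference) | set(recent)
        return 0.5 * sum(
            abs(self.reference.get(c, 0.0) - recent.get(c, 0.0)) for c in labels
        )

    def update(self, label):
        """Add one label; return True if drift is signalled on a full window."""
        self.window.append(label)
        full = len(self.window) == self.window.maxlen
        return full and self.distance() > self.threshold
```

In this sketch, a caller would invoke `update(label)` once per arriving example at a node and react (for example, by pruning or relearning the subtree) when it returns `True`. Waiting for a full window before signalling is a design choice of the sketch that avoids false alarms from very small samples.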

Published In

Intelligent Data Analysis, Volume 10, Issue 1
January 2006
99 pages

Publisher

IOS Press

Netherlands

Author Tags

  1. Concept Drift
  2. Data Streams
  3. Incremental Decision Trees
