Abstract
Many databases have grown to the point where they cannot fit into the fast memory of even large-memory machines, to say nothing of current workstations. If we want to use these databases to construct predictions of various characteristics, then, since the usual methods require that all data be held in fast memory, various work-arounds have to be used. This paper studies one such class of methods, which are computationally fast and give accuracy comparable to what could have been obtained had all the data been held in core. The procedure takes small pieces of the data, grows a predictor on each small piece, and then pastes these predictors together. A version is given that scales up to terabyte data sets. The methods are also applicable to on-line learning.
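As a rough illustration of the core idea (a minimal sketch, not the paper's exact procedure), the following Python snippet grows a decision tree on each of many small random "bites" of the data and then pastes their predictions together by plurality vote. The bite size, number of bites, and use of scikit-learn trees are illustrative assumptions, not taken from the paper.

```python
# Sketch of "pasting small votes": train a predictor on each small random
# bite of the data, then aggregate the predictors by plurality vote.
# Illustrative only; bite_size / n_bites are hypothetical parameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

n_bites, bite_size = 25, 200            # many small pieces of the data
predictors = []
for _ in range(n_bites):
    idx = rng.choice(len(X), size=bite_size, replace=False)  # one small bite
    predictors.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Paste the small votes together: each predictor votes, majority wins.
votes = np.stack([p.predict(X) for p in predictors])
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training-set vote accuracy:", (y_hat == y).mean())
```

In this sketch each predictor only ever sees a bite-sized subset, so no single fit requires the full data set in memory; only the aggregation step touches all predictions.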