Abstract
Capacity control in perceptron decision trees is typically performed by controlling their size. We prove that other quantities can be as relevant for reducing their flexibility and combating overfitting. In particular, we provide an upper bound on the generalization error which depends both on the size of the tree and on the margin of the decision nodes, so enlarging the margins in a perceptron decision tree reduces this upper bound. Based on this analysis, we introduce three new algorithms that induce large-margin perceptron decision trees. To assess the effect of the large margin bias, OC1 (Murthy, Kasif, & Salzberg, Journal of Artificial Intelligence Research, 2, 1–32, 1994), a well-known system for inducing perceptron decision trees, is used as the baseline algorithm. An extensive experimental study on real-world data showed that all three new algorithms performed better than, or at least not significantly worse than, OC1 on all but one dataset, and OC1 performed worse than the best margin-based method on every dataset.
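The margin criterion behind the abstract can be illustrated with a minimal sketch. This is not the paper's actual induction algorithms; it only shows the node-level idea under a simplifying assumption: given several candidate separating hyperplanes at a tree node, prefer the one with the largest minimum geometric margin on the node's data. The function names and the toy data are illustrative inventions.

```python
# Hedged sketch of the large-margin split criterion at a single node of a
# perceptron decision tree. This is NOT the paper's algorithms; it only
# illustrates the preference: among candidate hyperplanes, pick the one
# whose minimum geometric margin over the node's sample is largest.
import numpy as np

def geometric_margin(w, b, X, y):
    """Minimum signed distance of the labelled points to the hyperplane
    w.x + b = 0; positive iff every point lies on its correct side."""
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

def best_split(candidates, X, y):
    """Return the candidate (w, b) with the largest minimum margin."""
    return max(candidates, key=lambda wb: geometric_margin(wb[0], wb[1], X, y))

# Toy 2-D sample with labels in {-1, +1}, separable by the line x0 = 0.
X = np.array([[-2.0, 0.0], [-1.0, 1.0], [1.0, -1.0], [2.0, 0.5]])
y = np.array([-1, -1, 1, 1])

# Two separating hyperplanes; the axis-aligned one has the larger margin
# (1.0 versus roughly 0.45), so the margin criterion selects it.
candidates = [(np.array([1.0, 0.0]), 0.0),
              (np.array([1.0, 0.5]), 0.0)]
w, b = best_split(candidates, X, y)
```

By the bound described in the abstract, selecting the larger-margin split at each node tightens the upper bound on generalization error for a tree of the same size.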
References
Alon, N., Ben-David, S., Cesa-Bianchi, N., & Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4), 615–631.
Anthony, M. & Bartlett, P. (1994). Function learning from interpolation. Technical Report (An extended abstract appeared in Computational Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, edited by Paul Vitanyi (Lecture Notes in Artificial Intelligence, vol. 904) Springer-Verlag, Berlin, 1995, pp. 211–221).
Bartlett, P. L. & Long, P. M. (1995). Prediction, learning, uniform convergence, and scale-sensitive dimensions. Preprint, Department of Systems Engineering, Australian National University.
Bartlett, P., Long, P., & Williamson, R. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.
Bartlett, P. & Shawe-Taylor, J. (1998). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in Kernel methods-support vector learning (pp. 43–54). Cambridge, USA: MIT Press.
Bennett, K. & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Bennett, K. & Mangasarian, O. (1994a). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 29–39.
Bennett, K. & Mangasarian, O. (1994b). Serial and parallel multicategory discrimination. SIAM Journal on Optimization, 4(4), 722–734.
Bennett, K., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing. R.P.I. Math Report No. 98–100, Rensselaer Polytechnic Institute, Troy, NY.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Brodley, C. E. & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77.
Cortes, C. & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Cristianini, N., Shawe-Taylor, J., & Sykacek, P. (1998). Bayesian classifiers are large margin hyperplanes in a Hilbert space. In J. Shavlik (Ed.), Machine Learning: Proceedings of the Fifteenth International Conference (pp. 109–117). San Francisco, CA: Morgan Kaufmann Publishers.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science (pp. 382–391). Los Alamitos, CA: IEEE Computer Society Press.
Kohavi, R. (1995). A study of cross-validation and bootstrapping for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (pp. 1137–1143). San Mateo, CA: Morgan Kaufmann.
Mangasarian, O., Setiono, R., & Wolberg, W. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Proceedings on Workshop on Large-Scale Numerical Optimization (pp. 22–31). Philadelphia, PA: SIAM.
Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32.
Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Springer-Verlag.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Quinlan, J. R. & Rivest, R. (1989). Learning decision trees using the minimum description length principle. Information and Computation, 80, 227–248.
Salzberg, S. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3), 317–327.
Sankar, A. & Mammone, R. J. (1993). Growing and pruning neural tree networks. IEEE Transactions on Computers, 42, 291–299.
Schapire, R., Freund, Y., Bartlett, P. L., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In D. H. Fisher, Jr. (Ed.), Proceedings of the International Conference on Machine Learning, ICML'97 (pp. 322–330). Nashville, Tennessee: Morgan Kaufmann Publishers.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Sirat, J. A. & Nadal, J.-P. (1990). Neural trees: A new tool for classification. Network, 1, 423–438.
University of California, Irvine, Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Utgoff, P. E. (1989). Perceptron trees: A case study in hybrid concept representations. Connection Science, 1, 377–391.
Vapnik, V. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Applications, 16, 264–280.
Bennett, K.P., Cristianini, N., Shawe-Taylor, J. et al. Enlarging the Margins in Perceptron Decision Trees. Machine Learning 41, 295–313 (2000). https://doi.org/10.1023/A:1007600130808