
Generalization Error Bounds for Threshold Decision Lists

Published: 01 December 2004

Abstract

In this paper we consider the generalization accuracy of classification methods based on the iterative use of linear classifiers. The resulting classifiers, which we call threshold decision lists, act as follows. Some points of the data set to be classified are given a particular classification according to a linear threshold function (or hyperplane). These are then removed from consideration, and the procedure is iterated until all points are classified. Geometrically, we can imagine that at each stage, points of the same classification are successively chopped off from the data set by a hyperplane. We analyse theoretically the generalization properties of data classification techniques that are based on the use of threshold decision lists and on the special subclass of multilevel threshold functions. We present bounds on the generalization error in a standard probabilistic learning framework. The primary focus in this paper is on obtaining generalization error bounds that depend on the levels of separation---or margins---achieved by the successive linear classifiers. We also improve and extend previously published theoretical bounds on the generalization ability of perceptron decision trees.
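The iterative "chop off and repeat" procedure described in the abstract lends itself to a short illustration. The Python sketch below is not the paper's construction: it merely represents a threshold decision list as an ordered sequence of (weight vector, threshold, label) triples, classifies a point by the first hyperplane test it passes, and builds such a list greedily. The helper find_chopping_hyperplane is a hypothetical placeholder for any routine (e.g. a linear-programming or perceptron-style method) that returns a hyperplane whose positive side contains only points of one class.

```python
import numpy as np

def classify(decision_list, x, default_label=0):
    """Evaluate a threshold decision list on a single point x.

    decision_list is an ordered list of (w, t, label) triples; the point
    receives the label of the first test np.dot(w, x) >= t that it
    satisfies, and default_label if no test fires (the final, constant
    node of the list).
    """
    for w, t, label in decision_list:
        if np.dot(w, x) >= t:
            return label
    return default_label

def build_decision_list(X, y, find_chopping_hyperplane):
    """Greedy construction in the spirit of the abstract: at each stage a
    hyperplane chops off a set of identically labelled points, those
    points are removed, and the procedure iterates on the remainder.

    find_chopping_hyperplane(X, y) is a hypothetical helper assumed to
    return (w, t, label) such that every point x with np.dot(w, x) >= t
    has class `label`, and to chop off at least one point so that the
    loop terminates.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    decision_list = []
    while len(X) > 0:
        w, t, label = find_chopping_hyperplane(X, y)
        remaining = X @ w < t          # points not yet classified
        decision_list.append((w, t, label))
        X, y = X[remaining], y[remaining]
    return decision_list
```

Margin-based bounds of the kind studied in the paper then depend on how well separated each chopped-off set is from the points that remain at that stage.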


Cited By

  • (2023) On the complexity of PAC learning in Hilbert spaces. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 7202-7209. DOI: 10.1609/aaai.v37i6.25878. Online publication date: 7-Feb-2023.
  • (2015) Multithreshold Entropy Linear Classifier. Expert Systems with Applications: An International Journal 42(13): 5591-5606. DOI: 10.1016/j.eswa.2015.03.007. Online publication date: 1-Aug-2015.
  • (2011) Automatic threshold estimation for data matching applications. Information Sciences: an International Journal 181(13): 2685-2699. DOI: 10.1016/j.ins.2010.05.029. Online publication date: 1-Jul-2011.
  • (2008) On the Design of Cascades of Boosted Ensembles for Face Detection. International Journal of Computer Vision 77(1-3): 65-86. DOI: 10.1007/s11263-007-0060-1. Online publication date: 1-May-2008.
  • (2007) Improving SVM classifiers training using artificial samples. Proceedings of the 11th WSEAS International Conference on Computers, 140-144. DOI: 10.5555/1353956.1353982. Online publication date: 26-Jul-2007.
  • (2007) On the generalization error of fixed combinations of classifiers. Journal of Computer and System Sciences 73(5): 725-734. DOI: 10.1016/j.jcss.2006.10.017. Online publication date: 1-Aug-2007.
  • (2004) On data classification by iterative linear partitioning. Discrete Applied Mathematics 144(1-2): 2-16. DOI: 10.5555/1704842.1705043. Online publication date: 1-Nov-2004.


Published In

The Journal of Machine Learning Research, Volume 5 (December 2004), 1571 pages
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org
