
Generalization Error Bounds for Threshold Decision Lists

Published: 01 December 2004

Abstract

In this paper we consider the generalization accuracy of classification methods based on the iterative use of linear classifiers. The resulting classifiers, which we call threshold decision lists, act as follows. Some points of the data set to be classified are given a particular classification according to a linear threshold function (or hyperplane). These are then removed from consideration, and the procedure is iterated until all points are classified. Geometrically, we can imagine that at each stage, points of the same classification are successively chopped off from the data set by a hyperplane. We analyse theoretically the generalization properties of data classification techniques that are based on the use of threshold decision lists and on the special subclass of multilevel threshold functions. We present bounds on the generalization error in a standard probabilistic learning framework. The primary focus in this paper is on obtaining generalization error bounds that depend on the levels of separation---or margins---achieved by the successive linear classifiers. We also improve and extend previously published theoretical bounds on the generalization ability of perceptron decision trees.
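The iterative "chop off and repeat" procedure described in the abstract lends itself to a short illustration. The Python sketch below is not the paper's construction: it merely represents a threshold decision list as an ordered sequence of (weight vector, threshold, label) triples, classifies a point by the first hyperplane test it passes, and builds such a list greedily. The helper find_chopping_hyperplane is a hypothetical placeholder for any routine (e.g. a linear-programming or perceptron-style method) that returns a hyperplane whose positive side contains only points of one class.

```python
import numpy as np

def classify(decision_list, x, default_label=0):
    """Evaluate a threshold decision list on a single point x.

    decision_list is an ordered list of (w, t, label) triples; the point
    receives the label of the first test np.dot(w, x) >= t that it
    satisfies, and default_label if no test fires (the final, constant
    node of the list).
    """
    for w, t, label in decision_list:
        if np.dot(w, x) >= t:
            return label
    return default_label

def build_decision_list(X, y, find_chopping_hyperplane):
    """Greedy construction in the spirit of the abstract: at each stage a
    hyperplane chops off a set of identically labelled points, those
    points are removed, and the procedure iterates on the remainder.

    find_chopping_hyperplane(X, y) is a hypothetical helper assumed to
    return (w, t, label) such that every point x with np.dot(w, x) >= t
    has class `label`, and to chop off at least one point so that the
    loop terminates.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    decision_list = []
    while len(X) > 0:
        w, t, label = find_chopping_hyperplane(X, y)
        remaining = X @ w < t          # points not yet classified
        decision_list.append((w, t, label))
        X, y = X[remaining], y[remaining]
    return decision_list
```

Margin-based bounds of the kind studied in the paper then depend on how well separated each chopped-off set is from the points that remain at that stage.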


Cited By

  • (2023) On the complexity of PAC learning in Hilbert spaces. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 7202-7209. DOI: 10.1609/aaai.v37i6.25878. Online publication date: 7-Feb-2023.
  • (2015) Multithreshold Entropy Linear Classifier. Expert Systems with Applications: An International Journal 42(13): 5591-5606. DOI: 10.1016/j.eswa.2015.03.007. Online publication date: 1-Aug-2015.
  • (2011) Automatic threshold estimation for data matching applications. Information Sciences: an International Journal 181(13): 2685-2699. DOI: 10.1016/j.ins.2010.05.029. Online publication date: 1-Jul-2011.
  • (2008) On the Design of Cascades of Boosted Ensembles for Face Detection. International Journal of Computer Vision 77(1-3): 65-86. DOI: 10.1007/s11263-007-0060-1. Online publication date: 1-May-2008.
  • (2007) Improving SVM classifiers training using artificial samples. Proceedings of the 11th WSEAS International Conference on Computers, 140-144. DOI: 10.5555/1353956.1353982. Online publication date: 26-Jul-2007.
  • (2007) On the generalization error of fixed combinations of classifiers. Journal of Computer and System Sciences 73(5): 725-734. DOI: 10.1016/j.jcss.2006.10.017. Online publication date: 1-Aug-2007.
  • (2004) On data classification by iterative linear partitioning. Discrete Applied Mathematics 144(1-2): 2-16. DOI: 10.5555/1704842.1705043. Online publication date: 1-Nov-2004.


Published In

The Journal of Machine Learning Research, Volume 5 (December 2004), 1571 pages
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org
