Abstract
We study model selection strategies based on penalized empirical loss minimization. We point out a tight relationship between error estimation and data-based complexity penalization: any good error estimate may be converted into a data-based penalty function, and the performance of the estimate is governed by the quality of the error estimate. We consider several penalty functions, involving error estimates on independent test data, empirical VC dimension, empirical VC entropy, and margin-based quantities. We also consider the maximal difference between the error on the first half of the training data and the error on the second half, as well as the expected maximal discrepancy, a closely related capacity estimate that can be calculated by Monte Carlo integration. Maximal discrepancy penalty functions are appealing for pattern classification problems, since their computation is equivalent to empirical risk minimization over the training data with some labels flipped.
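The last point admits a compact illustration. The sketch below is only a minimal, hedged example, not the exact procedure analyzed in the paper: for binary classification with 0-1 loss, it splits the training data in half, flips the labels on the first half, runs empirical risk minimization over the model class on the relabeled sample, and converts the resulting training error into a maximal-discrepancy estimate. The routine `fit_erm` is a hypothetical stand-in for any empirical risk minimizer over the class under consideration.

```python
import numpy as np

def maximal_discrepancy_penalty(X, y, fit_erm):
    """Estimate a maximal-discrepancy penalty for a model class.

    `fit_erm(X, y)` is assumed (hypothetically) to return a classifier,
    i.e. a callable mapping inputs to {0, 1} labels, that approximately
    minimizes the empirical 0-1 error over the model class.
    """
    n = (len(y) // 2) * 2                      # use an even number of points
    X, y = X[:n], np.asarray(y[:n])

    # Flip the labels on the first half of the sample.
    y_flipped = y.copy()
    y_flipped[: n // 2] = 1 - y_flipped[: n // 2]

    # Empirical risk minimization on the relabeled sample.
    f = fit_erm(X, y_flipped)
    err_flipped = np.mean(f(X) != y_flipped)   # its empirical 0-1 error

    # max over the class of [error on first half - error on second half]
    # equals 1 - 2 * (minimal 0-1 error on the relabeled sample).
    return 1.0 - 2.0 * err_flipped
```

Under the same assumptions, the expected maximal discrepancy mentioned above could be approximated by averaging this quantity over several random permutations of the sample, which is the Monte Carlo calculation the abstract refers to.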
Cite this article
Bartlett, P.L., Boucheron, S. & Lugosi, G. Model Selection and Error Estimation. Machine Learning 48, 85–113 (2002). https://doi.org/10.1023/A:1013999503812