Abstract
We study model selection strategies based on penalized empirical loss minimization. We point out a tight relationship between error estimation and data-based complexity penalization: any good error estimate may be converted into a data-based penalty function, and the performance of the estimate is governed by the quality of the error estimate. We consider several penalty functions, involving error estimates on independent test data, empirical VC dimension, empirical VC entropy, and margin-based quantities. We also consider the maximal difference between the error on the first half of the training data and the error on the second half, as well as the expected maximal discrepancy, a closely related capacity estimate that can be calculated by Monte Carlo integration. Maximal discrepancy penalty functions are appealing for pattern classification problems, since their computation is equivalent to empirical risk minimization over the training data with some labels flipped.
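The last point admits a compact illustration. The sketch below is only a minimal, hedged example, not the exact procedure analyzed in the paper: for binary classification with 0-1 loss, it splits the training data in half, flips the labels on the first half, runs empirical risk minimization over the model class on the relabeled sample, and converts the resulting training error into a maximal-discrepancy estimate. The routine `fit_erm` is a hypothetical stand-in for any empirical risk minimizer over the class under consideration.

```python
import numpy as np

def maximal_discrepancy_penalty(X, y, fit_erm):
    """Estimate a maximal-discrepancy penalty for a model class.

    `fit_erm(X, y)` is assumed (hypothetically) to return a classifier,
    i.e. a callable mapping inputs to {0, 1} labels, that approximately
    minimizes the empirical 0-1 error over the model class.
    """
    n = (len(y) // 2) * 2                      # use an even number of points
    X, y = X[:n], np.asarray(y[:n])

    # Flip the labels on the first half of the sample.
    y_flipped = y.copy()
    y_flipped[: n // 2] = 1 - y_flipped[: n // 2]

    # Empirical risk minimization on the relabeled sample.
    f = fit_erm(X, y_flipped)
    err_flipped = np.mean(f(X) != y_flipped)   # its empirical 0-1 error

    # max over the class of [error on first half - error on second half]
    # equals 1 - 2 * (minimal 0-1 error on the relabeled sample).
    return 1.0 - 2.0 * err_flipped
```

Under the same assumptions, the expected maximal discrepancy mentioned above could be approximated by averaging this quantity over several random permutations of the sample, which is the Monte Carlo calculation the abstract refers to.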
Cite this article
Bartlett, P.L., Boucheron, S. & Lugosi, G. Model Selection and Error Estimation. Machine Learning 48, 85–113 (2002). https://doi.org/10.1023/A:1013999503812