Abstract
We present a simple trick for obtaining an approximate estimate of the weight decay parameter λ. The method combines early stopping and weight decay into the estimate
\( \hat\lambda = \|\nabla E(W_{\mathrm{es}})\| / \|2W_{\mathrm{es}}\|, \)
where \(W_{\mathrm{es}}\) is the set of weights at the early-stopping point and \(E(W)\) is the training-data fit error.
The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real-life data sets. The results show that \(\hat\lambda\) estimates the optimal weight decay parameter as well as the standard search does, but is orders of magnitude quicker to compute.
The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.
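The estimate is cheap because everything it needs is already at hand when early stopping halts training: the weights \(W_{\mathrm{es}}\) and the gradient \(\nabla E(W_{\mathrm{es}})\). One way to read the formula is that \(W_{\mathrm{es}}\) is treated as an approximate stationary point of the penalized cost \(E(W) + \lambda\|W\|^2\), where \(\nabla E(W_{\mathrm{es}}) + 2\lambda W_{\mathrm{es}} \approx 0\), so matching the norms of the two terms yields \(\hat\lambda\). Below is a minimal sketch in Python, assuming the weights and gradient are available as flat arrays; the function and variable names are illustrative, not from the chapter.

import numpy as np

def estimate_weight_decay(grad_e_es, w_es):
    # lambda-hat = ||grad E(W_es)|| / ||2 W_es||, computed from the
    # training-error gradient and the weights at the early-stopping point.
    return float(np.linalg.norm(grad_e_es) / np.linalg.norm(2.0 * w_es))

# Toy usage with made-up numbers: training was stopped early at w_es,
# where the gradient of the training error was g.
w_es = np.array([0.8, -1.2, 0.3, 0.5])
g = np.array([0.02, -0.05, 0.01, 0.03])
print(f"lambda-hat = {estimate_weight_decay(g, w_es):.4g}")

The returned value would then serve as the coefficient on the \(\|W\|^2\) penalty when retraining with weight decay, in place of a cross-validation search over \(\lambda\).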
Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN 978-3-540-65311-0 (1998).
© 2012 Springer-Verlag Berlin Heidelberg
Cite this chapter
Rögnvaldsson, T.S. (2012). A Simple Trick for Estimating the Weight Decay Parameter. In: Montavon, G., Orr, G.B., Müller, KR. (eds) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol 7700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35288-1
Online ISBN: 978-3-642-35289-8