Abstract
This paper introduces a new family of deterministic and stochastic on-line prediction algorithms that perform well with respect to general loss functions, and analyzes their behavior in terms of expected loss bounds. The algorithms use parametric probabilistic models regardless of the kind of loss function used. Their key idea is to iteratively estimate the probabilistic model by the maximum likelihood method and then to construct the prediction function that minimizes the expected loss under the estimated model; a future outcome is then predicted with this optimal prediction function. We analyze the algorithms for the cases where the target distribution is 1) k-dimensional parametric with k known, 2) k-dimensional parametric with k unknown, and 3) non-parametric. For all three cases we derive upper bounds on the expected instantaneous or cumulative losses of the algorithms with respect to a large family of loss functions satisfying the constraint introduced by Merhav and Feder. These loss bounds exhibit new universal relations among the expected prediction accuracy, the indices of the loss function, the complexity of the target rule, and the number of training examples.
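As a concrete illustration of this scheme, here is a minimal Python sketch for the simplest instance: binary outcomes under a one-parameter Bernoulli model. It is our illustration, not the paper's construction (which covers general parametric models and loss functions): the function name online_ml_predictions, the Laplace smoothing that keeps the first rounds well defined, and the grid search over candidate predictions are all assumptions made for readability.

```python
import numpy as np

def online_ml_predictions(outcomes, loss, candidates=None):
    """At each round, fit the model by maximum likelihood on the past
    outcomes, then predict the value minimizing the expected loss under
    the estimated distribution (illustrative sketch, not the paper's code)."""
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)
    preds = []
    for t in range(len(outcomes)):
        past = outcomes[:t]
        # Maximum likelihood estimate of the Bernoulli parameter,
        # Laplace-smoothed so that the first rounds are well defined.
        p_hat = (sum(past) + 1.0) / (len(past) + 2.0)
        # Expected loss of each candidate prediction under the estimate.
        exp_losses = [p_hat * loss(1, c) + (1.0 - p_hat) * loss(0, c)
                      for c in candidates]
        preds.append(float(candidates[int(np.argmin(exp_losses))]))
    return preds

# Example with the squared loss, whose expected-loss minimizer is the
# estimated mean itself; other loss functions change the minimizer.
squared = lambda y, yhat: (y - yhat) ** 2
print(online_ml_predictions([1, 0, 1, 1, 0, 1], squared))
```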
References
Algoet, P.H.: The strong law of large numbers for sequential decisions under uncertainty. IEEE Trans. Inform. Theory IT-40 (1994) 609–633.
Amari, S., Murata, N.: Statistical theory of learning curves under entropic loss criterion. Neural Computation 5 (1993) 140–153.
Barron, A.R., Cover, T.M.: Minimum complexity density estimation. IEEE Trans. Inform. Theory IT-37 (1991) 1034–1054.
Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer-Verlag (1980).
Cencov, N.N.: Evaluation of an unknown distribution density from observations. Soviet Math. 3 (1962) 1559–1562.
Cesa-Bianchi, N., Freund, Y., Helmbold, D.P., Haussler, D., Schapire, R., Warmuth, M.K.: How to use expert advice. Proc. of The Twenty-fifth ACM Symposium on Theory of Computing, ACM Press (1993) 429–438.
Clarke, B., Barron, A.: Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory IT-36 (1990) 453–471.
Dawid, A.: Statistical theory: the prequential approach. J. R. Stat. Soc. A 147 (1984) 278–292.
DeSantis, A., Markowsky, G., Wegman, M.N.: Learning probabilistic prediction functions. Proc. of the First Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1988) 312–328.
Fisher, R.A.: Statistical Methods and Scientific Inference. Oliver and Boyd (1956).
Haussler, D., Littlestone, N., Warmuth, M.K.: Predicting {0,1}-functions on randomly drawn points. Proc. of the First Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1988) 312–328.
Haussler, D., Barron, A.: How well does the Bayes method work in on-line predictions of {+1,-1}-values? Proc. of the Third NEC Symposium, SIAM (1992) 74–100.
Haussler, D., Kearns, M., Schapire, R.: Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Proc. of the Fourth Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1991) 61–74.
Herrndorf, N.: Best Φ- and N_Φ-approximants in Orlicz spaces of vector valued functions. Z. Wahrscheinlichkeitstheorie verw. Gebiete 58 (1981) 309–329.
Kearns, M., Schapire, R.: Efficient distribution-free learning of probabilistic concepts. J. of Computer and System Sciences 48 (1994) 464–497.
Kivinen, J., Warmuth, M.K.: Using experts for predicting continuous outcomes. Computational Learning Theory: EuroCOLT '93, Oxford (1994) 109–120.
Kullback, S.: A lower bound for discrimination in terms of variation. IEEE Trans. Inform. Theory IT-13 (1967) 126–127.
LeCam, L.: On some asymptotic properties of maximum likelihood estimates and related Bayes estimates. Univ. California Publ. Stat. 1 (1953) 277–330.
LeCam, L.: On the asymptotics used to prove asymptotic normality of maximum likelihood estimates. Ann. Math. Stat. 41 (1970) 802–828.
Littlestone, N.: Learning quickly when irrelevant attributes abound: a new linear threshold algorithm. Machine Learning 2 (1988) 285–318.
Merhav, N., Feder, M.: Universal sequential learning and decision from individual data sequences. Proc. of the Fifth Annual ACM Workshop on Computational Learning Theory, ACM Press (1992) 413–427.
Rissanen, J.: Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory IT-30 (1984) 629–636.
Rissanen, J.: Stochastic complexity. J. R. Stat. Soc. B 49 (1987) 223–239.
Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Series in Computer Science 15 (1989).
Takeuchi, K.: Asymptotic Theory of Statistical Estimation. (In Japanese) Kyooiku Publishers (1974).
Vovk, V.G.: Aggregating Strategies. Proc. of the Third Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1990) 371–386.
Yamanishi, K.: A loss bound model for on-line stochastic prediction strategies. Proc. of the Fourth Annual Workshop on Computational Learning Theory, Morgan Kaufmann (1991) 290–302.
Yamanishi, K.: A learning criterion for stochastic rules. Machine Learning 9 (1992) 165–203 (special issue for COLT '90).
Yamanishi, K.: A loss bound model for on-line stochastic prediction algorithms. Inform. Comput. (to appear).
Copyright information
© 1995 Springer-Verlag Berlin Heidelberg
Cite this paper
Yamanishi, K. (1995). On-line maximum likelihood prediction with respect to general loss functions. In: Vitányi, P. (eds) Computational Learning Theory. EuroCOLT 1995. Lecture Notes in Computer Science, vol 904. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-59119-2_170
Print ISBN: 978-3-540-59119-1
Online ISBN: 978-3-540-49195-8