
Parametric return density estimation for reinforcement learning

Published: 08 July 2010
Abstract

Most conventional Reinforcement Learning (RL) algorithms aim to optimize decision-making rules in terms of the expected return. However, especially for risk management purposes, other risk-sensitive criteria such as the value-at-risk or the expected shortfall are sometimes preferred in real applications. Here, we describe a parametric method for estimating the density of returns, which allows us to handle various criteria in a unified manner. We first extend the Bellman equation for the conditional expected return to cover a conditional probability density of returns. We then derive an extension of the TD-learning algorithm for estimating the return densities in an unknown environment. As test instances, several parametric density estimation algorithms are presented for the Gaussian, Laplace, and skewed Laplace distributions. Through numerical experiments, we show that these algorithms lead to risk-sensitive as well as robust RL paradigms.
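The method can be summarized by a Bellman equation over return densities rather than expected returns. As a rough illustration in standard distributional-RL notation (this notation is assumed here, not taken from the paper), suppose the immediate reward r(s, a) is deterministic and the discount factor satisfies 0 < \gamma < 1; the conditional return density p^\pi(\eta \mid s, a) then satisfies

    p^\pi(\eta \mid s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\; a' \sim \pi(\cdot \mid s')} \left[ \frac{1}{\gamma}\, p^\pi\!\left( \frac{\eta - r(s, a)}{\gamma} \,\Big|\, s', a' \right) \right],

which is simply the density form of the recursion "return = reward + \gamma × next return". A TD-style learner fits the parameters of a chosen density family to bootstrapped targets built from this recursion.

The sketch below is a minimal toy version of the Gaussian case, assuming a tabular state space and a simple moment-matching TD update; it is illustrative only, not the authors' algorithm (which handles the Gaussian, Laplace, and skewed Laplace families through more careful parametric updates), and every name in it is hypothetical.

    # Toy sketch (assumed, not the paper's algorithm): tabular TD-style tracking of
    # the mean and second moment of the return under a per-state Gaussian model.
    import numpy as np

    class GaussianReturnTD:
        def __init__(self, n_states, gamma=0.95, alpha=0.1):
            self.mu = np.zeros(n_states)   # estimated mean return per state
            self.m2 = np.ones(n_states)    # estimated second moment of the return
            self.gamma = gamma
            self.alpha = alpha

        def update(self, s, r, s_next):
            # Bootstrapped Gaussian target: if the next-state return has mean mu'
            # and variance var', then r + gamma * G(s') has mean r + gamma * mu'
            # and second moment (r + gamma * mu')^2 + gamma^2 * var'.
            var_next = self.m2[s_next] - self.mu[s_next] ** 2
            target_mu = r + self.gamma * self.mu[s_next]
            target_m2 = target_mu ** 2 + self.gamma ** 2 * var_next
            # TD updates on the first two moments.
            self.mu[s] += self.alpha * (target_mu - self.mu[s])
            self.m2[s] += self.alpha * (target_m2 - self.m2[s])

        def variance(self, s):
            # Variance recovered from the tracked moments, clipped to stay positive.
            return max(self.m2[s] - self.mu[s] ** 2, 1e-8)

Given such an estimate, risk-sensitive criteria follow directly from the fitted density: under the Gaussian model, for instance, the alpha-quantile of the return is mu(s) + Phi^{-1}(alpha) * sigma(s), which can serve as a value-at-risk-style criterion.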






      Published In

      UAI'10: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence
      July 2010
      751 pages
      ISBN: 9780974903965
      Editors: Peter Grünwald, Peter Spirtes

      Publisher

      AUAI Press

      Arlington, Virginia, United States


      Qualifiers

      • Article


      Cited By

      • (2023) Distributional model equivalence for risk-sensitive reinforcement learning. Proceedings of the 37th International Conference on Neural Information Processing Systems, 10.5555/3666122.3668589, 56531-56552. Online publication date: 10-Dec-2023.
      • (2023) Invariance in policy optimisation and partial identifiability in reward learning. Proceedings of the 40th International Conference on Machine Learning, 10.5555/3618408.3619736, 32033-32058. Online publication date: 23-Jul-2023.
      • (2023) The statistical benefits of quantile temporal-difference learning for value estimation. Proceedings of the 40th International Conference on Machine Learning, 10.5555/3618408.3619622, 29210-29231. Online publication date: 23-Jul-2023.
      • (2023) Policy Fairness and Unknown Bias Dynamics in Sequential Allocations. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 10.1145/3617694.3623262, 1-10. Online publication date: 30-Oct-2023.
      • (2022) Distributional reinforcement learning for risk-sensitive policies. Proceedings of the 36th International Conference on Neural Information Processing Systems, 10.5555/3600270.3602516, 30977-30989. Online publication date: 28-Nov-2022.
      • (2022) The nature of temporal difference errors in multi-step distributional reinforcement learning. Proceedings of the 36th International Conference on Neural Information Processing Systems, 10.5555/3600270.3602464, 30265-30276. Online publication date: 28-Nov-2022.
      • (2019) Value function in frequency domain and the characteristic value iteration algorithm. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 10.5555/3454287.3455613, 14808-14819. Online publication date: 8-Dec-2019.
      • (2018) Exploration by distributional reinforcement learning. Proceedings of the 27th International Joint Conference on Artificial Intelligence, 10.5555/3304889.3305037, 2710-2716. Online publication date: 13-Jul-2018.
      • (2017) A Distributional Perspective on Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning - Volume 70, 10.5555/3305381.3305428, 449-458. Online publication date: 6-Aug-2017.
