Risk-constrained reinforcement learning with percentile risk criteria

Published: 01 January 2017
    Abstract

    In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate this gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
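
    The abstract outlines a Lagrangian, primal-dual scheme: estimate the gradient of the Lagrangian, move the policy parameters in the descent direction, and move the Lagrange multiplier in the ascent direction. Below is a minimal, self-contained Python sketch of that structure for a CVaR constraint, written via the Rockafellar-Uryasev representation of CVaR. It is not the paper's actor-critic algorithm; the one-step Gaussian-policy cost model, the constants ALPHA, BETA, SIGMA, and the step-size schedule are illustrative assumptions introduced only for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    ALPHA = 0.95      # CVaR confidence level (assumed for illustration)
    BETA = 1.5        # constraint threshold on CVaR of the cost (assumed)
    N = 256           # Monte Carlo samples per iteration
    SIGMA = 0.5       # std of the Gaussian policy

    def sample(theta, n):
        """Toy stochastic cost under a Gaussian policy a ~ N(theta, SIGMA^2)."""
        a = theta + SIGMA * rng.normal(size=n)
        cost = (a - 1.0) ** 2 + rng.exponential(0.5, size=n)   # noisy cost C
        score = (a - theta) / SIGMA ** 2                        # d/dtheta log pi(a | theta)
        return cost, score

    theta, nu, lam = 0.0, 0.0, 0.1   # policy parameter, CVaR auxiliary variable, multiplier

    for t in range(2000):
        cost, score = sample(theta, N)
        excess = np.maximum(cost - nu, 0.0)                     # (C - nu)^+
        # Lagrangian (Rockafellar-Uryasev form of CVaR):
        #   L = E[C] + lam * ( nu + E[(C - nu)^+] / (1 - ALPHA) - BETA )
        g_theta = np.mean(score * (cost + lam * excess / (1 - ALPHA)))   # likelihood-ratio gradient
        g_nu = lam * (1.0 - np.mean(cost > nu) / (1 - ALPHA))            # subgradient w.r.t. nu
        g_lam = nu + np.mean(excess) / (1 - ALPHA) - BETA                # constraint violation

        step = 0.5 / (1.0 + t) ** 0.6
        theta -= step * g_theta                       # descent on the policy parameter
        nu    -= step * g_nu                          # descent on the CVaR auxiliary variable
        lam    = max(0.0, lam + 0.1 * step * g_lam)   # projected ascent on the multiplier

    print(f"theta = {theta:.3f}, nu (approx. VaR) = {nu:.3f}, lambda = {lam:.3f}")

    The auxiliary variable nu is updated jointly with the policy because, in the Rockafellar-Uryasev representation, CVaR_alpha(C) = min_nu { nu + E[(C - nu)^+] / (1 - alpha) }; at convergence nu approximates the value-at-risk, and the bracketed term approximates the CVaR value that the constraint bounds by BETA.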

    Published In

    The Journal of Machine Learning Research, Volume 18, Issue 1
    January 2017
    8830 pages
    ISSN: 1532-4435
    EISSN: 1533-7928

    Publisher

    JMLR.org

    Publication History

    Revised: 01 April 2017
    Published: 01 January 2017
    Published in JMLR Volume 18, Issue 1

    Author Tags

    1. Markov decision process
    2. actor-critic algorithms
    3. chance-constrained optimization
    4. conditional value-at-risk
    5. policy gradient algorithms
    6. reinforcement learning
