Risk-constrained reinforcement learning with percentile risk criteria

Published: 01 January 2017
    Abstract

    In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate this gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.
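
    The abstract outlines a Lagrangian, primal-dual scheme: estimate the gradient of the Lagrangian, move the policy parameters in the descent direction, and move the Lagrange multiplier in the ascent direction. Below is a minimal, self-contained Python sketch of that structure for a CVaR constraint, written via the Rockafellar-Uryasev representation of CVaR. It is not the paper's actor-critic algorithm; the one-step Gaussian-policy cost model, the constants ALPHA, BETA, SIGMA, and the step-size schedule are illustrative assumptions introduced only for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    ALPHA = 0.95      # CVaR confidence level (assumed for illustration)
    BETA = 1.5        # constraint threshold on CVaR of the cost (assumed)
    N = 256           # Monte Carlo samples per iteration
    SIGMA = 0.5       # std of the Gaussian policy

    def sample(theta, n):
        """Toy stochastic cost under a Gaussian policy a ~ N(theta, SIGMA^2)."""
        a = theta + SIGMA * rng.normal(size=n)
        cost = (a - 1.0) ** 2 + rng.exponential(0.5, size=n)   # noisy cost C
        score = (a - theta) / SIGMA ** 2                        # d/dtheta log pi(a | theta)
        return cost, score

    theta, nu, lam = 0.0, 0.0, 0.1   # policy parameter, CVaR auxiliary variable, multiplier

    for t in range(2000):
        cost, score = sample(theta, N)
        excess = np.maximum(cost - nu, 0.0)                     # (C - nu)^+
        # Lagrangian (Rockafellar-Uryasev form of CVaR):
        #   L = E[C] + lam * ( nu + E[(C - nu)^+] / (1 - ALPHA) - BETA )
        g_theta = np.mean(score * (cost + lam * excess / (1 - ALPHA)))   # likelihood-ratio gradient
        g_nu = lam * (1.0 - np.mean(cost > nu) / (1 - ALPHA))            # subgradient w.r.t. nu
        g_lam = nu + np.mean(excess) / (1 - ALPHA) - BETA                # constraint violation

        step = 0.5 / (1.0 + t) ** 0.6
        theta -= step * g_theta                       # descent on the policy parameter
        nu    -= step * g_nu                          # descent on the CVaR auxiliary variable
        lam    = max(0.0, lam + 0.1 * step * g_lam)   # projected ascent on the multiplier

    print(f"theta = {theta:.3f}, nu (approx. VaR) = {nu:.3f}, lambda = {lam:.3f}")

    The auxiliary variable nu is updated jointly with the policy because, in the Rockafellar-Uryasev representation, CVaR_alpha(C) = min_nu { nu + E[(C - nu)^+] / (1 - alpha) }; at convergence nu approximates the value-at-risk, and the bracketed term approximates the CVaR value that the constraint bounds by BETA.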

    Published In

    The Journal of Machine Learning Research, Volume 18, Issue 1
    January 2017
    8830 pages
    ISSN: 1532-4435
    EISSN: 1533-7928

    Publisher

    JMLR.org

    Publication History

    Revised: 01 April 2017
    Published: 01 January 2017
    Published in JMLR Volume 18, Issue 1

    Author Tags

    1. Markov decision process
    2. actor-critic algorithms
    3. chance-constrained optimization
    4. conditional value-at-risk
    5. policy gradient algorithms
    6. reinforcement learning
