DOI: 10.5555/3635637.3663153

Difference of Convex Functions Programming for Policy Optimization in Reinforcement Learning

Published: 06 May 2024

Abstract

We formulate the problem of optimizing an agent's policy within the Markov decision process (MDP) model as a difference-of-convex functions (DC) program. The DC perspective enables optimizing the policy iteratively, where each iteration constructs an easier-to-optimize lower bound on the value function using the well-known concave-convex procedure. We show that several popular policy-gradient-based deep RL algorithms (for both discrete and continuous state and action spaces, and for both stochastic and deterministic policies), such as actor-critic, deterministic policy gradient (DPG), and soft actor-critic (SAC), can be derived from the DC perspective. Additionally, the DC formulation enables more sample-efficient learning approaches by exploiting the structure of the value function lower bound and, when the policy has a simpler parametric form, allows the use of efficient nonlinear programming solvers. Furthermore, we show that the DC perspective extends easily to constrained RL and to partially observable and multiagent settings. Such connections provide new insight into previous algorithms and also help develop new algorithms for RL.
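For context, here is a minimal sketch of the concave-convex procedure (CCP) the abstract invokes [12, 22]; the objective $f$ and the split $f = v - u$ over policy parameters $\theta$ below are generic placeholders, not notation taken from the paper itself. Given a DC objective

$$ f(\theta) = v(\theta) - u(\theta), \qquad u, v \text{ convex}, $$

linearizing $v$ at the current iterate $\theta_t$ yields a concave global lower bound

$$ \hat{f}_t(\theta) = v(\theta_t) + \nabla v(\theta_t)^\top (\theta - \theta_t) - u(\theta) \;\le\; f(\theta), $$

and the update $\theta_{t+1} \in \arg\max_\theta \hat{f}_t(\theta)$ improves monotonically, since $f(\theta_{t+1}) \ge \hat{f}_t(\theta_{t+1}) \ge \hat{f}_t(\theta_t) = f(\theta_t)$. This is the sense in which each iteration constructs an "easier-to-optimize lower bound" on the objective.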

References

[1]
Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. 2000. The Complexity of Decentralized Control of Markov Decision Processes. In Conference on Uncertainty in Artificial Intelligence.
[2]
Dimitri P. Bertsekas. 1999. Nonlinear Programming. Athena Scientific.
[3]
Abhinav Bhatia, Pradeep Varakantham, and Akshat Kumar. 2019. Resource Constrained Deep Reinforcement Learning. In International Conference on Automated Planning and Scheduling. 610--620.
[4]
Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, and Shimon Whiteson. 2019. VIREL: A Variational Inference Framework for Reinforcement Learning. In Advances in Neural Information Processing Systems. arXiv:1811.01132
[5]
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning. 2171--2186. arXiv:1702.08165
[6]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning. 2976--2989. arXiv:1801.01290
[7]
Vijay R. Konda and John N. Tsitsiklis. 2000. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems. 1008--1014. https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf
[8]
Akshat Kumar and Shlomo Zilberstein. 2010. MAP Estimation for Graphical Models by Likelihood Maximization. In Advances in Neural Information Processing Systems. 1180--1188.
[9]
Akshat Kumar, Shlomo Zilberstein, and Marc Toussaint. 2015. Probabilistic Inference Techniques for Scalable Multiagent Decision Making. Journal of Artificial Intelligence Research, Vol. 53 (2015), 223--270. https://doi.org/10.1613/jair.4649
[10]
Sergey Levine. 2018. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. CoRR, Vol. abs/1805.00909 (2018). arXiv:1805.00909
[11]
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations. arXiv:1509.02971
[12]
Thomas Lipp and Stephen Boyd. 2016. Variations and Extension of the Convex-Concave Procedure. Optimization and Engineering, Vol. 17, 2 (2016), 263--287. https://doi.org/10.1007/s11081-015-9294-x
[13]
Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2014. Difference of Convex Functions Programming for Reinforcement Learning. In Advances in Neural Information Processing Systems. 2519--2527.
[14]
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient Estimation Using Stochastic Computation Graphs. In Advances in Neural Information Processing Systems. 3528--3536. arXiv:1506.05254
[15]
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In International Conference on Machine Learning. 605--619.
[16]
Arambam James Singh, Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. 2019. Multiagent Decision Making for Maritime Traffic Management. In AAAI Conference on Artificial Intelligence. 6171--6178.
[17]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
[18]
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems. 1057--1063.
[19]
Marc Toussaint and Amos J. Storkey. 2006. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In International Conference on Machine Learning. 945--952. https://doi.org/10.1145/1143844.1143963
[20]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, Vol. 518, 7540 (2015), 529--533.
[21]
Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-Learning. Machine Learning, Vol. 8 (1992), 279--292.
[22]
A. L. Yuille and Anand Rangarajan. 2003. The Concave-Convex Procedure. Neural Computation, Vol. 15, 4 (2003), 915--936. https://doi.org/10.1162/08997660360581958


Information

Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC

Publication History

Published: 06 May 2024

      Author Tags

      1. dc programming
      2. policy gradient
      3. reinforcement learning

      Qualifiers

      • Extended-abstract

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate: 1,155 of 5,036 submissions, 23%

