DOI: 10.5555/3635637.3663153

Difference of Convex Functions Programming for Policy Optimization in Reinforcement Learning

Published: 06 May 2024

Abstract

We formulate the problem of optimizing an agent's policy within the Markov decision process (MDP) model as a difference-of-convex functions (DC) program. The DC perspective enables optimizing the policy iteratively, where each iteration constructs an easier-to-optimize lower bound on the value function using the well-known concave-convex procedure. We show that several popular policy-gradient-based deep RL algorithms (for both discrete and continuous state and action spaces, and for both stochastic and deterministic policies), such as actor-critic, deterministic policy gradient (DPG), and soft actor-critic (SAC), can be derived from the DC perspective. Additionally, the DC formulation enables more sample-efficient learning approaches by exploiting the structure of the value function lower bound and, when the policy has a simpler parametric form, allows the use of efficient nonlinear programming solvers. Furthermore, we show that the DC perspective extends easily to constrained RL and to partially observable and multiagent settings. Such connections provide new insight into previous algorithms and also help develop new algorithms for RL.
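For context, here is a minimal sketch of the concave-convex procedure (CCP) the abstract invokes [12, 22]; the objective $f$ and the split $f = v - u$ over policy parameters $\theta$ below are generic placeholders, not notation taken from the paper itself. Given a DC objective

$$ f(\theta) = v(\theta) - u(\theta), \qquad u, v \text{ convex}, $$

linearizing $v$ at the current iterate $\theta_t$ yields a concave global lower bound

$$ \hat{f}_t(\theta) = v(\theta_t) + \nabla v(\theta_t)^\top (\theta - \theta_t) - u(\theta) \;\le\; f(\theta), $$

and the update $\theta_{t+1} \in \arg\max_\theta \hat{f}_t(\theta)$ improves monotonically, since $f(\theta_{t+1}) \ge \hat{f}_t(\theta_{t+1}) \ge \hat{f}_t(\theta_t) = f(\theta_t)$. This is the sense in which each iteration constructs an "easier-to-optimize lower bound" on the objective.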

References

[1]
Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. 2000. The Complexity of Decentralized Control of Markov Decision Processes. In Conference on Uncertainty in Artificial Intelligence.
[2]
Dimitri P. Bertsekas. 1999. Nonlinear Programming. Athena Scientific.
[3]
Abhinav Bhatia, Pradeep Varakantham, and Akshat Kumar. 2019. Resource Constrained Deep Reinforcement Learning. In International Conference on Automated Planning and Scheduling. 610--620.
[4]
Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, and Shimon Whiteson. 2019. VIREL: A Variational Inference Framework for Reinforcement Learning. In Advances in Neural Information Processing Systems. arXiv:1811.01132
[5]
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning. 2171--2186. arXiv:1702.08165
[6]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In International Conference on Machine Learning. 2976--2989. arXiv:1801.01290
[7]
Vijay R. Konda and John N. Tsitsiklis. 2000. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems. 1008--1014. https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf
[8]
Akshat Kumar and Shlomo Zilberstein. 2010. MAP Estimation for Graphical Models by Likelihood Maximization. In Advances in Neural Information Processing Systems. 1180--1188.
[9]
Akshat Kumar, Shlomo Zilberstein, and Marc Toussaint. 2015. Probabilistic Inference Techniques for Scalable Multiagent Decision Making. Journal of Artificial Intelligence Research, Vol. 53 (2015), 223--270. https://doi.org/10.1613/jair.4649
[10]
Sergey Levine. 2018. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. CoRR, Vol. abs/1805.00909 (2018). arXiv:1805.00909
[11]
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations. arXiv:1509.02971
[12]
Thomas Lipp and Stephen Boyd. 2016. Variations and Extension of the Convex-Concave Procedure. Optimization and Engineering, Vol. 17, 2 (2016), 263--287. https://doi.org/10.1007/s11081-015-9294-x
[13]
Bilal Piot, Matthieu Geist, and Olivier Pietquin. 2014. Difference of Convex Functions Programming for Reinforcement Learning. In Advances in Neural Information Processing Systems. 2519--2527.
[14]
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. 2015. Gradient Estimation Using Stochastic Computation Graphs. In Advances in Neural Information Processing Systems. 3528--3536. arXiv:1506.05254
[15]
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In International Conference on Machine Learning. 605--619.
[16]
Arambam James Singh, Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. 2019. Multiagent Decision Making for Maritime Traffic Management. In AAAI Conference on Artificial Intelligence. 6171--6178.
[17]
Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
[18]
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 2000. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems. 1057--1063.
[19]
Marc Toussaint and Amos J. Storkey. 2006. Probabilistic Inference for Solving Discrete and Continuous State Markov Decision Processes. In International Conference on Machine Learning. 945--952. https://doi.org/10.1145/1143844.1143963
[20]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. 2015. Human-Level Control through Deep Reinforcement Learning. Nature, Vol. 518, 7540 (2015), 529--533.
[21]
Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-Learning. Machine Learning, Vol. 8 (1992), 279--292.
[22]
A. L. Yuille and Anand Rangarajan. 2003. The Concave-Convex Procedure. Neural Computation, Vol. 15, 4 (2003), 915--936. https://doi.org/10.1162/08997660360581958


Information

Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC

Publication History

Published: 06 May 2024

      Author Tags

      1. dc programming
      2. policy gradient
      3. reinforcement learning

      Qualifiers

      • Extended-abstract

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate: 1,155 of 5,036 submissions, 23%

