Open access

Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

Authors:

Adam Wierman

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 1

Article No.: 13, Pages 1 - 51

https://doi.org/10.1145/3579443

Published: 02 March 2023

Abstract

We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its κ-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in κ. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing κ. Numerical simulations demonstrate the effectiveness of LPI.
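To make the locality restriction concrete, the following is a minimal sketch of how an agent's κ-hop neighborhood could be computed on an interaction graph. It is not the LPI algorithm from the paper; it only illustrates the set of agents whose information a given agent would attend to for a chosen κ. The function name `k_hop_neighborhood`, the adjacency-dictionary representation, and the six-agent line graph are all assumptions made for illustration.

```python
from collections import deque


def k_hop_neighborhood(adjacency, source, kappa):
    """Return the set of nodes within `kappa` hops of `source`, including `source`.

    `adjacency` maps each node to an iterable of its direct neighbors
    (a hypothetical representation chosen for this sketch).
    """
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            # Nodes at distance kappa are included but not expanded further.
            continue
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen


# Hypothetical example: six agents arranged on a line, 0-1-2-3-4-5.
line_graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

for kappa in range(4):
    print(kappa, sorted(k_hop_neighborhood(line_graph, 2, kappa)))
```

On this line graph the neighborhood of agent 2 grows by at most two agents per unit increase in κ, whereas on denser graphs it can grow much faster. This is one way to picture the trade-off the abstract describes: a larger κ gives each agent more information at a higher computational cost.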

Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 7, Issue 1

March 2023

749 pages

EISSN: 2476-1249

DOI: 10.1145/3586099

Editors: Augustin Chaintreau (Columbia University), Leana Golubchik (University of Southern California, United States), Zhi-Li Zhang (University of Minnesota, United States)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF (National Science Foundation)
