research-article
Open access

Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

Published: 02 March 2023

Abstract

We study a multi-agent reinforcement learning (MARL) problem in which the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its κ-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in κ. In addition, we establish the finite-sample convergence of LPI to the globally optimal policy, which explicitly captures the trade-off between optimality and computational complexity in the choice of κ. Numerical simulations demonstrate the effectiveness of LPI.
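To make the κ-hop restriction concrete, the localized view an agent conditions on can be computed by a breadth-first search over the interaction graph. The sketch below is our own illustration under assumed data structures (an adjacency-list dict; the function name is hypothetical), not the paper's implementation:

```python
from collections import deque

def k_hop_neighborhood(adj, agent, kappa):
    """Collect all agents within kappa hops of `agent` via breadth-first search.

    adj: dict mapping each agent to a list of its neighbors in the network.
    Returns the set of agents at graph distance <= kappa from `agent`.
    """
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand past the kappa-hop boundary
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen

# Line graph 0-1-2-3-4: agent 2's 1-hop view covers agents {1, 2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(k_hop_neighborhood(adj, 2, 1)))  # [1, 2, 3]
```

A larger κ enlarges each agent's view, shrinking the optimality gap (polynomially in κ, per the abstract) at the cost of more computation and communication per agent.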



Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 7, Issue 1
March 2023, 749 pages
EISSN: 2476-1249
DOI: 10.1145/3586099
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. distributed algorithms
        2. machine learning
        3. multi-agent reinforcement learning
        4. networked systems



        Cited By

• (2024) Stability and Regret Bounds on Distributed Truncated Predictive Control for Networked Dynamical Systems. 2024 American Control Conference (ACC), 2604-2611. DOI: 10.23919/ACC60939.2024.10644979. Published 10 Jul 2024.
• (2024) Scalable Reinforcement Learning for Linear-Quadratic Control of Networks. 2024 American Control Conference (ACC), 1813-1818. DOI: 10.23919/ACC60939.2024.10644413. Published 10 Jul 2024.
• (2024) I Know This Looks Bad, But I Can Explain: Understanding When AI Should Explain Actions In Human-AI Teams. ACM Transactions on Interactive Intelligent Systems 14(1), 1-23. DOI: 10.1145/3635474. Published 5 Feb 2024.
• (2024) PmTrack. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(4), 1-30. DOI: 10.1145/3631433. Published 12 Jan 2024.
• (2023) A finite-sample analysis of payoff-based independent learning in zero-sum stochastic games. Proceedings of the 37th International Conference on Neural Information Processing Systems, 75826-75883. DOI: 10.5555/3666122.3669435. Published 10 Dec 2023.
• (2023) Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning. ACM SIGMETRICS Performance Evaluation Review 51(1), 83-84. DOI: 10.1145/3606376.3593545. Published 27 Jun 2023.
• (2023) A Nash Equilibrium Solution for Periodic Double Auctions. 2023 62nd IEEE Conference on Decision and Control (CDC), 209-214. DOI: 10.1109/CDC49753.2023.10383887. Published 13 Dec 2023.
• (2023) Natural Policy Gradient Preserves Spatial Decay Properties for Control of Networked Dynamical Systems. 2023 62nd IEEE Conference on Decision and Control (CDC), 4486-4493. DOI: 10.1109/CDC49753.2023.10383735. Published 13 Dec 2023.
• (2023) Bias Reduced Methods to Q-learning. Neural Information Processing, 378-395. DOI: 10.1007/978-981-99-8132-8_29. Published 26 Nov 2023.
