DOI: 10.5555/3327144.3327261
Article
Free access

Actor-critic policy optimization in partially observable multiagent environments

Published: 03 December 2018

Abstract

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.
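As a concrete illustration of the regret-minimization foundation the abstract refers to, the sketch below runs regret matching in self-play on a one-shot zero-sum game (rock-paper-scissors). This is a minimal, hedged example, not the paper's actor-critic update rules: the payoff matrix, function names, and iteration count are illustrative assumptions; the only property relied on is the standard result that the time-averaged policies of no-regret learners approach a Nash equilibrium in zero-sum games.

```python
# Minimal sketch (not the paper's method): regret matching in self-play on
# rock-paper-scissors, a one-shot zero-sum game. The time-averaged policies
# of no-regret learners converge to a Nash equilibrium in zero-sum games.
import numpy as np

# Row player's payoff matrix; the column player receives the negation.
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def regret_matching_policy(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive regret yet: fall back to the uniform policy.
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def self_play(iterations=20000):
    n = PAYOFF.shape[0]
    cum_regret = [np.zeros(n), np.zeros(n)]
    avg_policy = [np.zeros(n), np.zeros(n)]
    for _ in range(iterations):
        pi = [regret_matching_policy(r) for r in cum_regret]
        # Expected payoff of each pure action against the opponent's policy.
        action_values = [PAYOFF @ pi[1], -(pi[0] @ PAYOFF)]
        for p in (0, 1):
            value = pi[p] @ action_values[p]           # current expected value
            cum_regret[p] += action_values[p] - value  # instantaneous regrets
            avg_policy[p] += pi[p]
    return [a / iterations for a in avg_policy]

if __name__ == "__main__":
    averages = self_play()
    # Both averaged policies should be close to the uniform Nash (1/3, 1/3, 1/3).
    print(averages)
```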

Cited By

  • (2019) Non-cooperative inverse reinforcement learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 9487-9497. 10.5555/3454287.3455138.
  • (2019) Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 464-470. 10.5555/3367032.3367099.


Information

Published In

NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems
December 2018
11021 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

