A Survey on Population-Based Deep Reinforcement Learning
Abstract
1. Introduction
2. Background
2.1. Game Theory
2.2. Multi-Agent Reinforcement Learning
3. Population-Based Deep Reinforcement Learning
3.1. Naive Self-Play
3.2. Fictitious Self-Play
3.3. Population-Play
3.4. Evolution-Based Training Methods
3.5. General Framework
4. Challenges and Hot Topics
4.1. Challenges
4.2. Hot Topics
- AI bots that learn to make decisions the way humans do, making them suitable for tutorials, game hosting, computer opponents, and more.
- AI non-player characters trained to interact with players according to their own character settings, which can be used as virtual hosts, in open-world RPGs, and in other applications.
- AI teammates designed to help and support human players in cooperative games or simulations, providing assistance such as covering fire, healing, or completing objectives.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
- Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144. [Google Scholar] [CrossRef]
- Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
- Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419. [Google Scholar] [CrossRef] [PubMed]
- Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; R Ruiz, F.J.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef]
- Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797. [Google Scholar] [CrossRef]
- Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; Springer: Berlin/Heidelberg, Germany, 2010; pp. 183–221. [Google Scholar]
- Brown, N.; Sandholm, T. Superhuman AI for multiplayer poker. Science 2019, 365, 885–890. [Google Scholar] [CrossRef]
- Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
- Czarnecki, W.M.; Gidel, G.; Tracey, B.; Tuyls, K.; Omidshafiei, S.; Balduzzi, D.; Jaderberg, M. Real world games look like spinning tops. Adv. Neural Inf. Process. Syst. 2020, 33, 17443–17454. [Google Scholar]
- Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Elsevier: Amsterdam, The Netherlands, 2007. [Google Scholar]
- Kuhn, H. Extensive games and the problem of information. In Contributions to the Theory of Games; Princeton University Press: Princeton, NJ, USA, 1953; p. 193. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR: Cambridge, MA, USA, 2018; Volume 80, pp. 1587–1596. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; Everett, R. Collaborating with humans without human data. Adv. Neural Inf. Process. Syst. 2021, 34, 14502–14515. [Google Scholar]
- Lin, F.; Huang, S.; Pearce, T.; Chen, W.; Tu, W.W. TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play. arXiv 2023, arXiv:2302.07515. [Google Scholar]
- Yang, Y.; Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv 2020, arXiv:2011.00583. [Google Scholar]
- de Witt, C.S.; Gupta, T.; Makoviichuk, D.; Makoviychuk, V.; Torr, P.H.; Sun, M.; Whiteson, S. Is independent learning all you need in the starcraft multi-agent challenge? arXiv 2020, arXiv:2011.09533. [Google Scholar]
- Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
- Lanctot, M.; Zambaldi, V.F.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; Graepel, T. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4190–4203. [Google Scholar]
- Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6379–6390. [Google Scholar]
- Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR: Cambridge, MA, USA, 2018; Volume 80, pp. 4295–4304. [Google Scholar]
- Sukhbaatar, S.; Szlam, A.; Fergus, R. Learning Multiagent Communication with Backpropagation. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2244–2252. [Google Scholar]
- Samuel, A.L. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 1959, 3, 210–229. [Google Scholar]
- Brown, G.W. Iterative solution of games by fictitious play. Act. Anal. Prod. Alloc. 1951, 13, 374. [Google Scholar]
- Heinrich, J.; Lanctot, M.; Silver, D. Fictitious self-play in extensive-form games. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 805–813. [Google Scholar]
- Heinrich, J.; Silver, D. Deep reinforcement learning from self-play in imperfect-information games. arXiv 2016, arXiv:1603.01121. [Google Scholar]
- Lockhart, E.; Lanctot, M.; Pérolat, J.; Lespiau, J.; Morrill, D.; Timbers, F.; Tuyls, K. Computing Approximate Equilibria in Sequential Adversarial Games by Exploitability Descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, 10–16 August 2019; Kraus, S., Ed.; pp. 464–470. [Google Scholar] [CrossRef]
- Bansal, T.; Pachocki, J.; Sidor, S.; Sutskever, I.; Mordatch, I. Emergent Complexity via Multi-Agent Competition. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865. [Google Scholar] [CrossRef]
- Yu, C.; Gao, J.; Liu, W.; Xu, B.; Tang, H.; Yang, J.; Wang, Y.; Wu, Y. Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased. arXiv 2023, arXiv:2302.01605. [Google Scholar]
- Jaderberg, M.; Dalibard, V.; Osindero, S.; Czarnecki, W.M.; Donahue, J.; Razavi, A.; Vinyals, O.; Green, T.; Dunning, I.; Simonyan, K.; et al. Population Based Training of Neural Networks. arXiv 2017, arXiv:1711.09846v2. [Google Scholar]
- Majumdar, S.; Khadka, S.; Miret, S.; Mcaleer, S.; Tumer, K. Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination. In 37th International Conference on Machine Learning; Daumé, H., Singh, A., Eds.; PMLR: Cambridge, MA, USA, 2020; Volume 119, pp. 6651–6660. [Google Scholar]
- Khadka, S.; Majumdar, S.; Nassar, T.; Dwiel, Z.; Tumer, E.; Miret, S.; Liu, Y.; Tumer, K. Collaborative Evolutionary Reinforcement Learning. In 36th International Conference on Machine Learning; Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: Cambridge, MA, USA, 2019; Volume 97, pp. 3341–3350. [Google Scholar]
- Gupta, A.; Savarese, S.; Ganguli, S.; Fei-Fei, L. Embodied intelligence via learning and evolution. Nat. Commun. 2021, 12, 5721. [Google Scholar] [CrossRef]
- Balduzzi, D.; Garnelo, M.; Bachrach, Y.; Czarnecki, W.; Perolat, J.; Jaderberg, M.; Graepel, T. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 434–443. [Google Scholar]
- Perez-Nieves, N.; Yang, Y.; Slumbers, O.; Mguni, D.H.; Wen, Y.; Wang, J. Modelling behavioural diversity for learning in open-ended games. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 8514–8524. [Google Scholar]
- McAleer, S.; Lanier, J.B.; Fox, R.; Baldi, P. Pipeline psro: A scalable approach for finding approximate nash equilibria in large games. Adv. Neural Inf. Process. Syst. 2020, 33, 20238–20248. [Google Scholar]
- Muller, P.; Omidshafiei, S.; Rowland, M.; Tuyls, K.; Perolat, J.; Liu, S.; Hennes, D.; Marris, L.; Lanctot, M.; Hughes, E.; et al. A Generalized Training Approach for Multiagent Learning. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Hernandez, D.; Denamganai, K.; Devlin, S.; Samothrakis, S.; Walker, J.A. A comparison of self-play algorithms under a generalized framework. IEEE Trans. Games 2021, 14, 221–231. [Google Scholar] [CrossRef]
- Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 1994, 6, 215–219. [Google Scholar] [CrossRef]
- Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
- Ye, D.; Chen, G.; Zhao, P.; Qiu, F.; Yuan, B.; Zhang, W.; Chen, S.; Sun, M.; Li, X.; Li, S.; et al. Supervised learning achieves human-level performance in moba games: A case study of honor of kings. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 908–918. [Google Scholar] [CrossRef] [PubMed]
- Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; Mordatch, I. Emergent Tool Use From Multi-Agent Autocurricula. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Shamma, J.S.; Arslan, G. Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria. IEEE Trans. Autom. Control 2005, 50, 312–327. [Google Scholar] [CrossRef]
- Siu, H.C.; Peña, J.; Chen, E.; Zhou, Y.; Lopez, V.; Palko, K.; Chang, K.; Allen, R. Evaluation of Human-AI Teams for Learned and Rule-Based Agents in Hanabi. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 16183–16195. [Google Scholar]
- He, J.Z.Y.; Erickson, Z.; Brown, D.S.; Raghunathan, A.; Dragan, A. Learning Representations that Enable Generalization in Assistive Tasks. In Proceedings of the 6th Annual Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
- Liu, S.; Lever, G.; Merel, J.; Tunyasuvunakool, S.; Heess, N.; Graepel, T. Emergent Coordination Through Competition. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Liu, X.; Jia, H.; Wen, Y.; Hu, Y.; Chen, Y.; Fan, C.; Hu, Z.; Yang, Y. Towards unifying behavioral and response diversity for open-ended learning in zero-sum games. Adv. Neural Inf. Process. Syst. 2021, 34, 941–952. [Google Scholar]
- Zhou, M.; Chen, J.; Wen, Y.; Zhang, W.; Yang, Y.; Yu, Y. Efficient Policy Space Response Oracles. arXiv 2022, arXiv:2202.0063v4. [Google Scholar]
- Omidshafiei, S.; Papadimitriou, C.; Piliouras, G.; Tuyls, K.; Rowland, M.; Lespiau, J.B.; Czarnecki, W.M.; Lanctot, M.; Perolat, J.; Munos, R. α-rank: Multi-agent evaluation by evolution. Sci. Rep. 2019, 9, 9937. [Google Scholar] [CrossRef]
- Marris, L.; Muller, P.; Lanctot, M.; Tuyls, K.; Graepel, T. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 7480–7491. [Google Scholar]
- Muller, P.; Rowland, M.; Elie, R.; Piliouras, G.; Pérolat, J.; Laurière, M.; Marinier, R.; Pietquin, O.; Tuyls, K. Learning Equilibria in Mean-Field Games: Introducing Mean-Field PSRO. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2022, Auckland, New Zealand, 9–13 May 2022; Faliszewski, P., Mascardi, V., Pelachaud, C., Taylor, M.E., Eds.; International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2022; pp. 926–934. [Google Scholar] [CrossRef]
- McKee, K.R.; Leibo, J.Z.; Beattie, C.; Everett, R. Quantifying the effects of environment and population diversity in multi-agent reinforcement learning. Auton. Agents Multi-Agent Syst. 2022, 36, 21. [Google Scholar] [CrossRef]
- Črepinšek, M.; Liu, S.H.; Mernik, M. Exploration and exploitation in evolutionary algorithms: A survey. ACM Comput. Surv. (CSUR) 2013, 45, 1–33. [Google Scholar] [CrossRef]
- Garnelo, M.; Czarnecki, W.M.; Liu, S.; Tirumala, D.; Oh, J.; Gidel, G.; van Hasselt, H.; Balduzzi, D. Pick Your Battles: Interaction Graphs as Population-Level Objectives for Strategic Diversity. In Proceedings of the AAMAS’21: 20th International Conference on Autonomous Agents and Multiagent Systems, Virtual Event, UK, 3–7 May 2021; Dignum, F., Lomuscio, A., Endriss, U., Nowé, A., Eds.; ACM: New York, NY, USA, 2021; pp. 1501–1503. [Google Scholar] [CrossRef]
- Zhao, R.; Song, J.; Hu, H.; Gao, Y.; Wu, Y.; Sun, Z.; Wei, Y. Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination. arXiv 2021, arXiv:2112.11701v3. [Google Scholar]
- Lupu, A.; Cui, B.; Hu, H.; Foerster, J. Trajectory diversity for zero-shot coordination. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 7204–7213. [Google Scholar]
- Bai, Y.; Jin, C. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 551–560. [Google Scholar]
- Dinh, L.C.; McAleer, S.M.; Tian, Z.; Perez-Nieves, N.; Slumbers, O.; Mguni, D.H.; Wang, J.; Ammar, H.B.; Yang, Y. Online Double Oracle. arXiv 2021, arXiv:2103.07780v5. [Google Scholar]
- Yin, Q.; Yu, T.; Shen, S.; Yang, J.; Zhao, M.; Huang, K.; Liang, B.; Wang, L. Distributed Deep Reinforcement Learning: A Survey and A Multi-Player Multi-Agent Learning Toolbox. arXiv 2022, arXiv:2212.00253. [Google Scholar]
- Espeholt, L.; Marinier, R.; Stanczyk, P.; Wang, K.; Michalski, M. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Nair, A.; Srinivasan, P.; Blackwell, S.; Alcicek, C.; Fearon, R.; De Maria, A.; Panneershelvam, V.; Suleyman, M.; Beattie, C.; Petersen, S.; et al. Massively parallel methods for deep reinforcement learning. arXiv 2015, arXiv:1507.04296. [Google Scholar]
- Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 1407–1416. [Google Scholar]
- Flajolet, A.; Monroc, C.B.; Beguir, K.; Pierrot, T. Fast population-based reinforcement learning on a single machine. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2022; pp. 6533–6547. [Google Scholar]
- Shih, A.; Sawhney, A.; Kondic, J.; Ermon, S.; Sadigh, D. On the Critical Role of Conventions in Adaptive Human-AI Collaboration. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2019, 4, eaau5872. [Google Scholar] [CrossRef] [PubMed]
- Lee, J.; Hwangbo, J.; Wellhausen, L.; Koltun, V.; Hutter, M. Learning quadrupedal locomotion over challenging terrain. Sci. Robot. 2020, 5, eabc5986. [Google Scholar] [CrossRef] [PubMed]
- Miki, T.; Lee, J.; Hwangbo, J.; Wellhausen, L.; Koltun, V.; Hutter, M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 2022, 7, eabk2822. [Google Scholar] [CrossRef] [PubMed]
- OpenAI; Plappert, M.; Sampedro, R.; Xu, T.; Akkaya, I.; Kosaraju, V.; Welinder, P.; D’Sa, R.; Petron, A.; Pinto, H.P.d.O.; et al. Asymmetric self-play for automatic goal discovery in robotic manipulation. arXiv 2021, arXiv:2101.04882. [Google Scholar]
- Riviere, B.; Hönig, W.; Anderson, M.; Chung, S.J. Neural tree expansion for multi-robot planning in non-cooperative environments. IEEE Robot. Autom. Lett. 2021, 6, 6868–6875. [Google Scholar] [CrossRef]
- Mahjourian, R.; Miikkulainen, R.; Lazic, N.; Levine, S.; Jaitly, N. Hierarchical policy design for sample-efficient learning of robot table tennis through self-play. arXiv 2018, arXiv:1811.12927. [Google Scholar]
- Li, D.; Li, W.; Varakantham, P. Diversity Induced Environment Design via Self-Play. arXiv 2023, arXiv:2302.02119. [Google Scholar]
- Posth, J.A.; Kotlarz, P.; Misheva, B.H.; Osterrieder, J.; Schwendner, P. The applicability of self-play algorithms to trading and forecasting financial markets. Front. Artif. Intell. 2021, 4, 668465. [Google Scholar] [CrossRef]
| Category | Algorithm | Advantages | Disadvantages | Description |
|---|---|---|---|---|
| Naive Self-Play | Naive SP [29] | Simplicity, effectiveness | Overfitting, instability | Playing against a mirrored copy of the agent |
| Fictitious Self-Play | Fictitious play [30] | Flexibility | Exploration requirement | Players choose the best response to a uniform mixture of all previous policies at each iteration |
| | FSP [31] | Robust and efficient learning | Exploration requirement, sensitive to initialization | Extending FP to extensive-form games |
| | NFSP [32] | Handling complex environments, high scalability | Instability, hyperparameter tuning | Combining FSP with neural-network function approximation |
| | ED [33] | Efficient, no average strategies required | Exploration requirement, scalability issues | Directly optimizing policies against worst-case opponents |
| | δ-Uniform FSP [34] | Simplicity | Limited exploration | Learning a policy that can beat older versions of itself sampled uniformly at random |
| | Prioritized FSP [5] | Simplicity | Limited exploration | Sampling opponent policies according to their expected win rate |
| Population-Play | PP [35] | Exploration, diversity | Inefficiency | Training a population of agents, all of whom interact with each other |
| | FCP [20] | Zero-shot collaboration | Inefficiency, prone to researcher biases | Using PP to train a diverse policy pool of partners, which is then used to train a robust agent in a cooperative setting |
| | Hidden-Utility Self-Play [36] | Modeling human bias | Inefficiency, domain-knowledge requirement | Following the FCP framework and using a hidden reward function to model human bias in the human–AI cooperation problem |
| Evolution-Based Methods | PBT [37] | Efficient search, dynamic hyperparameter tuning | Resource-intensive, complex implementation | Online evolutionary process that adapts internal rewards and hyperparameters |
| | MERL [38] | No reward shaping required | Computationally expensive | Split-level training platform that does not require domain-specific reward shaping |
| | CERL [39] | Efficient sampling | Resource-intensive | Using a collective replay buffer to share all information across the population |
| | DERL [40] | Diverse solutions | Computationally expensive | Decoupling the processes of learning and evolution in a distributed asynchronous manner |
| General Framework | PSRO [25] | Robust learning, zero-sum convergence | Low scalability, computationally expensive | Using DRL to compute best responses to a distribution over policies and empirical game-theoretic analysis to compute new meta-strategy distributions |
| | PSRO-rN [41] | Open-ended learning, handles non-transitive cycles | Low scalability, computationally expensive | A PSRO-based framework that defines and uses effective diversity to encourage diverse skills |
| | Diverse PSRO [42] | Open-ended learning, low exploitability | Low scalability, computationally expensive | Introducing a new diversity measure, based on a geometric interpretation of games modeled by a determinantal point process, to improve diversity |
| | Pipeline PSRO [43] | Scalability, high efficiency | Approximation errors | Maintaining a hierarchical pipeline of reinforcement learning workers to improve the efficiency of PSRO |
| | α-PSRO [44] | General-sum, many-player games | Resource-intensive, meta-solver dependence | Using α-Rank as the meta-solver, a critical component of PSRO |
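To make the naive self-play and population-play rows above concrete, the following minimal Python sketch (not taken from the surveyed works; the environment, update rule, and all hyperparameters are illustrative assumptions) trains a small population of softmax policies on rock–paper–scissors, with each agent updating against opponents sampled uniformly from the population. Setting `population_size=1` recovers naive self-play, i.e., training against a mirrored copy of the agent.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (zero-sum).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_payoff(p, q):
    # Expected payoff of mixed strategy p against mixed strategy q.
    return p @ PAYOFF @ q

def train_population(population_size=4, iterations=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    logits = rng.normal(size=(population_size, 3))   # one logit vector per agent
    for _ in range(iterations):
        for i in range(population_size):
            j = rng.integers(population_size)        # sample an opponent from the population
            p, q = softmax(logits[i]), softmax(logits[j])
            per_action = PAYOFF @ q                  # payoff of each pure action against q
            advantage = per_action - expected_payoff(p, q)
            logits[i] += lr * p * advantage          # softmax policy-gradient ascent for agent i
    return [softmax(l) for l in logits]

if __name__ == "__main__":
    # The unique Nash equilibrium of rock-paper-scissors is the uniform mixture (1/3, 1/3, 1/3).
    for k, policy in enumerate(train_population()):
        print(f"agent {k}: {np.round(policy, 3)}")
```

Swapping the uniform opponent sampling for a meta-strategy computed over an empirical payoff matrix would turn this loop into a PSRO-style scheme, at the extra cost of maintaining and re-solving that matrix at every iteration.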
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).