Abstract
AlphaZero has achieved impressive performance in deep reinforcement learning by utilizing an architecture that combines search and training of a neural network in self-play. Many researchers are looking for ways to reproduce and improve its results for other games and tasks. However, the architecture is designed to learn from scratch, tabula rasa, accepting a cold-start problem in self-play. Recently, a warm-start enhancement method for Monte Carlo Tree Search was proposed to improve the self-play starting phase. It employs a fixed parameter \(I^\prime \) to control the warm-start length, and improved performance was reported on small board games. In this paper we present results with an adaptive switch method. Experiments show that our approach works better than a fixed \(I^\prime \), especially for "deep", tactical games (Othello and Connect Four). We conjecture that the adaptive value of \(I^\prime \) is also influenced by the size of the game, and that on average \(I^\prime \) will increase with game size. We conclude that AlphaZero-like deep reinforcement learning benefits from adaptive rollout-based warm-start, as Rapid Action Value Estimate did for rollout-based reinforcement learning 15 years ago.
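The fixed versus adaptive warm-start contrast can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the agreement-based switching criterion, and the threshold below are invented here purely to make the idea concrete (warm-start uses rollout-based MCTS evaluations; afterwards the trained network takes over).

    import random

    # Hypothetical stand-ins for rollout evaluation and the network's value head.
    def rollout_value(state):
        return random.uniform(-1, 1)

    def network_value(state):
        return random.uniform(-1, 1)

    def agreement(states):
        # Fraction of sampled states where rollout and network agree on the sign
        # of the evaluation (an invented switching criterion, for illustration only).
        same = sum(1 for s in states
                   if (rollout_value(s) >= 0) == (network_value(s) >= 0))
        return same / len(states)

    def use_warm_start(iteration, sampled_states, fixed_iprime=None, threshold=0.9):
        # Fixed scheme: keep the rollout-based warm-start for the first I' iterations.
        if fixed_iprime is not None:
            return iteration < fixed_iprime
        # Adaptive scheme: keep the warm-start until network and rollout
        # evaluations agree often enough on a sample of states.
        return agreement(sampled_states) < threshold

    # Example: ten self-play iterations with a fixed I' = 3.
    for it in range(10):
        print(it, use_warm_start(it, sampled_states=[None] * 20, fixed_iprime=3))

In the fixed scheme the switch point must be tuned per game; the adaptive scheme replaces the tuned \(I^\prime \) with a data-driven stopping test, which is the distinction the abstract draws.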
Acknowledgments
Hui Wang acknowledges financial support from the China Scholarship Council (CSC), CSC No. 201706990015.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H., Preuss, M., Plaat, A. (2021). Adaptive Warm-Start MCTS in AlphaZero-Like Deep Reinforcement Learning. In: Pham, D.N., Theeramunkong, T., Governatori, G., Liu, F. (eds) PRICAI 2021: Trends in Artificial Intelligence. PRICAI 2021. Lecture Notes in Computer Science, vol 13033. Springer, Cham. https://doi.org/10.1007/978-3-030-89370-5_5
DOI: https://doi.org/10.1007/978-3-030-89370-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89369-9
Online ISBN: 978-3-030-89370-5