Mastering the game of Go without human knowledge

Silver, David; Schrittwieser, Julian; Simonyan, Karen; Antonoglou, Ioannis; Huang, Aja; Guez, Arthur; Hubert, Thomas; Baker, Lucas; Lai, Matthew; Bolton, Adrian; Chen, Yutian; Lillicrap, Timothy; Hui, Fan; Sifre, Laurent; van den Driessche, George; Graepel, Thore; Hassabis, Demis

doi:10.1038/nature24270

Article
Published: 19 October 2017

Nature volume 550, pages 354–359 (2017)

Abstract

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
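
The training signal described above, in which one network predicts both the search's move selections and the eventual winner, can be illustrated with a small sketch. This is not the authors' implementation: the function name `self_play_loss` and the argument names `p`, `v`, `pi` and `z` are hypothetical, and the choice of squared error for the winner prediction plus cross-entropy for the move prediction is an assumption made for illustration, since the abstract does not specify the loss. Any regularisation term is omitted.

```python
import math


def self_play_loss(p, v, pi, z, eps=1e-12):
    """Illustrative per-position loss (assumed form, not taken from the abstract).

    p  : move probabilities predicted by the network (sums to 1)
    v  : predicted game outcome, in [-1, 1]
    pi : move-selection distribution produced by the tree search (target)
    z  : actual game outcome from the current player's view (+1 win, -1 loss)
    """
    if len(p) != len(pi):
        raise ValueError("p and pi must have the same length")
    # Winner prediction: squared error between predicted and actual outcome.
    value_loss = (z - v) ** 2
    # Move prediction: cross-entropy between search target and network output.
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    return value_loss + policy_loss


# Toy position with three legal moves: the search always picked move 0,
# the network gave it probability 0.5, and the game was eventually won.
example_loss = self_play_loss(p=[0.5, 0.25, 0.25], v=0.0, pi=[1.0, 0.0, 0.0], z=1.0)
print(round(example_loss, 4))
```

Minimising a loss of this kind pulls the network's move probabilities towards the search's choices and its value output towards the observed winner, which is the sense in which the program acts as its own teacher.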

**Figure 1: Self-play reinforcement learning in AlphaGo Zero.**

**Figure 3: Empirical evaluation of AlphaGo Zero.**

**Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee.**

**Figure 5: Go knowledge learned by AlphaGo Zero.**

**Figure 6: Performance of AlphaGo Zero.**

Acknowledgements

We thank A. Cain for work on the visuals; A. Barreto, G. Ostrovski, T. Ewalds, T. Schaul, J. Oh and N. Heess for reviewing the paper; and the rest of the DeepMind team for their support.

Author information

David Silver, Julian Schrittwieser and Karen Simonyan: These authors contributed equally to this work.

Authors and Affiliations

DeepMind, 5 New Street Square, London, EC4A 3TW, UK
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel & Demis Hassabis

Contributions

D.S., J.S., K.S., I.A., A.G., L.S. and T.H. designed and implemented the reinforcement learning algorithm in AlphaGo Zero. A.H., J.S., M.L. and D.S. designed and implemented the search in AlphaGo Zero. L.B., J.S., A.H., F.H., T.H., Y.C. and D.S. designed and implemented the evaluation framework for AlphaGo Zero. D.S., A.B., F.H., A.G., T.L., T.G., L.S., G.v.d.D. and D.H. managed and advised on the project. D.S., T.G. and A.G. wrote the paper.

Corresponding author

Correspondence to David Silver.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Reviewer Information: Nature thanks S. Singh and the other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Figure 1 Tournament games between AlphaGo Zero (20 blocks, 3 days) versus AlphaGo Lee using 2 h time controls.

One hundred moves of the first 20 games are shown; full games are provided in the Supplementary Information.

Extended Data Figure 2 Frequency of occurrence over time during training, for each joseki from Fig. 5a (corner sequences common in professional play that were discovered by AlphaGo Zero).

The corresponding joseki are shown on the right.

Extended Data Figure 3 Frequency of occurrence over time during training, for each joseki from Fig. 5b (corner sequences that AlphaGo Zero favoured for at least one iteration), and one additional variation.

The corresponding joseki are shown on the right.

Extended Data Figure 4 AlphaGo Zero (20 blocks) self-play games.

The 3-day training run was subdivided into 20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with 2 h time controls. One hundred moves are shown for each game; full games are provided in the Supplementary Information.

Extended Data Figure 5 AlphaGo Zero (40 blocks) self-play games.

The 40-day training run was subdivided into 20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with 2 h time controls. One hundred moves are shown for each game; full games are provided in the Supplementary Information.

Extended Data Figure 6 AlphaGo Zero (40 blocks, 40 days) versus AlphaGo Master tournament games using 2 h time controls.

One hundred moves of the first 20 games are shown; full games are provided in the Supplementary Information.

Extended Data Table 1 Move prediction accuracy

Extended Data Table 2 Game outcome prediction error

Extended Data Table 3 Learning rate schedule

Supplementary information

Reporting Summary (PDF 67 kb)

Supplementary Data

This zipped file contains the game records of self-play and tournament games played by AlphaGo Zero in .sgf format. (ZIP 82 kb)

About this article

Cite this article

Silver, D., Schrittwieser, J., Simonyan, K. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017). https://doi.org/10.1038/nature24270

Received: 07 April 2017
Accepted: 13 September 2017
Published: 19 October 2017
Issue Date: 19 October 2017
DOI: https://doi.org/10.1038/nature24270
