Achieving master level play in 9 × 9 Computer Go

Authors: Sylvain Gelly, David Silver

AAAI'08: Proceedings of the 23rd national conference on Artificial intelligence - Volume 3

Pages 1537 - 1540

Published: 13 July 2008

Abstract

The UCT algorithm uses Monte-Carlo simulation to estimate the value of states in a search tree from the current state. However, the first time a state is encountered, UCT has no knowledge, and is unable to generalise from previous experience. We describe two extensions that address these weaknesses. Our first algorithm, heuristic UCT, incorporates prior knowledge in the form of a value function. The value function can be learned offline, using a linear combination of a million binary features, with weights trained by temporal-difference learning. Our second algorithm, UCT-RAVE, forms a rapid online generalisation based on the value of moves. We applied our algorithms to the domain of 9 × 9 Computer Go, using the program MoGo. Using both heuristic UCT and RAVE, MoGo became the first program to achieve human master level in competitive play.
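
The three ideas in the abstract can be sketched together in a few lines of Python. This is only an illustrative sketch, not MoGo's implementation: the abstract gives no formulas, so the UCB1-style exploration bonus, the "virtual visits" form of the prior, the blending schedule `beta = sqrt(k / (3n + k))`, and every name (`Node`, `ActionStats`, `backup`, `score`, `select`, `c`, `k`) are assumptions made for illustration. The RAVE update is also simplified, since it ignores which player made the later moves.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ActionStats:
    """Statistics for one move from one tree node (hypothetical layout)."""
    n: float = 0.0       # visit count, possibly seeded with prior "virtual" visits
    q: float = 0.5       # mean simulation outcome in [0, 1]
    n_rave: float = 0.0  # simulations in which the move was played at any later point
    q_rave: float = 0.0  # mean outcome of those simulations


@dataclass
class Node:
    stats: dict = field(default_factory=dict)  # move -> ActionStats

    def add_move(self, move, prior_value=0.5, prior_visits=0.0):
        # Heuristic initialisation (assumed form): act as if the prior value
        # had already been observed `prior_visits` times.
        self.stats[move] = ActionStats(n=prior_visits, q=prior_value)


def update_mean(mean, count, outcome):
    """Incremental mean update; returns (new_mean, new_count)."""
    count += 1.0
    return mean + (outcome - mean) / count, count


def backup(node, move, outcome, later_moves=()):
    """Record one simulation outcome for `move`, plus RAVE credit for later moves."""
    s = node.stats[move]
    s.q, s.n = update_mean(s.q, s.n, outcome)
    # Simplified RAVE: every move seen in the simulation shares the outcome.
    for m in {move, *later_moves}:
        if m in node.stats:
            r = node.stats[m]
            r.q_rave, r.n_rave = update_mean(r.q_rave, r.n_rave, outcome)


def score(node, move, c=1.0, k=100.0):
    """UCB1-style score on a blend of the Monte-Carlo and RAVE values."""
    s = node.stats[move]
    if s.n == 0:
        return math.inf  # plain UCT knows nothing yet, so the move is tried first
    total = sum(a.n for a in node.stats.values())
    # Assumed schedule: trust RAVE early, the Monte-Carlo value as visits grow.
    beta = math.sqrt(k / (3.0 * total + k)) if s.n_rave > 0 else 0.0
    value = (1.0 - beta) * s.q + beta * s.q_rave
    return value + c * math.sqrt(math.log(1.0 + total) / s.n)


def select(node):
    """Pick the move with the highest score."""
    return max(node.stats, key=lambda m: score(node, m))
```

In this sketch, a node whose moves are added without priors scores every move as infinite, which mirrors the abstract's point that UCT has no knowledge the first time a state is encountered. Seeding `add_move` with a prior value and a nonzero `prior_visits` lets `select` discriminate between moves before any simulation has run, and `backup` with `later_moves` gives untried moves a RAVE estimate from simulations in which they appeared.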

Published In

AAAI'08: Proceedings of the 23rd national conference on Artificial intelligence - Volume 3

July 2008, 1892 pages

ISBN: 9781577353683

Editor: Anthony Cohn (University of Leeds)

Publisher: AAAI Press
