On-line policy improvement using Monte-Carlo search
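
Gerald Tesauro and Gregory R. Galperin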

Published: 03 December 1996 · DOI: 10.5555/2998981.2999131

Abstract

We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers.
We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
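
The rollout procedure described above lends itself to a short sketch. The Python fragment below is a minimal one-step illustration, not the authors' SP1/SP2 implementation; the simulator interface (`actions`, `step`, `is_terminal`, `reward`) and the trial count are hypothetical placeholders standing in for whatever the application provides.

    def rollout_value(state, action, base_policy, step, is_terminal, reward,
                      n_trials=100):
        # Monte-Carlo estimate of the long-term expected reward of `action`:
        # play the action, then let the base policy decide every subsequent
        # step of each simulated trial, and average the terminal reward.
        total = 0.0
        for _ in range(n_trials):
            s = step(state, action)
            while not is_terminal(s):
                s = step(s, base_policy(s))
            total += reward(s)
        return total / n_trials

    def improved_policy(state, actions, base_policy, step, is_terminal, reward):
        # One step of on-line policy improvement: take the action whose
        # measured expected reward is highest.
        return max(actions(state),
                   key=lambda a: rollout_value(state, a, base_policy,
                                               step, is_terminal, reward))

Because the simulated trials are independent, they can be distributed across processors, which is what makes the algorithm, as the abstract notes, easily parallelizable.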



Published In

NIPS'96: Proceedings of the 9th International Conference on Neural Information Processing Systems
December 1996
1088 pages

Publisher

MIT Press, Cambridge, MA, United States
