On-line policy improvement using Monte-Carlo search
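
Gerald Tesauro and Gregory R. Galperin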

Published: 03 December 1996 · DOI: 10.5555/2998981.2999131

Abstract

We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers.
We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
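
The rollout procedure described above lends itself to a short sketch. The Python fragment below is a minimal one-step illustration, not the authors' SP1/SP2 implementation; the simulator interface (`actions`, `step`, `is_terminal`, `reward`) and the trial count are hypothetical placeholders standing in for whatever the application provides.

    def rollout_value(state, action, base_policy, step, is_terminal, reward,
                      n_trials=100):
        # Monte-Carlo estimate of the long-term expected reward of `action`:
        # play the action, then let the base policy decide every subsequent
        # step of each simulated trial, and average the terminal reward.
        total = 0.0
        for _ in range(n_trials):
            s = step(state, action)
            while not is_terminal(s):
                s = step(s, base_policy(s))
            total += reward(s)
        return total / n_trials

    def improved_policy(state, actions, base_policy, step, is_terminal, reward):
        # One step of on-line policy improvement: take the action whose
        # measured expected reward is highest.
        return max(actions(state),
                   key=lambda a: rollout_value(state, a, base_policy,
                                               step, is_terminal, reward))

Because the simulated trials are independent, they can be distributed across processors, which is what makes the algorithm, as the abstract notes, easily parallelizable.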



Published In

NIPS'96: Proceedings of the 9th International Conference on Neural Information Processing Systems
December 1996
1088 pages

Publisher

MIT Press, Cambridge, MA, United States
