Abstract
Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is otherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989).
We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.
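The abstract only names the subgoal-identification heuristic; as a rough illustration of the general idea, the Python sketch below counts state visits over a batch of trajectories and greedily returns the most visited states as candidate subgoals. The function name, the trajectory format, and the top-k selection are assumptions made for this sketch, not the authors' exact procedure.

```python
from collections import Counter

def identify_subgoals(trajectories, num_subgoals, excluded=()):
    """Greedily pick candidate subgoal states by visitation frequency.

    trajectories: iterable of state sequences collected while the agent
        performs (possibly different) goal-achievement tasks.
    num_subgoals: how many candidate subgoals to return.
    excluded: states to ignore (e.g., start and goal states).

    Illustrative sketch only; not the authors' exact procedure.
    """
    counts = Counter()
    for trajectory in trajectories:
        for state in trajectory:
            if state not in excluded:
                counts[state] += 1
    # Greedy selection: the most frequently visited states become subgoals,
    # which can then serve as termination states for new options.
    return [state for state, _ in counts.most_common(num_subgoals)]

# Example: states visited in three gridworld episodes, as (x, y) cells.
episodes = [
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],
    [(0, 1), (1, 1), (2, 1), (2, 2)],
    [(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)],
]
print(identify_subgoals(episodes, num_subgoals=1))  # -> [(2, 1)], visited in every episode
```

In an environment with obstacles, such frequently visited cells tend to coincide with bottleneck states (e.g., doorways); in an open gridworld they simply mark well-travelled regions, which is consistent with the behaviour reported in the abstract.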
References
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov Decision Problems. Advances in Neural Information Processing Systems 7 (pp. 393–400). MIT Press.
Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann.
Fikes, R., Hart, P. E., & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251–288.
Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.
Korf, R. E. (1985). Learning to solve problems by searching for macro-operators. Pitman Publishing Ltd.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in SOAR: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46.
Mahadevan, S., Marchallek, N., Das, T. K., & Gosavi, A. (1997). Self-improving factory simulation using continuous-time average-reward reinforcement learning. Proceedings of the Fourteenth International Conference on Machine Learning (pp. 202–210). Morgan Kaufmann.
McGovern, A., Sutton, R. S., & Fagg, A. H. (1997). Roles of macro-actions in accelerating reinforcement learning. Grace Hopper Celebration of Women in Computing (pp. 13–17).
McGovern, E. A. (2002). Autonomous discovery of temporal abstractions from interaction with an environment. Doctoral dissertation, University of Massachusetts, Amherst.
McGovern, E. A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 361–368). Morgan Kaufmann.
Minton, S. (1988). Learning search control knowledge: An explanation-based approach. Kluwer Academic Publishers.
Newell, A., & Simon, H. A. (1972). Human problem solving. Prentice-Hall.
Parr, R. (1998). Hierarchical control and learning for Markov Decision Processes. Doctoral dissertation, Computer Science Division, University of California, Berkeley, USA.
Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. Advances in Neural Information Processing Systems 10. MIT Press.
Precup, D. (2000). Temporal abstraction in reinforcement learning. Doctoral dissertation, Department of Computer Science, University of Massachusetts, Amherst, USA.
Puterman, M. L. (1994). Markov Decision Processes: Discrete stochastic dynamic programming. Wiley.
Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, 5, 115–135.
Singh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 202–207). MIT/AAAI Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral dissertation, Psychology Department, Cambridge University, Cambridge, UK.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stolle, M., Precup, D. (2002). Learning Options in Reinforcement Learning. In: Koenig, S., Holte, R.C. (eds) Abstraction, Reformulation, and Approximation. SARA 2002. Lecture Notes in Computer Science, vol 2371. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45622-8_16
DOI: https://doi.org/10.1007/3-540-45622-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43941-7
Online ISBN: 978-3-540-45622-3