Abstract
We propose a model-based learning algorithm, the Adaptive-resolution Reinforcement Learning (ARL) algorithm, that aims to solve the online, continuous state-space reinforcement learning problem in a deterministic domain. Our goal is to combine adaptive-resolution approximation schemes with efficient exploration in order to obtain polynomial learning rates. The proposed algorithm adaptively approximates the optimal value function by kernel-based averaging, moving from a coarse to a fine kernel-based representation of the state space; this enables finer resolution in the “important” areas of the state space and coarser resolution elsewhere. We consider an online learning approach in which these important areas are discovered online, using an uncertainty-interval exploration technique. In addition, we introduce an incremental variant of ARL (IARL), a more practical version of the original algorithm with reduced computational complexity at each stage. Polynomial learning rates, in terms of a mistake bound in a PAC framework, are established for these algorithms under appropriate continuity assumptions.
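To make the coarse-to-fine, kernel-based construction concrete, the following minimal Python sketch shows one way such a scheme could look. It is an illustration under stated assumptions, not the ARL algorithm of the paper: the Gaussian kernel, the optimistic fallback value V_MAX, the bandwidth-halving schedule, and all names (kernel_weights, KernelValue, refine) are hypothetical choices made only for this example.

# Hypothetical illustration (not the ARL pseudocode from the paper): a kernel-
# averaged value estimate over visited states, with optimistic values in
# unexplored regions and a coarse-to-fine bandwidth schedule.
import numpy as np

GAMMA = 0.95                    # discount factor (illustrative choice)
V_MAX = 1.0 / (1.0 - GAMMA)     # optimistic upper bound on the value

def kernel_weights(x, centers, bandwidth):
    """Normalized Gaussian kernel weights of state x w.r.t. stored centers."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    s = w.sum()
    return w / s if s > 1e-12 else np.zeros_like(w)

class KernelValue:
    """Kernel-based averaging approximation of the optimal value function.

    States far from every stored sample get the optimistic bound V_MAX,
    which plays the role of an uncertainty-driven exploration bonus.
    """
    def __init__(self, bandwidth):
        self.bandwidth = bandwidth      # current resolution (large = coarse)
        self.centers = []               # visited sample states
        self.values = []                # value estimates at those states

    def value(self, x):
        if not self.centers:
            return V_MAX
        w = kernel_weights(np.asarray(x, dtype=float),
                           np.array(self.centers), self.bandwidth)
        if w.sum() < 1e-12:             # no nearby samples: stay optimistic
            return V_MAX
        return float(w @ np.array(self.values))

    def update(self, x, reward, next_x):
        """Deterministic Bellman backup at a newly visited state."""
        backup = reward + GAMMA * self.value(next_x)
        self.centers.append(np.asarray(x, dtype=float))
        self.values.append(backup)

    def refine(self, factor=0.5):
        """Shrink the bandwidth: finer resolution where samples accumulate."""
        self.bandwidth *= factor

# Toy run on a 1-D deterministic chain, refining the resolution periodically.
vf = KernelValue(bandwidth=0.5)
x = np.array([0.0])
for step in range(20):
    next_x = x + 0.1                    # deterministic transition
    reward = 1.0 if next_x[0] > 1.5 else 0.0
    vf.update(x, reward, next_x)
    x = next_x
    if (step + 1) % 5 == 0:
        vf.refine()                     # coarse-to-fine refinement
print(round(vf.value(np.array([1.0])), 2))

The sketch only conveys the flavor of the approach: values default to an optimistic bound outside the sampled region, so a planner built on top of them would be drawn toward unexplored states, and the kernel bandwidth shrinks over time so that resolution increases where experience accumulates.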
Additional information
Editor: Roni Khardon.
Cite this article
Bernstein, A., Shimkin, N. Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains. Mach Learn 81, 359–397 (2010). https://doi.org/10.1007/s10994-010-5186-7