DOI: 10.5555/1768841.1768871
Article

Q-learning with linear function approximation

Published: 13 June 2007

Abstract

In this paper, we analyze the convergence of Q-learning with linear function approximation. We identify a set of conditions under which this method converges with probability 1 when a fixed learning policy is used. We discuss the differences and similarities between our results and those obtained in several related works. We also discuss the applicability of this method when a changing policy is used. Finally, we describe how this approximate method can be applied in partially observable scenarios.
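
To make the setting concrete, here is a minimal sketch, in Python, of the kind of iteration the abstract refers to: the action-value function is approximated as Q(x, a) ≈ φ(x, a)ᵀθ, and θ is updated from transitions generated by a fixed learning (behavior) policy. The environment interface (env.reset, env.step), the feature map phi, and behavior_policy are hypothetical placeholders, and the step-size schedule is just one standard choice satisfying the usual stochastic-approximation conditions; this is not the authors' exact formulation.

    # Minimal sketch of Q-learning with linear function approximation
    # (not the authors' exact formulation). `env`, `phi` and
    # `behavior_policy` are hypothetical placeholders.
    import numpy as np

    def linear_q_learning(env, phi, behavior_policy, n_actions,
                          n_features, gamma=0.99, n_steps=100_000):
        theta = np.zeros(n_features)      # weights defining Q(x, a) = phi(x, a) . theta
        x = env.reset()
        for t in range(1, n_steps + 1):
            a = behavior_policy(x)        # fixed learning (behavior) policy
            x_next, r, done = env.step(a)

            # Greedy one-step lookahead under the current approximation.
            q_next = max(phi(x_next, b) @ theta for b in range(n_actions))
            target = r if done else r + gamma * q_next

            # Temporal-difference error and linear update.
            delta = target - phi(x, a) @ theta
            alpha = 1.0 / t               # step sizes: sum diverges, sum of squares converges
            theta += alpha * delta * phi(x, a)

            x = env.reset() if done else x_next
        return theta

The convergence result summarized above concerns iterations of this general form, under conditions on the features and on the fixed learning policy.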




Published In

COLT'07: Proceedings of the 20th Annual Conference on Learning Theory
June 2007
634 pages
ISBN: 9783540729259
  • Editors:
  • Nader H. Bshouty,
  • Claudio Gentile

Sponsors

  • Google Inc.
  • Machine Learning Journal
  • IBM

Publisher

Springer-Verlag

Berlin, Heidelberg


Cited By

  • (2024) Multi-Timescale Ensemble Q-Learning for Markov Decision Process Policy Optimization. IEEE Transactions on Signal Processing, 72, 1427-1442. DOI: 10.1109/TSP.2024.3372699. Online publication date: 1-Jan-2024.
  • (2023) Convex-concave 0-sum Markov Stackelberg games. Proceedings of the 37th International Conference on Neural Information Processing Systems, 66818-66832. DOI: 10.5555/3666122.3669039. Online publication date: 10-Dec-2023.
  • (2023) On the convergence and sample complexity analysis of deep Q-networks with ε-greedy exploration. Proceedings of the 37th International Conference on Neural Information Processing Systems, 13064-13102. DOI: 10.5555/3666122.3666694. Online publication date: 10-Dec-2023.
  • (2023) Cooperative multi-agent reinforcement learning. Proceedings of the 40th International Conference on Machine Learning, 24785-24811. DOI: 10.5555/3618408.3619440. Online publication date: 23-Jul-2023.
  • (2022) Instance-dependent near-optimal policy identification in linear MDPs via online experiment design. Proceedings of the 36th International Conference on Neural Information Processing Systems, 5968-5981. DOI: 10.5555/3600270.3600702. Online publication date: 28-Nov-2022.
  • (2019) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 8060-8070. DOI: 10.5555/3454287.3455011. Online publication date: 8-Dec-2019.
  • (2018) Non-delusional Q-learning and value iteration. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 9971-9981. DOI: 10.5555/3327546.3327661. Online publication date: 3-Dec-2018.
  • (2018) Q-learning with nearest neighbors. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 3115-3125. DOI: 10.5555/3327144.3327233. Online publication date: 3-Dec-2018.
  • (2017) Reinforcement Learning for Multi-Step Expert Advice. Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 962-971. DOI: 10.5555/3091125.3091262. Online publication date: 8-May-2017.
  • (2009) A task annotation model for sandbox Serious Games. Proceedings of the 5th International Conference on Computational Intelligence and Games, 233-240. DOI: 10.5555/1719293.1719336. Online publication date: 7-Sep-2009.
