DOI: 10.5555/1768841.1768871
Article

Q-learning with linear function approximation

Published: 13 June 2007

Abstract

In this paper, we analyze the convergence of Q-learning with linear function approximation. We identify a set of conditions under which this method converges with probability 1 when a fixed learning policy is used. We discuss the differences and similarities between our results and those obtained in several related works. We also discuss the applicability of this method when a changing policy is used. Finally, we describe how this approximate method can be applied in partially observable scenarios.
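
To make the setting concrete, here is a minimal sketch, in Python, of the kind of iteration the abstract refers to: the action-value function is approximated as Q(x, a) ≈ φ(x, a)ᵀθ, and θ is updated from transitions generated by a fixed learning (behavior) policy. The environment interface (env.reset, env.step), the feature map phi, and behavior_policy are hypothetical placeholders, and the step-size schedule is just one standard choice satisfying the usual stochastic-approximation conditions; this is not the authors' exact formulation.

    # Minimal sketch of Q-learning with linear function approximation
    # (not the authors' exact formulation). `env`, `phi` and
    # `behavior_policy` are hypothetical placeholders.
    import numpy as np

    def linear_q_learning(env, phi, behavior_policy, n_actions,
                          n_features, gamma=0.99, n_steps=100_000):
        theta = np.zeros(n_features)      # weights defining Q(x, a) = phi(x, a) . theta
        x = env.reset()
        for t in range(1, n_steps + 1):
            a = behavior_policy(x)        # fixed learning (behavior) policy
            x_next, r, done = env.step(a)

            # Greedy one-step lookahead under the current approximation.
            q_next = max(phi(x_next, b) @ theta for b in range(n_actions))
            target = r if done else r + gamma * q_next

            # Temporal-difference error and linear update.
            delta = target - phi(x, a) @ theta
            alpha = 1.0 / t               # step sizes: sum diverges, sum of squares converges
            theta += alpha * delta * phi(x, a)

            x = env.reset() if done else x_next
        return theta

The convergence result summarized above concerns iterations of this general form, under conditions on the features and on the fixed learning policy.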




Published In

COLT'07: Proceedings of the 20th Annual Conference on Learning Theory
June 2007
634 pages
ISBN: 9783540729259
  • Editors:
  • Nader H. Bshouty,
  • Claudio Gentile

Sponsors

  • Google Inc.
  • Machine Learning Journal
  • IBM

Publisher

Springer-Verlag

Berlin, Heidelberg


Cited By

  • (2024) Multi-Timescale Ensemble Q-Learning for Markov Decision Process Policy Optimization. IEEE Transactions on Signal Processing, 72, 1427-1442. DOI: 10.1109/TSP.2024.3372699. Online publication date: 1-Jan-2024.
  • (2023) Convex-concave 0-sum Markov Stackelberg games. Proceedings of the 37th International Conference on Neural Information Processing Systems, 66818-66832. DOI: 10.5555/3666122.3669039. Online publication date: 10-Dec-2023.
  • (2023) On the convergence and sample complexity analysis of deep Q-networks with ε-greedy exploration. Proceedings of the 37th International Conference on Neural Information Processing Systems, 13064-13102. DOI: 10.5555/3666122.3666694. Online publication date: 10-Dec-2023.
  • (2023) Cooperative multi-agent reinforcement learning. Proceedings of the 40th International Conference on Machine Learning, 24785-24811. DOI: 10.5555/3618408.3619440. Online publication date: 23-Jul-2023.
  • (2022) Instance-dependent near-optimal policy identification in linear MDPs via online experiment design. Proceedings of the 36th International Conference on Neural Information Processing Systems, 5968-5981. DOI: 10.5555/3600270.3600702. Online publication date: 28-Nov-2022.
  • (2019) Provably efficient Q-learning with function approximation via distribution shift error checking oracle. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 8060-8070. DOI: 10.5555/3454287.3455011. Online publication date: 8-Dec-2019.
  • (2018) Non-delusional Q-learning and value iteration. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 9971-9981. DOI: 10.5555/3327546.3327661. Online publication date: 3-Dec-2018.
  • (2018) Q-learning with nearest neighbors. Proceedings of the 32nd International Conference on Neural Information Processing Systems, 3115-3125. DOI: 10.5555/3327144.3327233. Online publication date: 3-Dec-2018.
  • (2017) Reinforcement Learning for Multi-Step Expert Advice. Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, 962-971. DOI: 10.5555/3091125.3091262. Online publication date: 8-May-2017.
  • (2009) A task annotation model for sandbox Serious Games. Proceedings of the 5th International Conference on Computational Intelligence and Games, 233-240. DOI: 10.5555/1719293.1719336. Online publication date: 7-Sep-2009.
