
Balanced Q-learning: Combining the influence of optimistic and pessimistic targets

Published: 01 December 2023

Abstract

The optimistic nature of the Q-learning target leads to an overestimation bias, an inherent problem in standard Q-learning. Such a bias fails to account for the possibility of low returns, particularly in risky scenarios. However, biases, whether of overestimation or underestimation, are not necessarily undesirable. In this paper, we analytically examine the utility of biased learning, and show that specific types of biases may be preferable, depending on the scenario. Based on this finding, we design a novel reinforcement learning algorithm, Balanced Q-learning, in which the target is modified to be a convex combination of a pessimistic and an optimistic term, whose associated weights are determined online, analytically. Such a balanced target inherently promotes risk-averse behavior, which we examine through the lens of the agent's exploration. We prove the convergence of this algorithm in a tabular setting, and empirically demonstrate its consistently good learning performance in various environments.
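To make the construction described in the abstract concrete, below is a minimal tabular sketch of such a balanced update. It assumes the optimistic term is the usual max over next-state action values, the pessimistic term is the corresponding min, and a fixed mixing weight beta stands in for the analytically determined online weights; all three are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def balanced_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, beta=0.5):
    """One tabular update toward a balanced target (illustrative sketch)."""
    # Optimistic bootstrap: the usual Q-learning max over next-state actions.
    optimistic = np.max(Q[s_next])
    # Pessimistic bootstrap: taken here as the min over next-state actions
    # (an assumption; the paper defines its own pessimistic term and weights).
    pessimistic = np.min(Q[s_next])
    # Balanced target: a convex combination of the two bootstrap terms.
    target = r + gamma * (beta * optimistic + (1.0 - beta) * pessimistic)
    # Standard temporal-difference step toward the balanced target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

In this sketch, beta = 1 recovers the standard (optimistic) Q-learning target, while beta = 0 yields a purely pessimistic, worst-case target; intermediate values trade off the two influences.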


Cited By

  • (2024) Adaptive order Q-learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), pp. 4946–4954. DOI: 10.24963/ijcai.2024/547. Online publication date: 3 August 2024.
  • (2024) A Meta-Learning Approach to Mitigating the Estimation Bias of Q-Learning. ACM Transactions on Knowledge Discovery from Data 18(9), pp. 1–23. DOI: 10.1145/3688849. Online publication date: 14 August 2024.


          Published In

          Artificial Intelligence, Volume 325, Issue C (December 2023), 324 pages

          Publisher

          Elsevier Science Publishers Ltd., United Kingdom


          Author Tags

          1. Reinforcement learning
          2. Maximization bias
          3. Q-learning target
          4. Optimistic updates
          5. Pessimistic updates

          Qualifiers

          • Research-article

