
Balanced Q-learning: Combining the influence of optimistic and pessimistic targets

Published: 01 December 2023

Abstract

The optimistic nature of the Q-learning target leads to an overestimation bias, an inherent problem in standard Q-learning. Such a bias fails to account for the possibility of low returns, particularly in risky scenarios. However, biases, whether of overestimation or underestimation, are not necessarily undesirable. In this paper, we analytically examine the utility of biased learning, and show that specific types of biases may be preferable, depending on the scenario. Based on this finding, we design a novel reinforcement learning algorithm, Balanced Q-learning, in which the target is modified to be a convex combination of a pessimistic and an optimistic term, whose associated weights are determined online, analytically. Such a balanced target inherently promotes risk-averse behavior, which we examine through the lens of the agent's exploration. We prove the convergence of this algorithm in a tabular setting, and empirically demonstrate its consistently good learning performance in various environments.
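To make the construction described in the abstract concrete, below is a minimal tabular sketch of such a balanced update. It assumes the optimistic term is the usual max over next-state action values, the pessimistic term is the corresponding min, and a fixed mixing weight beta stands in for the analytically determined online weights; all three are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def balanced_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, beta=0.5):
    """One tabular update toward a balanced target (illustrative sketch)."""
    # Optimistic bootstrap: the usual Q-learning max over next-state actions.
    optimistic = np.max(Q[s_next])
    # Pessimistic bootstrap: taken here as the min over next-state actions
    # (an assumption; the paper defines its own pessimistic term and weights).
    pessimistic = np.min(Q[s_next])
    # Balanced target: a convex combination of the two bootstrap terms.
    target = r + gamma * (beta * optimistic + (1.0 - beta) * pessimistic)
    # Standard temporal-difference step toward the balanced target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

In this sketch, beta = 1 recovers the standard (optimistic) Q-learning target, while beta = 0 yields a purely pessimistic, worst-case target; intermediate values trade off the two influences.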


Cited By

  • (2024) Adaptive order Q-learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), pp. 4946–4954. DOI: 10.24963/ijcai.2024/547. Online publication date: 3 August 2024.
  • (2024) A Meta-Learning Approach to Mitigating the Estimation Bias of Q-Learning. ACM Transactions on Knowledge Discovery from Data 18(9), pp. 1–23. DOI: 10.1145/3688849. Online publication date: 14 August 2024.


          Published In

          Artificial Intelligence, Volume 325, Issue C (December 2023), 324 pages

          Publisher

          Elsevier Science Publishers Ltd., United Kingdom


          Author Tags

          1. Reinforcement learning
          2. Maximization bias
          3. Q-learning target
          4. Optimistic updates
          5. Pessimistic updates

          Qualifiers

          • Research-article

