DOI: 10.1609/aaai.v37i5.25733

The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications

Published: 07 February 2023
    Abstract

    In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. These sparse task metrics can be hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. One question this process raises is whether the same reward function is optimal for all algorithms, or, put differently, whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. To broadly examine ad hoc reward design, we also conduct a controlled observation study which emulates expert practitioners' typical reward design experiences. Here, we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design—of adopting a myopic strategy and weighing the relative goodness of each state-action pair—leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target.
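    To make the abstract's distinction concrete, the following minimal Python sketch (not code from the paper) contrasts a sparse true task metric with an ad hoc dense shaped reward, and shows that a learner's optimization target is the cumulative discounted return of a trajectory rather than the reward of any individual state-action pair. The goal position, shaping coefficient, and discount factor are illustrative assumptions.

    ```python
    # Illustrative sketch only: sparse "true" task metric vs. a hand-designed
    # dense reward in a toy 1-D task. GOAL, the 0.1 shaping coefficient, and
    # GAMMA are assumptions made for this example, not values from the paper.

    GOAL = 10     # hypothetical goal position on a 1-D line
    GAMMA = 0.99  # assumed discount factor


    def sparse_task_metric(next_state: int) -> float:
        """True performance metric: reward 1 upon success, 0 otherwise."""
        return 1.0 if next_state == GOAL else 0.0


    def dense_shaped_reward(state: int, next_state: int) -> float:
        """Ad hoc dense reward: small bonus for moving closer to the goal.
        Tuning terms like this by trial and error is the practice studied."""
        return 0.1 * (abs(GOAL - state) - abs(GOAL - next_state))


    def discounted_return(rewards) -> float:
        """What an RL algorithm actually optimizes: the discounted sum of
        rewards over a trajectory, not each per-step reward in isolation."""
        return sum(GAMMA ** t * r for t, r in enumerate(rewards))


    # A trajectory that walks from state 7 to the goal at 10.
    states = [7, 8, 9, 10]
    transitions = list(zip(states[:-1], states[1:]))
    dense = [dense_shaped_reward(s, s2) for s, s2 in transitions]
    sparse = [sparse_task_metric(s2) for _, s2 in transitions]
    print("dense return:", discounted_return(dense))    # ~0.297
    print("sparse return:", discounted_return(sparse))  # ~0.980
    ```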


    Cited By

    • (2024) Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 2399-2401. 10.5555/3635637.3663173. Online publication date: 6 May 2024.
    • (2024) Enhancing Safety in Learning from Demonstration Algorithms via Control Barrier Function Shielding. Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 820-829. 10.1145/3610977.3635002. Online publication date: 11 March 2024.


    Published In

    AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
    February 2023, 16496 pages
    ISBN: 978-1-57735-880-0

    Sponsors

    • Association for the Advancement of Artificial Intelligence

    Publisher

    AAAI Press
