DOI: 10.1609/aaai.v37i5.25733

The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications

Published: 07 February 2023
    Abstract

    In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. These sparse task metrics can be hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. One question this process raises is whether the same reward function is optimal for all algorithms, or, put differently, whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. To broadly examine ad hoc reward design, we also conduct a controlled observation study which emulates expert practitioners' typical reward design experiences. Here, we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design—of adopting a myopic strategy and weighing the relative goodness of each state-action pair—leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target.
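    To make the abstract's distinction concrete, the following minimal Python sketch (not code from the paper) contrasts a sparse true task metric with an ad hoc dense shaped reward, and shows that a learner's optimization target is the cumulative discounted return of a trajectory rather than the reward of any individual state-action pair. The goal position, shaping coefficient, and discount factor are illustrative assumptions.

    ```python
    # Illustrative sketch only: sparse "true" task metric vs. a hand-designed
    # dense reward in a toy 1-D task. GOAL, the 0.1 shaping coefficient, and
    # GAMMA are assumptions made for this example, not values from the paper.

    GOAL = 10     # hypothetical goal position on a 1-D line
    GAMMA = 0.99  # assumed discount factor


    def sparse_task_metric(next_state: int) -> float:
        """True performance metric: reward 1 upon success, 0 otherwise."""
        return 1.0 if next_state == GOAL else 0.0


    def dense_shaped_reward(state: int, next_state: int) -> float:
        """Ad hoc dense reward: small bonus for moving closer to the goal.
        Tuning terms like this by trial and error is the practice studied."""
        return 0.1 * (abs(GOAL - state) - abs(GOAL - next_state))


    def discounted_return(rewards) -> float:
        """What an RL algorithm actually optimizes: the discounted sum of
        rewards over a trajectory, not each per-step reward in isolation."""
        return sum(GAMMA ** t * r for t, r in enumerate(rewards))


    # A trajectory that walks from state 7 to the goal at 10.
    states = [7, 8, 9, 10]
    transitions = list(zip(states[:-1], states[1:]))
    dense = [dense_shaped_reward(s, s2) for s, s2 in transitions]
    sparse = [sparse_task_metric(s2) for _, s2 in transitions]
    print("dense return:", discounted_return(dense))    # ~0.297
    print("sparse return:", discounted_return(sparse))  # ~0.980
    ```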


    Cited By

    • (2024) Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning. Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 2399-2401. 10.5555/3635637.3663173. Online publication date: 6 May 2024.
    • (2024) Enhancing Safety in Learning from Demonstration Algorithms via Control Barrier Function Shielding. Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, 820-829. 10.1145/3610977.3635002. Online publication date: 11 March 2024.


    Published In

    AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
    February 2023, 16496 pages
    ISBN: 978-1-57735-880-0

    Sponsors

    • Association for the Advancement of Artificial Intelligence

    Publisher

    AAAI Press
