DOI: 10.5555/3540261.3540569

IQ-Learn: Inverse soft-Q Learning for Imitation

Published: 06 December 2021

Abstract

In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is widely used due to its simple implementation and stable convergence, but it does not utilize any information about the environment's dynamics. Many existing methods that do exploit dynamics information are difficult to train in practice, requiring either an adversarial optimization process over reward and policy approximators or biased, high-variance gradient estimators. We introduce a method for dynamics-aware IL that avoids adversarial training by learning a single Q-function which implicitly represents both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating that our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn), obtains state-of-the-art results in offline and online imitation learning settings, significantly outperforming existing methods both in the number of required environment interactions and in scalability to high-dimensional spaces, often by more than 3x.
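To make concrete how a single Q-function can implicitly represent both a reward and a policy, here is a minimal NumPy sketch of the standard maximum-entropy soft-Q identities the abstract alludes to: the soft value V, the softmax policy, and the reward recovered by inverting the soft Bellman equation. It assumes a discrete action space with known tabular dynamics purely for illustration; the variable names (Q, P, gamma) are ours, not the paper's implementation.

```python
import numpy as np

gamma = 0.99
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

# Stand-ins for a learned soft Q-function and known dynamics P(s' | s, a).
Q = rng.normal(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Soft value function: V(s) = log sum_a exp Q(s, a).
V = np.log(np.exp(Q).sum(axis=1))

# Policy implicit in Q: pi(a | s) = exp(Q(s, a) - V(s)), a softmax over actions.
pi = np.exp(Q - V[:, None])

# Reward implicit in Q, via the inverse soft Bellman operator:
# r(s, a) = Q(s, a) - gamma * E_{s' ~ P(. | s, a)}[V(s')].
r = Q - gamma * (P @ V)

assert np.allclose(pi.sum(axis=1), 1.0)  # each pi(. | s) is a valid distribution
print(r.shape)  # (n_states, n_actions): one recovered reward per state-action pair
```

In the high-dimensional settings the paper targets, Q would be a neural network and the expectation over next states would be estimated from sampled transitions rather than computed from a known dynamics matrix.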

Supplementary Material

Supplemental material: 3540261.3540569_supp.pdf

Published In

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021, 30517 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited
