DOI: 10.5555/3540261.3540921

Adversarial intrinsic motivation for reinforcement learning

Published: 10 June 2024

    Abstract

    Learning with an objective to minimize the mismatch with a reference distribution has been shown to be useful for generative modeling and imitation learning. In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution, can be utilized effectively for reinforcement learning (RL) tasks. Specifically, this paper focuses on goal-conditioned reinforcement learning where the idealized (unachievable) target distribution has full measure at the goal. This paper introduces a quasimetric specific to Markov Decision Processes (MDPs) and uses this quasimetric to estimate the above Wasserstein-1 distance. It further shows that the policy that minimizes this Wasserstein-1 distance is the policy that reaches the goal in as few steps as possible. Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function. Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent's exploration to find the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning.
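
    As a reading aid, here is a minimal PyTorch sketch (not the authors' implementation) of an AIM-style intrinsic reward. A potential network f is trained on the dual form of the Wasserstein-1 objective, maximizing E_target[f] - E_policy[f], while a penalty on observed transitions (s, s') keeps f(s') - f(s) bounded as a stand-in for the paper's MDP quasimetric constraint; the supplemental reward is then the potential difference f(s') - f(s). All names (AIMPotential, potential_loss, aim_reward, lambda_penalty), the exact penalty form, and the toy data are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of an AIM-style intrinsic reward (illustrative, not the authors' code).
import torch
import torch.nn as nn


class AIMPotential(nn.Module):
    """Scalar potential f(s) used in the Wasserstein-1 dual (illustrative)."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)


def potential_loss(f: AIMPotential,
                   goal_states: torch.Tensor,     # samples from the target distribution
                   visited_states: torch.Tensor,  # states visited by the current policy
                   next_states: torch.Tensor,     # successors of visited_states
                   lambda_penalty: float = 10.0) -> torch.Tensor:
    """Negative dual objective plus a transition-smoothness penalty.

    Minimizing this maximizes E_target[f] - E_policy[f] while softly enforcing
    f(s') - f(s) <= 1 on observed transitions, a stand-in for the time-step
    quasimetric constraint described in the abstract.
    """
    dual_gap = f(goal_states).mean() - f(visited_states).mean()
    violation = torch.relu(f(next_states) - f(visited_states) - 1.0)
    return -dual_gap + lambda_penalty * (violation ** 2).mean()


def aim_reward(f: AIMPotential, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    """Supplemental reward: positive when the potential rates s' closer to the goal."""
    with torch.no_grad():
        return f(s_next) - f(s)


if __name__ == "__main__":
    torch.manual_seed(0)
    state_dim = 4
    f = AIMPotential(state_dim)
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)

    # Toy data: the "goal" distribution is concentrated near the origin.
    goals = 0.05 * torch.randn(256, state_dim)
    visited = torch.randn(256, state_dim)
    nxt = visited + 0.1 * torch.randn(256, state_dim)

    for _ in range(200):
        loss = potential_loss(f, goals, visited, nxt)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print("mean intrinsic reward on toy batch:",
          aim_reward(f, visited, nxt).mean().item())
```

    Because the reward is a difference of a learned potential, it varies smoothly across transitions in the MDP, which matches the behavior described in the abstract; the penalty weight, network sizes, and training loop above are arbitrary toy choices.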

    Supplementary Material

    Additional material (3540261.3540921_supp.pdf)
    Supplemental material.



    Published In

    NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
    December 2021
    30517 pages

    Publisher

    Curran Associates Inc.

    Red Hook, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited
