DOI: 10.5555/3666122.3667956

State-action similarity-based representations for off-policy evaluation

Published: 10 December 2023

Abstract

In reinforcement learning, off-policy evaluation (OPE) is the problem of estimating the expected return of an evaluation policy given a fixed dataset that was collected by running one or more different policies. One of the more empirically successful algorithms for OPE is fitted Q-evaluation (FQE), which uses temporal-difference updates to learn an action-value function that is then used to estimate the expected return of the evaluation policy. Typically, the original fixed dataset is fed directly into FQE to learn the action-value function of the evaluation policy. Instead, in this paper, we seek to enhance the data-efficiency of FQE by first transforming the fixed dataset using a learned encoder and then feeding the transformed dataset into FQE. To learn such an encoder, we introduce an OPE-tailored state-action behavioral similarity metric, and use this metric and the fixed dataset to learn an encoder that models it. Theoretically, we show that this metric allows us to bound the error in the resulting OPE estimate. Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. We also show empirically that the learned representations significantly mitigate divergence of FQE under varying distribution shifts. Our code is available at https://github.com/Badger-RL/ROPE.
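
The abstract outlines a two-step pipeline: learn a state-action encoder from the fixed dataset, then run FQE on the encoded dataset. The sketch below is a minimal illustration of that pipeline under stated assumptions, not the authors' implementation (see https://github.com/Badger-RL/ROPE for that). The `Encoder` class, the `fitted_q_evaluation` function, the dataset keys (`"s"`, `"a"`, `"r"`, `"s_next"`, `"done"`, `"s0"`), and the `eval_policy` callable are hypothetical names introduced here for illustration, and the encoder-training step (fitting the paper's OPE-tailored similarity metric) is omitted.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical encoder mapping a (state, action) pair to a compact representation."""

    def __init__(self, state_dim, action_dim, rep_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, rep_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def fitted_q_evaluation(encoder, dataset, eval_policy, gamma=0.99, iters=50):
    """FQE on encoded transitions: regress Q toward the TD target
    r + gamma * Q(phi(s', pi_e(s'))), then average Q over start states."""
    rep_dim = encoder.net[-1].out_features
    q_net = nn.Sequential(nn.Linear(rep_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    s, a, r, s_next, done = (torch.as_tensor(dataset[k], dtype=torch.float32)
                             for k in ("s", "a", "r", "s_next", "done"))
    with torch.no_grad():
        phi = encoder(s, a)  # representations of the logged (s, a) pairs

    for _ in range(iters):
        with torch.no_grad():  # a frozen target copy of q_net is typical; omitted for brevity
            phi_next = encoder(s_next, eval_policy(s_next))  # pi_e picks the next action
            target = r.unsqueeze(-1) + gamma * (1 - done.unsqueeze(-1)) * q_net(phi_next)
        loss = ((q_net(phi) - target) ** 2).mean()  # temporal-difference regression
        opt.zero_grad()
        loss.backward()
        opt.step()

    # OPE estimate: expected return of the evaluation policy from the start states.
    s0 = torch.as_tensor(dataset["s0"], dtype=torch.float32)
    with torch.no_grad():
        return q_net(encoder(s0, eval_policy(s0))).mean().item()
```
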


Information

Published In

NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems
December 2023
80772 pages

Publisher

Curran Associates Inc., Red Hook, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited
