Abstract
Off-policy evaluation (OPE) methods allow us to estimate the expected reward of a policy from logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS), which uses action embeddings instead of the actions themselves, reducing the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult in many real-world applications. In this work, we explore learning action embeddings from the logged data itself. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to a wider range of applications, and in our experiments it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method makes no assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to doubly robust (DR) estimation for combining the low variance of the direct method (DM) with the low bias of IPS.
M. Cief—Work done during an internship at Amazon.
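To make the variance-reduction mechanism behind MIPS concrete, the following minimal sketch compares vanilla IPS weights with marginalized weights over action embeddings. It assumes a simplified, context-free setting and a hypothetical deterministic clustering of actions standing in for the learned embeddings (in the paper, embeddings are derived from intermediate outputs of a trained reward model); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, n_clusters = 5000, 50, 5

# Logging and target policies over actions (context-free for brevity).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

# Hypothetical deterministic embedding: each action maps to one of a few
# clusters, standing in for embeddings learned from a reward model.
emb = rng.integers(0, n_clusters, size=n_actions)

# Logged data: actions sampled from pi_0, Bernoulli rewards whose mean
# depends only on the action's cluster (so the embedding is sufficient).
cluster_mean = rng.uniform(0.1, 0.9, size=n_clusters)
a = rng.choice(n_actions, size=n, p=pi_0)
r = rng.binomial(1, cluster_mean[emb[a]])

# Vanilla IPS: per-action importance weights pi_e(a) / pi_0(a).
w_ips = pi_e[a] / pi_0[a]
v_ips = np.mean(w_ips * r)

# MIPS-style estimate: marginal importance weights over embeddings,
# p(e | pi_e) / p(e | pi_0), which are ratios of sums and far less extreme.
p_e_target = np.array([pi_e[emb == c].sum() for c in range(n_clusters)])
p_e_logging = np.array([pi_0[emb == c].sum() for c in range(n_clusters)])
w_mips = p_e_target[emb[a]] / p_e_logging[emb[a]]
v_mips = np.mean(w_mips * r)

# Ground-truth value of the target policy in this synthetic setting.
true_v = float((pi_e * cluster_mean[emb]).sum())
print(f"IPS={v_ips:.3f}  MIPS={v_mips:.3f}  true={true_v:.3f}")
```

Because the synthetic reward depends only on the cluster, the embedding satisfies the no-direct-effect assumption of MIPS and the marginalized estimate stays unbiased while its weights vary far less than the per-action IPS weights.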
Notes
1. Code to reproduce this and further experiments is available at https://github.com/amazon-science/ope-learn-action-embeddings.
References
Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7(3), 1–115 (2015)
Dhrymes, P.J.: Topics in Advanced Econometrics: Probability Foundations, vol. 1. Springer, Heidelberg (1989). https://doi.org/10.1007/978-1-4612-4548-3
Dudík, M., Erhan, D., Langford, J., Li, L.: Doubly robust policy evaluation and optimization. Stat. Sci. 29(4), 485–511 (2014). ISSN 0883-4237, 2168-8745. https://doi.org/10.1214/14-sts500. https://projecteuclid.org/journals/statistical-science/volume-29/issue-4/Doubly-Robust-Policy-Evaluation-and-Optimization/10.1214/14-STS500.full
Efron, B.: The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc. 70(352), 892–898 (1975)
Farajtabar, M., Chow, Y., Ghavamzadeh, M.: More robust doubly robust off-policy evaluation. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1447–1456. PMLR (2018). https://proceedings.mlr.press/v80/farajtabar18a.html. ISSN 2640-3498
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
Kallus, N., Zhou, A.: Policy evaluation and optimization with continuous treatments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 1243–1251. PMLR (2018). https://proceedings.mlr.press/v84/kallus18a.html. ISSN 2640-3498
Metelli, A.M., Russo, A., Restelli, M.: Subgaussian and differentiable importance sampling for off-policy evaluation and learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 8119–8132. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
Peng, J., et al.: Offline policy evaluation in large action spaces via outcome-oriented action grouping. In: Proceedings of the ACM Web Conference 2023, WWW 2023, pp. 1220–1230. Association for Computing Machinery, New York (2023). ISBN 978-1-4503-9416-1. https://doi.org/10.1145/3543507.3583448. https://dl.acm.org/doi/10.1145/3543507.3583448
Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994). ISSN 0162-1459. https://doi.org/10.1080/01621459.1994.10476818
Sachdeva, N., Su, Y., Joachims, T.: Off-policy bandits with deficient support. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2020, pp. 965–975. Association for Computing Machinery, New York (2020). ISBN 978-1-4503-7998-4. https://doi.org/10.1145/3394486.3403139. https://dl.acm.org/doi/10.1145/3394486.3403139
Saito, Y., Aihara, S., Matsutani, M., Narita, Y.: Open bandit dataset and pipeline: towards realistic and reproducible off-policy evaluation (2021). https://doi.org/10.48550/arXiv.2008.07146. arXiv:2008.07146 [cs, stat]
Saito, Y., Joachims, T.: Off-policy evaluation for large action spaces via embeddings. In: Proceedings of the 39th International Conference on Machine Learning, pp. 19089–19122. PMLR (2022). https://proceedings.mlr.press/v162/saito22a.html. ISSN 2640-3498
Saito, Y., Ren, Q., Joachims, T.: Off-policy evaluation for large action spaces via conjunct effect modeling. In: Proceedings of the 40th International Conference on Machine Learning, pp. 29734–29759. PMLR (2023). https://proceedings.mlr.press/v202/saito23b.html. ISSN 2640-3498
Su, Y., Dimakopoulou, M., Krishnamurthy, A., Dudik, M.: Doubly robust off-policy evaluation with shrinkage. In: Proceedings of the 37th International Conference on Machine Learning, pp. 9167–9176. PMLR (2020). https://proceedings.mlr.press/v119/su20a.html. ISSN 2640-3498
Su, Y., Wang, L., Santacatterina, M., Joachims, T.: CAB: continuous adaptive blending for policy evaluation and learning. In: Proceedings of the 36th International Conference on Machine Learning, pp. 6005–6014. PMLR (2019). https://proceedings.mlr.press/v97/su19a.html. ISSN 2640-3498
Swaminathan, A.: Counterfactual Evaluation and Learning From Logged User Feedback. Ph.D. thesis, Cornell University, Ithaca, NY, United States (2017). https://ecommons.cornell.edu/handle/1813/51557
Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015). https://proceedings.neurips.cc/paper/2015/hash/39027dfad5138c9ca0c474d71db915c3-Abstract.html
Swaminathan, A., et al.: Off-policy evaluation for slate recommendation. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/hash/5352696a9ca3397beb79f116f3a33991-Abstract.html
Wang, Y.X., Agarwal, A., Dudik, M.: Optimal and adaptive off-policy evaluation in contextual bandits. In: Proceedings of the 34th International Conference on Machine Learning, pp. 3589–3597. PMLR (2017). https://proceedings.mlr.press/v70/wang17a.html. ISSN 2640-3498
Zhou, L.: A Survey on Contextual Multi-armed Bandits (2016). https://doi.org/10.48550/arXiv.1508.03326. arXiv:1508.03326 [cs]
Acknowledgement
We thank Mohamed Sadek for his contributions to the codebase. The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Cief, M., Golebiowski, J., Schmidt, P., Abedjan, Z., Bekasov, A. (2024). Learning Action Embeddings for Off-Policy Evaluation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_7
DOI: https://doi.org/10.1007/978-3-031-56027-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56026-2
Online ISBN: 978-3-031-56027-9
eBook Packages: Computer Science, Computer Science (R0)