Abstract
Off-policy evaluation (OPE) methods allow us to estimate the expected reward of a policy from logged data collected by a different policy. However, when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have high or even infinite variance. Saito and Joachims [13] propose marginalized IPS (MIPS), which uses action embeddings instead of the actions themselves, reducing the variance of IPS in large action spaces. MIPS assumes that good action embeddings can be defined by the practitioner, which is difficult in many real-world applications. In this work, we explore learning action embeddings from the logged data itself. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to a wider range of applications, and in our experiments it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method makes no assumptions about the reward model class, and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to doubly robust (DR) estimation for combining the low variance of the direct method (DM) with the low bias of IPS.
M. Cief—Work done during an internship at Amazon.
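To make the variance-reduction mechanism behind MIPS concrete, the following minimal sketch compares vanilla IPS weights with marginalized weights over action embeddings. It assumes a simplified, context-free setting and a hypothetical deterministic clustering of actions standing in for the learned embeddings (in the paper, embeddings are derived from intermediate outputs of a trained reward model); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions, n_clusters = 5000, 50, 5

# Logging and target policies over actions (context-free for brevity).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

# Hypothetical deterministic embedding: each action maps to one of a few
# clusters, standing in for embeddings learned from a reward model.
emb = rng.integers(0, n_clusters, size=n_actions)

# Logged data: actions sampled from pi_0, Bernoulli rewards whose mean
# depends only on the action's cluster (so the embedding is sufficient).
cluster_mean = rng.uniform(0.1, 0.9, size=n_clusters)
a = rng.choice(n_actions, size=n, p=pi_0)
r = rng.binomial(1, cluster_mean[emb[a]])

# Vanilla IPS: per-action importance weights pi_e(a) / pi_0(a).
w_ips = pi_e[a] / pi_0[a]
v_ips = np.mean(w_ips * r)

# MIPS-style estimate: marginal importance weights over embeddings,
# p(e | pi_e) / p(e | pi_0), which are ratios of sums and far less extreme.
p_e_target = np.array([pi_e[emb == c].sum() for c in range(n_clusters)])
p_e_logging = np.array([pi_0[emb == c].sum() for c in range(n_clusters)])
w_mips = p_e_target[emb[a]] / p_e_logging[emb[a]]
v_mips = np.mean(w_mips * r)

# Ground-truth value of the target policy in this synthetic setting.
true_v = float((pi_e * cluster_mean[emb]).sum())
print(f"IPS={v_ips:.3f}  MIPS={v_mips:.3f}  true={true_v:.3f}")
```

Because the synthetic reward depends only on the cluster, the embedding satisfies the no-direct-effect assumption of MIPS and the marginalized estimate stays unbiased while its weights vary far less than the per-action IPS weights.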
Notes
1. Code to reproduce this and further experiments is available at https://github.com/amazon-science/ope-learn-action-embeddings.
References
Chuklin, A., Markov, I., de Rijke, M.: Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7(3), 1–115 (2015)
Dhrymes, P.J.: Topics in Advanced Econometrics: Probability Foundations, vol. 1. Springer, Heidelberg (1989). https://doi.org/10.1007/978-1-4612-4548-3
Dudík, M., Erhan, D., Langford, J., Li, L.: Doubly robust policy evaluation and optimization. Stat. Sci. 29(4), 485–511 (2014). ISSN 0883-4237, 2168-8745. https://doi.org/10.1214/14-sts500. https://projecteuclid.org/journals/statistical-science/volume-29/issue-4/Doubly-Robust-Policy-Evaluation-and-Optimization/10.1214/14-STS500.full
Efron, B.: The efficiency of logistic regression compared to normal discriminant analysis. J. Am. Stat. Assoc. 70(352), 892–898 (1975)
Farajtabar, M., Chow, Y., Ghavamzadeh, M.: More robust doubly robust off-policy evaluation. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1447–1456. PMLR (2018). https://proceedings.mlr.press/v80/farajtabar18a.html. ISSN 2640-3498
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
Kallus, N., Zhou, A.: Policy evaluation and optimization with continuous treatments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp. 1243–1251. PMLR (2018). https://proceedings.mlr.press/v84/kallus18a.html. ISSN 2640-3498
Metelli, A.M., Russo, A., Restelli, M.: Subgaussian and differentiable importance sampling for off-policy evaluation and learning. In: Advances in Neural Information Processing Systems, vol. 34, pp. 8119–8132. Curran Associates, Inc. (2021). https://proceedings.neurips.cc/paper/2021/hash/4476b929e30dd0c4e8bdbcc82c6ba23a-Abstract.html
Peng, J., et al.: Offline policy evaluation in large action spaces via outcome-oriented action grouping. In: Proceedings of the ACM Web Conference 2023, WWW 2023, pp. 1220–1230. Association for Computing Machinery, New York (2023). ISBN 978-1-4503-9416-1. https://doi.org/10.1145/3543507.3583448. https://dl.acm.org/doi/10.1145/3543507.3583448
Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994). ISSN 0162-1459. https://doi.org/10.1080/01621459.1994.10476818
Sachdeva, N., Su, Y., Joachims, T.: Off-policy bandits with deficient support. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2020, pp. 965–975. Association for Computing Machinery, New York (2020). ISBN 978-1-4503-7998-4. https://doi.org/10.1145/3394486.3403139. https://dl.acm.org/doi/10.1145/3394486.3403139
Saito, Y., Aihara, S., Matsutani, M., Narita, Y.: Open bandit dataset and pipeline: towards realistic and reproducible off-policy evaluation (2021). https://doi.org/10.48550/arXiv.2008.07146. arXiv:2008.07146 [cs, stat]
Saito, Y., Joachims, T.: Off-policy evaluation for large action spaces via embeddings. In: Proceedings of the 39th International Conference on Machine Learning, pp. 19089–19122. PMLR (2022). https://proceedings.mlr.press/v162/saito22a.html. ISSN 2640-3498
Saito, Y., Ren, Q., Joachims, T.: Off-policy evaluation for large action spaces via conjunct effect modeling. In: Proceedings of the 40th International Conference on Machine Learning, pp. 29734–29759. PMLR (2023). https://proceedings.mlr.press/v202/saito23b.html. ISSN 2640-3498
Su, Y., Dimakopoulou, M., Krishnamurthy, A., Dudik, M.: Doubly robust off-policy evaluation with shrinkage. In: Proceedings of the 37th International Conference on Machine Learning, pp. 9167–9176. PMLR (2020). https://proceedings.mlr.press/v119/su20a.html. ISSN 2640-3498
Su, Y., Wang, L., Santacatterina, M., Joachims, T.: CAB: continuous adaptive blending for policy evaluation and learning. In: Proceedings of the 36th International Conference on Machine Learning, pp. 6005–6014. PMLR (2019). https://proceedings.mlr.press/v97/su19a.html. ISSN 2640-3498
Swaminathan, A.: Counterfactual Evaluation and Learning From Logged User Feedback. Ph.D. thesis, Cornell University, Ithaca, NY, United States (2017). https://ecommons.cornell.edu/handle/1813/51557
Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015). https://proceedings.neurips.cc/paper/2015/hash/39027dfad5138c9ca0c474d71db915c3-Abstract.html
Swaminathan, A., et al.: Off-policy evaluation for slate recommendation. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/hash/5352696a9ca3397beb79f116f3a33991-Abstract.html
Wang, Y.X., Agarwal, A., Dudik, M.: Optimal and adaptive off-policy evaluation in contextual bandits. In: Proceedings of the 34th International Conference on Machine Learning, pp. 3589–3597. PMLR (2017). https://proceedings.mlr.press/v70/wang17a.html. ISSN 2640-3498
Zhou, L.: A Survey on Contextual Multi-armed Bandits (2016). https://doi.org/10.48550/arXiv.1508.03326. arXiv:1508.03326 [cs]
Acknowledgement
We thank Mohamed Sadek for his contributions to the codebase. The research conducted by Matej Cief (also with slovak.AI) was partially supported by TAILOR, a project funded by EU Horizon 2020 under GA No. 952215, https://doi.org/10.3030/952215.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Cief, M., Golebiowski, J., Schmidt, P., Abedjan, Z., Bekasov, A. (2024). Learning Action Embeddings for Off-Policy Evaluation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_7
DOI: https://doi.org/10.1007/978-3-031-56027-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56026-2
Online ISBN: 978-3-031-56027-9
eBook Packages: Computer Science, Computer Science (R0)