The Simpson’s Paradox in the Offline Evaluation of Recommendation Systems

Published: 08 September 2021

Abstract

Recommendation systems are often evaluated on users' interactions that were collected from an existing, already deployed recommendation system. In this setting, users provide feedback only on the items they were exposed to, and may leave no feedback on items the deployed system never showed them. As a result, the collected feedback dataset used to evaluate a new model is influenced by the deployed system, in a form of closed loop feedback. In this article, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox. Simpson's paradox is the name given to a phenomenon in which a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined. Our in-depth experiments based on stratified sampling reveal that a very small minority of items that are frequently exposed by the deployed system acts as a confounding factor in the offline evaluation of recommendation systems. In addition, we propose a novel evaluation methodology that takes this confounder, i.e., the deployed system's characteristics, into account. Comparing the relative ranking of many recommendation models, as in the typical offline evaluation of recommender systems, and measuring agreement with the Kendall rank correlation coefficient, we show that our proposed evaluation methodology yields statistically significant improvements of 14% and 40% on the examined open loop datasets (Yahoo! and Coat, respectively) over the standard evaluation in reflecting the true ranking of systems obtained under an open loop (randomised) evaluation.
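To make the two ingredients of this evaluation concrete, the short Python sketch below (not code from the article; all item counts, model names, and scores are invented for illustration) first shows how a trend that holds within every exposure stratum can reverse once the strata are pooled, and then uses the Kendall rank correlation coefficient to measure how well an offline (closed loop) ranking of models agrees with the ranking obtained from an open loop (randomised) evaluation.

```python
# A minimal, self-contained sketch (not code from the article); all item counts,
# model names, and scores below are invented purely for illustration.
#
# (1) Simpson's paradox: model A beats model B inside every exposure stratum,
#     yet B looks better once the strata are pooled, because the two models'
#     feedback is distributed very differently across the strata.
# (2) Kendall's tau: measure how well an offline (closed loop) ranking of
#     models agrees with the ranking from an open loop (randomised) evaluation.
from scipy.stats import kendalltau

# (hits, trials) for two hypothetical models in two strata of items.
strata = {
    "head items (frequently exposed)": {"A": (80, 100),  "B": (780, 1000)},
    "tail items (rarely exposed)":     {"A": (200, 1000), "B": (18, 100)},
}
pooled = {"A": [0, 0], "B": [0, 0]}
for stratum, results in strata.items():
    rates = {m: h / n for m, (h, n) in results.items()}
    print(f"{stratum}: A={rates['A']:.2f}  B={rates['B']:.2f}")  # A wins in both strata
    for m, (h, n) in results.items():
        pooled[m][0] += h
        pooled[m][1] += n
print("pooled:", {m: round(h / n, 2) for m, (h, n) in pooled.items()})  # B wins overall

# Hypothetical nDCG scores of five models under each evaluation protocol.
open_loop   = {"BPR": 0.31, "MF": 0.28, "NeuMF": 0.27, "ItemKNN": 0.22, "Pop": 0.10}
closed_loop = {"BPR": 0.24, "MF": 0.26, "NeuMF": 0.21, "ItemKNN": 0.23, "Pop": 0.30}
names = list(open_loop)
tau, p_value = kendalltau([open_loop[m] for m in names],
                          [closed_loop[m] for m in names])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")  # low tau: rankings disagree
```

A tau close to 1 would mean the closed loop evaluation preserves the true ordering of the models; the 14% and 40% figures reported in the abstract refer to improvements in this kind of rank agreement on the Yahoo! and Coat datasets, respectively.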

References

[1]
Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. In Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[2]
Qingyao Ai, Jiaxin Mao, Yiqun Liu, and W. Bruce Croft. 2018. Unbiased learning to rank: Theory and practice. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management.
[3]
Jeff Alstott, Ed Bullmore, and Dietmar Plenz. 2014. Powerlaw: A python package for analysis of heavy-tailed distributions. PLOS ONE 9, 1 (2014), e95816.
[4]
Ashton Anderson, Lucas Maystre, Ian Anderson, Rishabh Mehrotra, and Mounia Lalmas. 2020. Algorithmic effects on the diversity of consumption on spotify. In Proceedings of The Web Conference 2020.
[5]
Peter C. Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46, 3 (2011), 399–424.
[6]
Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems. Information Retrieval Journal 23, 4 (2020), 387–410.
[7]
Allison J. B. Chaney, Brandon M. Stewart, and Barbara E. Engelhardt. 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems.
[8]
C. R. Charig, D. R. Webb, S. R. Payne, and J. E. Wickham. 1986. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.BMJ 292, 6524 (1986), 879–882.
[9]
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 4th ACM Conference on Recommender Systems.
[10]
Fernando Diaz, Bhaskar Mitra, Michael D. Ekstrand, Asia J. Biega, and Ben Carterette. 2020. Evaluating stochastic rankings with expected exposure. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.
[11]
Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, and Ben Carterette. 2019. Offline evaluation to make decisions about playlistrecommendation algorithms. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining.
[12]
F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19.
[13]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International World Wide Web Conference.
[14]
Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53.
[15]
Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining.
[16]
Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2019. How sensitive is recommendation systems’ offline evaluation to popularity?. In Proceedings of the Workshop on Offline Evaluation for Recommender Systems.
[17]
Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2020. Using exploration to alleviate closed-loop effects in recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[18]
Ray Jiang, Silvia Chiappa, Tor Lattimore, András György, and Pushmeet Kohli. 2019. Degenerate feedback loops in recommender systems. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society.
[19]
Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[20]
Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining.
[21]
Rogier A. Kievit, Willem E. Frankenhuis, Lourens J. Waldorp, and Denny Borsboom. 2013. Simpson’s paradox in psychological science: A practical guide. Frontiers in Psychology 4 (2013), 513.
[22]
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[23]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[24]
Daniel D. Lee and H. Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems.
[25]
Craig Macdonald, Rodrygo L. Santos, and Iadh Ounis. 2013. The whens and hows of learning to rank for web search. Information Retrieval 16, 5 (2013), 584–628.
[26]
Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems.
[27]
Harrie Oosterhuis and Maarten de Rijke. 2020. Policy-aware unbiased learning to rank for top-k rankings. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[28]
Zohreh Ovaisi, Ragib Ahsan, Yifan Zhang, Kathryn Vasilaky, and Elena Zheleva. 2020. Correcting for selection bias in learning-to-rank systems. In Proceedings of The Web Conference.
[29]
Judea Pearl. 2014. Comment: Understanding simpson’s paradox. The American Statistician 68, 1 (2014), 43–52.
[30]
Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality?. In Proceedings of the 17th ACM Conference on Information and Knowledge Management.
[31]
Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2011. Evaluating Search Engine Relevance with Click-Based Metrics.
[32]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
[33]
Marco Rossetti, Fabio Stella, and Markus Zanker. 2016. Contrasting offline and online results when evaluating recommendation algorithms. In Proceedings of the 10th ACM Conference on Recommender Systems.
[34]
Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic matrix factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems.
[35]
Claude Sammut and Geoffrey I. Webb (Eds.). 2010. Holdout Evaluation, Encyclopedia of Machine Learning. 506–507.
[36]
Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on International Conference on Machine Learning.
[37]
Harald Steck. 2010. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[38]
Harald Steck. 2011. Item popularity and recommendation accuracy. In Proceedings of the 5th ACM Conference on Recommender Systems.
[39]
James H Steiger. 1980. Tests for comparing elements of a correlation matrix.Psychological Bulletin 87, 2 (1980), 245–251.
[40]
Wenlong Sun, Sami Khenissi, Olfa Nasraoui, and Patrick Shafto. 2019. Debiasing the human-recommender system feedback loop in collaborative filtering. In Companion Proceedings of The 2019 World Wide Web Conference.
[41]
Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 52 (2015), 1731–1755.
[42]
Steven K. Thompson. 2012. Stratified Sampling. John Wiley and Sons, Ltd, Chapter 11, 139–156.
[43]
Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems.
[44]
Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[45]
Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. 2018. The deconfounded recommender: A causal inference approach to recommendation. arXiv:1808.06581. Retrieved from https://arxiv.org/abs/1808.06581.
[46]
Markus Weimer, Alexandros Karatzoglou, and Alex Smola. 2008. Improving maximum margin matrix factorization. Mach. Learn. 72, 3 (2008), 263–276.
[47]
Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems.

    Published In

    ACM Transactions on Information Systems, Volume 40, Issue 1
    January 2022, 599 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/3483337

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 September 2021
    Accepted: 01 March 2021
    Revised: 01 December 2020
    Received: 01 July 2020
    Published in TOIS Volume 40, Issue 1

    Author Tags

    1. Offline evaluation
    2. Simpson’s paradox
    3. experimental design
    4. selection bias
    5. popularity bias

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • EPSRC

