The Simpson’s Paradox in the Offline Evaluation of Recommendation Systems

Published: 08 September 2021

Abstract

Recommendation systems are often evaluated on users' interactions that were collected from an existing, already deployed recommendation system. In this setting, users provide feedback only on the items they were exposed to, and may leave no feedback on items the deployed system never showed them. As a result, the collected feedback dataset used to evaluate a new model is influenced by the deployed system, in a form of closed loop feedback. In this article, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox. Simpson's paradox is the name given to a phenomenon in which a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined. Our in-depth experiments based on stratified sampling reveal that a very small minority of items that are frequently exposed by the deployed system acts as a confounding factor in the offline evaluation of recommendation systems. In addition, we propose a novel evaluation methodology that takes this confounder, i.e., the deployed system's characteristics, into account. Comparing the relative ranking of many recommendation models, as in the typical offline evaluation of recommender systems, and measuring agreement with the Kendall rank correlation coefficient, we show that our proposed evaluation methodology yields statistically significant improvements of 14% and 40% on the examined open loop datasets (Yahoo! and Coat, respectively) over the standard evaluation in reflecting the true ranking of systems obtained under an open loop (randomised) evaluation.
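To make the two ingredients of this evaluation concrete, the short Python sketch below (not code from the article; all item counts, model names, and scores are invented for illustration) first shows how a trend that holds within every exposure stratum can reverse once the strata are pooled, and then uses the Kendall rank correlation coefficient to measure how well an offline (closed loop) ranking of models agrees with the ranking obtained from an open loop (randomised) evaluation.

```python
# A minimal, self-contained sketch (not code from the article); all item counts,
# model names, and scores below are invented purely for illustration.
#
# (1) Simpson's paradox: model A beats model B inside every exposure stratum,
#     yet B looks better once the strata are pooled, because the two models'
#     feedback is distributed very differently across the strata.
# (2) Kendall's tau: measure how well an offline (closed loop) ranking of
#     models agrees with the ranking from an open loop (randomised) evaluation.
from scipy.stats import kendalltau

# (hits, trials) for two hypothetical models in two strata of items.
strata = {
    "head items (frequently exposed)": {"A": (80, 100),  "B": (780, 1000)},
    "tail items (rarely exposed)":     {"A": (200, 1000), "B": (18, 100)},
}
pooled = {"A": [0, 0], "B": [0, 0]}
for stratum, results in strata.items():
    rates = {m: h / n for m, (h, n) in results.items()}
    print(f"{stratum}: A={rates['A']:.2f}  B={rates['B']:.2f}")  # A wins in both strata
    for m, (h, n) in results.items():
        pooled[m][0] += h
        pooled[m][1] += n
print("pooled:", {m: round(h / n, 2) for m, (h, n) in pooled.items()})  # B wins overall

# Hypothetical nDCG scores of five models under each evaluation protocol.
open_loop   = {"BPR": 0.31, "MF": 0.28, "NeuMF": 0.27, "ItemKNN": 0.22, "Pop": 0.10}
closed_loop = {"BPR": 0.24, "MF": 0.26, "NeuMF": 0.21, "ItemKNN": 0.23, "Pop": 0.30}
names = list(open_loop)
tau, p_value = kendalltau([open_loop[m] for m in names],
                          [closed_loop[m] for m in names])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")  # low tau: rankings disagree
```

A tau close to 1 would mean the closed loop evaluation preserves the true ordering of the models; the 14% and 40% figures reported in the abstract refer to improvements in this kind of rank agreement on the Yahoo! and Coat datasets, respectively.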

References

[1]
Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W. Bruce Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. In Proceedings of the 41th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[2]
Qingyao Ai, Jiaxin Mao, Yiqun Liu, and W. Bruce Croft. 2018. Unbiased learning to rank: Theory and practice. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management.
[3]
Jeff Alstott, Ed Bullmore, and Dietmar Plenz. 2014. Powerlaw: A python package for analysis of heavy-tailed distributions. PLOS ONE 9, 1 (2014), e95816.
[4]
Ashton Anderson, Lucas Maystre, Ian Anderson, Rishabh Mehrotra, and Mounia Lalmas. 2020. Algorithmic effects on the diversity of consumption on spotify. In Proceedings of The Web Conference 2020.
[5]
Peter C. Austin. 2011. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46, 3 (2011), 399–424.
[6]
Rocío Cañamares, Pablo Castells, and Alistair Moffat. 2020. Offline evaluation options for recommender systems. Information Retrieval Journal 23, 4 (2020), 387–410.
[7]
Allison J. B. Chaney, Brandon M. Stewart, and Barbara E. Engelhardt. 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems.
[8]
C. R. Charig, D. R. Webb, S. R. Payne, and J. E. Wickham. 1986. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.BMJ 292, 6524 (1986), 879–882.
[9]
Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 4th ACM Conference on Recommender Systems.
[10]
Fernando Diaz, Bhaskar Mitra, Michael D. Ekstrand, Asia J. Biega, and Ben Carterette. 2020. Evaluating stochastic rankings with expected exposure. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.
[11]
Alois Gruson, Praveen Chandar, Christophe Charbuillet, James McInerney, Samantha Hansen, Damien Tardieu, and Ben Carterette. 2019. Offline evaluation to make decisions about playlistrecommendation algorithms. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining.
[12]
F. Maxwell Harper and Joseph A. Konstan. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems 5, 4 (2015), 1–19.
[13]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International World Wide Web Conference.
[14]
Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53.
[15]
Y. Hu, Y. Koren, and C. Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 8th IEEE International Conference on Data Mining.
[16]
Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2019. How sensitive is recommendation systems’ offline evaluation to popularity?. In Proceedings of the Workshop on Offline Evaluation for Recommender Systems.
[17]
Amir H. Jadidinejad, Craig Macdonald, and Iadh Ounis. 2020. Using exploration to alleviate closed-loop effects in recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[18]
Ray Jiang, Silvia Chiappa, Tor Lattimore, András György, and Pushmeet Kohli. 2019. Degenerate feedback loops in recommender systems. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society.
[19]
Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[20]
Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining.
[21]
Rogier A. Kievit, Willem E. Frankenhuis, Lourens J. Waldorp, and Denny Borsboom. 2013. Simpson’s paradox in psychological science: A practical guide. Frontiers in Psychology 4 (2013), 513.
[22]
Yehuda Koren. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[23]
Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[24]
Daniel D. Lee and H. Sebastian Seung. 2000. Algorithms for non-negative matrix factorization. In Proceedings of the 13th International Conference on Neural Information Processing Systems.
[25]
Craig Macdonald, Rodrygo L. Santos, and Iadh Ounis. 2013. The whens and hows of learning to rank for web search. Information Retrieval 16, 5 (2013), 584–628.
[26]
Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the 3rd ACM Conference on Recommender Systems.
[27]
Harrie Oosterhuis and Maarten de Rijke. 2020. Policy-aware unbiased learning to rank for top-k rankings. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[28]
Zohreh Ovaisi, Ragib Ahsan, Yifan Zhang, Kathryn Vasilaky, and Elena Zheleva. 2020. Correcting for selection bias in learning-to-rank systems. In Proceedings of The Web Conference.
[29]
Judea Pearl. 2014. Comment: Understanding simpson’s paradox. The American Statistician 68, 1 (2014), 43–52.
[30]
Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality?. In Proceedings of the 17th ACM Conference on Information and Knowledge Management.
[31]
Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2011. Evaluating Search Engine Relevance with Click-Based Metrics.
[32]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
[33]
Marco Rossetti, Fabio Stella, and Markus Zanker. 2016. Contrasting offline and online results when evaluating recommendation algorithms. In Proceedings of the 10th ACM Conference on Recommender Systems.
[34]
Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic matrix factorization. In Proceedings of the 20th International Conference on Neural Information Processing Systems.
[35]
Claude Sammut and Geoffrey I. Webb (Eds.). 2010. Holdout Evaluation, Encyclopedia of Machine Learning. 506–507.
[36]
Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In Proceedings of the 33rd International Conference on International Conference on Machine Learning.
[37]
Harald Steck. 2010. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[38]
Harald Steck. 2011. Item popularity and recommendation accuracy. In Proceedings of the 5th ACM Conference on Recommender Systems.
[39]
James H Steiger. 1980. Tests for comparing elements of a correlation matrix.Psychological Bulletin 87, 2 (1980), 245–251.
[40]
Wenlong Sun, Sami Khenissi, Olfa Nasraoui, and Patrick Shafto. 2019. Debiasing the human-recommender system feedback loop in collaborative filtering. In Companion Proceedings of The 2019 World Wide Web Conference.
[41]
Adith Swaminathan and Thorsten Joachims. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research 16, 52 (2015), 1731–1755.
[42]
Steven K. Thompson. 2012. Stratified Sampling. John Wiley and Sons, Ltd, Chapter 11, 139–156.
[43]
Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems.
[44]
Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[45]
Yixin Wang, Dawen Liang, Laurent Charlin, and David M. Blei. 2018. The deconfounded recommender: A causal inference approach to recommendation. arXiv:1808.06581. Retrieved from https://arxiv.org/abs/1808.06581.
[46]
Markus Weimer, Alexandros Karatzoglou, and Alex Smola. 2008. Improving maximum margin matrix factorization. Mach. Learn. 72, 3 (2008), 263–276.
[47]
Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased offline recommender evaluation for missing-not-at-random implicit feedback. In Proceedings of the 12th ACM Conference on Recommender Systems.

    Published In

    ACM Transactions on Information Systems, Volume 40, Issue 1
    January 2022, 599 pages
    ISSN: 1046-8188
    EISSN: 1558-2868
    DOI: 10.1145/3483337

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 September 2021
    Accepted: 01 March 2021
    Revised: 01 December 2020
    Received: 01 July 2020
    Published in TOIS Volume 40, Issue 1

    Author Tags

    1. Offline evaluation
    2. Simpson’s paradox
    3. experimental design
    4. selection bias
    5. popularity bias

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • EPSRC

