DOI: 10.1145/3539618.3592004
Short paper | Public Access

Inference at Scale: Significance Testing for Large Search and Recommendation Experiments

Published: 18 July 2023

Abstract

Several information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests exhibit significantly higher Type-1 error rates at large sample sizes than the bootstrap, randomization, and t-tests, which were more consistent with the expected error rate. While the tests differed in power at smaller sample sizes, they showed no difference in power at large sample sizes. We therefore recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results also demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results; the effect size should therefore be used to determine practical or scientific significance.
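
To make the comparison concrete, here is a minimal sketch of the five tests the abstract compares, run on a synthetic paired per-user metric using SciPy. The data, sample size, and every variable name below are illustrative assumptions, not the paper's experimental setup, and a bootstrap confidence interval stands in for the paper's bootstrap test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical paired per-user scores (e.g., nDCG) for two systems;
    # synthetic data for illustration only.
    n_users = 10_000
    baseline = rng.beta(2.0, 5.0, size=n_users)
    variant = np.clip(baseline + rng.normal(0.002, 0.05, size=n_users), 0.0, 1.0)
    diff = variant - baseline

    # Paired t-test on the per-user differences.
    t_res = stats.ttest_rel(variant, baseline)

    # Wilcoxon signed-rank test (one of the two tests the paper finds
    # to have inflated Type-1 error at large sample sizes).
    w_res = stats.wilcoxon(variant, baseline)

    # Sign test: a binomial test on the count of positive differences.
    n_pos = int(np.sum(diff > 0))
    n_nonzero = int(np.sum(diff != 0))
    sign_p = stats.binomtest(n_pos, n_nonzero, p=0.5).pvalue

    # Randomization (permutation) test: with a single sample of paired
    # differences, permutation_type="samples" randomly flips signs.
    perm = stats.permutation_test((diff,), np.mean,
                                  permutation_type="samples",
                                  n_resamples=9999, random_state=0)

    # Bootstrap 95% confidence interval for the mean difference.
    boot = stats.bootstrap((diff,), np.mean, n_resamples=9999,
                           random_state=0)

    # Paired Cohen's d: at n in the thousands even a tiny mean shift is
    # statistically significant, so report effect size as well.
    cohens_d = diff.mean() / diff.std(ddof=1)

    print(f"t p={t_res.pvalue:.3g}  Wilcoxon p={w_res.pvalue:.3g}  "
          f"sign p={sign_p:.3g}  randomization p={perm.pvalue:.3g}")
    print(f"bootstrap 95% CI = {boot.confidence_interval}")
    print(f"paired Cohen's d = {cohens_d:.3f}")

With ten thousand users, the p-values typically come out well below 0.05 even for the 0.002 mean shift simulated here, which is the abstract's point: at this scale statistical significance is nearly guaranteed, and an effect size such as the paired Cohen's d (roughly 0.04 in this simulation) is what indicates practical importance.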



Information

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

• evaluation
• statistical inference

Qualifiers

• Short paper

Conference

SIGIR '23

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%

