DOI: 10.1145/3539618.3592004
Short paper | Public Access

Inference at Scale: Significance Testing for Large Search and Recommendation Experiments

Published: 18 July 2023

Abstract

Several information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests exhibit significantly higher Type-1 error rates at large sample sizes than the bootstrap, randomization, and t-tests, which were more consistent with the expected error rate. While the tests differed in power at smaller sample sizes, they showed no difference in power at large sample sizes. We therefore recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results also demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results; the effect size should therefore be used to determine practical or scientific significance.
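
To make the comparison concrete, here is a minimal sketch of the five tests the abstract compares, run on a synthetic paired per-user metric using SciPy. The data, sample size, and every variable name below are illustrative assumptions, not the paper's experimental setup, and a bootstrap confidence interval stands in for the paper's bootstrap test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical paired per-user scores (e.g., nDCG) for two systems;
    # synthetic data for illustration only.
    n_users = 10_000
    baseline = rng.beta(2.0, 5.0, size=n_users)
    variant = np.clip(baseline + rng.normal(0.002, 0.05, size=n_users), 0.0, 1.0)
    diff = variant - baseline

    # Paired t-test on the per-user differences.
    t_res = stats.ttest_rel(variant, baseline)

    # Wilcoxon signed-rank test (one of the two tests the paper finds
    # to have inflated Type-1 error at large sample sizes).
    w_res = stats.wilcoxon(variant, baseline)

    # Sign test: a binomial test on the count of positive differences.
    n_pos = int(np.sum(diff > 0))
    n_nonzero = int(np.sum(diff != 0))
    sign_p = stats.binomtest(n_pos, n_nonzero, p=0.5).pvalue

    # Randomization (permutation) test: with a single sample of paired
    # differences, permutation_type="samples" randomly flips signs.
    perm = stats.permutation_test((diff,), np.mean,
                                  permutation_type="samples",
                                  n_resamples=9999, random_state=0)

    # Bootstrap 95% confidence interval for the mean difference.
    boot = stats.bootstrap((diff,), np.mean, n_resamples=9999,
                           random_state=0)

    # Paired Cohen's d: at n in the thousands even a tiny mean shift is
    # statistically significant, so report effect size as well.
    cohens_d = diff.mean() / diff.std(ddof=1)

    print(f"t p={t_res.pvalue:.3g}  Wilcoxon p={w_res.pvalue:.3g}  "
          f"sign p={sign_p:.3g}  randomization p={perm.pvalue:.3g}")
    print(f"bootstrap 95% CI = {boot.confidence_interval}")
    print(f"paired Cohen's d = {cohens_d:.3f}")

With ten thousand users, the p-values typically come out well below 0.05 even for the 0.002 mean shift simulated here, which is the abstract's point: at this scale statistical significance is nearly guaranteed, and an effect size such as the paired Cohen's d (roughly 0.04 in this simulation) is what indicates practical importance.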



Information

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

• evaluation
• statistical inference

Qualifiers

• Short paper

Conference

SIGIR '23

Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions, 20%

