DOI: 10.1145/3412841.3441945

Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation

Published: 22 April 2021

Abstract

Null Hypothesis Significance Testing (NHST) is routinely employed as the reference framework for assessing differences in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests such as the t-test, the Wilcoxon Signed Rank test, the permutation test, the sign test, or the bootstrap test. However, which of these tests is the most reliable in IR experimentation remains controversial: different authors have tried to shed light on the issue, but their conclusions do not agree. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method builds models of the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge of whether the null hypothesis is true or false. Following our methodology, we computed a series of simulations that estimate the proportion of Type I and Type II errors made by different tests. Results conclusively suggest that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool to assess differences between IR systems.
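The core idea of the simulation, controlling the truth of the null hypothesis by construction and counting how often a test rejects it, can be sketched in a few lines. The sketch below is illustrative, not the authors' actual system models: it assumes Gaussian per-topic scores and uses a simple sign-flipping paired permutation test; since both "systems" draw from the same distribution, H0 is true and the rejection rate estimates the Type I error.

```python
import random


def perm_test_pvalue(diffs, n_perm=200, rng=None):
    """Two-sided paired (sign-flipping) permutation test on per-topic
    score differences. Returns an estimated p-value."""
    rng = rng or random.Random(0)
    n = len(diffs)
    obs = abs(sum(diffs) / n)
    # Include the observed statistic itself so the p-value is never zero.
    count = 1
    for _ in range(n_perm):
        flipped = sum(d * rng.choice((-1, 1)) for d in diffs) / n
        if abs(flipped) >= obs:
            count += 1
    return count / (n_perm + 1)


def type_i_error_rate(n_topics=50, n_reps=200, alpha=0.05, seed=42):
    """Estimate the Type I error rate under a true null hypothesis.
    Both simulated systems sample per-topic scores from the same
    (hypothetical) Gaussian, so every rejection is a Type I error."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        a = [rng.gauss(0.30, 0.10) for _ in range(n_topics)]
        b = [rng.gauss(0.30, 0.10) for _ in range(n_topics)]
        diffs = [x - y for x, y in zip(a, b)]
        if perm_test_pvalue(diffs, rng=rng) < alpha:
            rejections += 1
    return rejections / n_reps
```

A well-behaved test should reject at a rate close to the nominal alpha (here 0.05); estimating Type II errors works the same way, except the two score distributions are made genuinely different so that every non-rejection is an error.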



    Published In

    SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
    March 2021
    2075 pages
    ISBN:9781450381048
    DOI:10.1145/3412841

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. information retrieval
    2. simulation
    3. statistical testing

    Qualifiers

    • Research-article

    Conference

    SAC '21
    Sponsor:
    SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
    March 22 - 26, 2021
    Virtual Event, Republic of Korea

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 24 Dec 2024


    Cited By

    • (2024) Uncontextualized significance considered dangerous. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 261-270. DOI: 10.1145/3626772.3657827. Online publication date: 10-Jul-2024.
    • (2024) Multi-Step-Ahead Wind Speed Forecast System: Hybrid Multivariate Decomposition and Feature Selection-Based Gated Additive Tree Ensemble Model. IEEE Access 12, 58750-58777. DOI: 10.1109/ACCESS.2024.3392899. Online publication date: 2024.
    • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems 42, 1, 1-26. DOI: 10.1145/3597201. Online publication date: 18-Aug-2023.
    • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960-1970. DOI: 10.1145/3583780.3614916. Online publication date: 21-Oct-2023.
    • (2023) Inference at Scale: Significance Testing for Large Search and Recommendation Experiments. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2087-2091. DOI: 10.1145/3539618.3592004. Online publication date: 19-Jul-2023.
    • (2023) An unsupervised perplexity-based method for boilerplate removal. Natural Language Engineering 30, 1, 132-149. DOI: 10.1017/S1351324923000049. Online publication date: 21-Feb-2023.
    • (2022) How Do You Test a Test? Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 280-288. DOI: 10.1145/3488560.3498406. Online publication date: 11-Feb-2022.
    • (2022) A multistage retrieval system for health-related misinformation detection. Engineering Applications of Artificial Intelligence 115, C. DOI: 10.1016/j.engappai.2022.105211. Online publication date: 1-Oct-2022.
    • (2021) How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval? Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 245-250. DOI: 10.1145/3471158.3472242. Online publication date: 11-Jul-2021.
    • (2021) Towards Unified Metrics for Accuracy and Diversity for Recommender Systems. Proceedings of the 15th ACM Conference on Recommender Systems, 75-84. DOI: 10.1145/3460231.3474234. Online publication date: 13-Sep-2021.
