DOI: 10.1145/3412841.3441945

Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation

Published: 22 April 2021

Abstract

Null Hypothesis Significance Testing (NHST) is routinely employed as the reference framework for assessing differences in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests such as the t-test, the Wilcoxon Signed Rank test, the permutation test, the sign test, or the bootstrap test. However, which of these tests is the most reliable in IR experimentation remains controversial: different authors have tried to shed light on the issue, but their conclusions do not agree. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method builds models of the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge of whether the null hypothesis is true or false. Following our methodology, we computed a series of simulations that estimate the proportion of Type I and Type II errors made by different tests. Results conclusively suggest that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool to assess differences between IR systems.
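The core idea of the simulation, controlling the truth of the null hypothesis by construction and counting how often a test rejects it, can be sketched in a few lines. The sketch below is illustrative, not the authors' actual system models: it assumes Gaussian per-topic scores and uses a simple sign-flipping paired permutation test; since both "systems" draw from the same distribution, H0 is true and the rejection rate estimates the Type I error.

```python
import random


def perm_test_pvalue(diffs, n_perm=200, rng=None):
    """Two-sided paired (sign-flipping) permutation test on per-topic
    score differences. Returns an estimated p-value."""
    rng = rng or random.Random(0)
    n = len(diffs)
    obs = abs(sum(diffs) / n)
    # Include the observed statistic itself so the p-value is never zero.
    count = 1
    for _ in range(n_perm):
        flipped = sum(d * rng.choice((-1, 1)) for d in diffs) / n
        if abs(flipped) >= obs:
            count += 1
    return count / (n_perm + 1)


def type_i_error_rate(n_topics=50, n_reps=200, alpha=0.05, seed=42):
    """Estimate the Type I error rate under a true null hypothesis.
    Both simulated systems sample per-topic scores from the same
    (hypothetical) Gaussian, so every rejection is a Type I error."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        a = [rng.gauss(0.30, 0.10) for _ in range(n_topics)]
        b = [rng.gauss(0.30, 0.10) for _ in range(n_topics)]
        diffs = [x - y for x, y in zip(a, b)]
        if perm_test_pvalue(diffs, rng=rng) < alpha:
            rejections += 1
    return rejections / n_reps
```

A well-behaved test should reject at a rate close to the nominal alpha (here 0.05); estimating Type II errors works the same way, except the two score distributions are made genuinely different so that every non-rejection is an error.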



    Published In

    SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
    March 2021
    2075 pages
    ISBN:9781450381048
    DOI:10.1145/3412841

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. information retrieval
    2. simulation
    3. statistical testing

    Qualifiers

    • Research-article

    Conference

    SAC '21
    Sponsor:
    SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
    March 22 - 26, 2021
    Virtual Event, Republic of Korea

    Acceptance Rates

    Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


    Article Metrics

    • Downloads (last 12 months): 27
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 24 Dec 2024


    Cited By

    • (2024) Uncontextualized significance considered dangerous. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 261-270. DOI: 10.1145/3626772.3657827. Online publication date: 10-Jul-2024.
    • (2024) Multi-Step-Ahead Wind Speed Forecast System: Hybrid Multivariate Decomposition and Feature Selection-Based Gated Additive Tree Ensemble Model. IEEE Access 12, 58750-58777. DOI: 10.1109/ACCESS.2024.3392899. Online publication date: 2024.
    • (2023) How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments. ACM Transactions on Information Systems 42, 1, 1-26. DOI: 10.1145/3597201. Online publication date: 18-Aug-2023.
    • (2023) How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1960-1970. DOI: 10.1145/3583780.3614916. Online publication date: 21-Oct-2023.
    • (2023) Inference at Scale: Significance Testing for Large Search and Recommendation Experiments. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2087-2091. DOI: 10.1145/3539618.3592004. Online publication date: 19-Jul-2023.
    • (2023) An unsupervised perplexity-based method for boilerplate removal. Natural Language Engineering 30, 1, 132-149. DOI: 10.1017/S1351324923000049. Online publication date: 21-Feb-2023.
    • (2022) How Do You Test a Test? Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 280-288. DOI: 10.1145/3488560.3498406. Online publication date: 11-Feb-2022.
    • (2022) A multistage retrieval system for health-related misinformation detection. Engineering Applications of Artificial Intelligence 115, C. DOI: 10.1016/j.engappai.2022.105211. Online publication date: 1-Oct-2022.
    • (2021) How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval? Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 245-250. DOI: 10.1145/3471158.3472242. Online publication date: 11-Jul-2021.
    • (2021) Towards Unified Metrics for Accuracy and Diversity for Recommender Systems. Proceedings of the 15th ACM Conference on Recommender Systems, 75-84. DOI: 10.1145/3460231.3474234. Online publication date: 13-Sep-2021.
