DOI: 10.1145/3269206.3271751

When Rank Order Isn't Enough: New Statistical-Significance-Aware Correlation Measures

Published: 17 October 2018


Because it is expensive to construct test collections for Cranfield-based evaluation of information retrieval systems, a variety of lower-cost methods have been proposed. The reliability of these methods is often validated by measuring rank correlation (e.g., Kendall's tau) between known system rankings on the full test collection vs. observed system rankings on the lower-cost one. However, existing rank correlation measures do not consider the statistical significance of score differences between systems in the observed rankings. To address this, we propose two statistical-significance-aware rank correlation measures, one of which is a head-weighted version of the other. We first show empirical differences between our proposed measures and existing ones. We then compare the measures while benchmarking four system evaluation methods: pooling, crowdsourcing, evaluation with incomplete judgments, and automatic system ranking. We show that use of our measures can lead to different experimental conclusions regarding reliability of alternative low-cost evaluation methods.
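The validation step described in the abstract, comparing a known system ranking with an observed one by rank correlation, can be sketched with plain Kendall's tau. This is only an illustration of the existing baseline measure, not of the two proposed significance-aware measures, whose definitions are not given on this page. The function name `kendall_tau`, the system names, and the scores below are all hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name.

    Tied pairs count as neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # The sign of the product shows whether both collections
        # order this pair of systems the same way.
        prod = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean effectiveness scores for four systems.
full = {"A": 0.40, "B": 0.35, "C": 0.30, "D": 0.20}   # full test collection
cheap = {"A": 0.38, "B": 0.39, "C": 0.28, "D": 0.21}  # lower-cost method

tau = kendall_tau(full, cheap)
print(round(tau, 3))
```

Here systems A and B swap places under the lower-cost method, and tau drops by the same amount whether or not their score difference is statistically significant. That insensitivity is the limitation of existing measures that the abstract points to.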



Cited By

  • (2021) DiffIR: Exploring Differences in Ranking Models' Behavior. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2595-2599. DOI: 10.1145/3404835.3462784. Online publication date: 11-Jul-2021
  • (2019) Quantifying Bias and Variance of System Rankings. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1089-1092. DOI: 10.1145/3331184.3331356. Online publication date: 18-Jul-2019






    Published In

    CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
    October 2018
    2362 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]



    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. evaluation
    2. ir system ranking
    3. rank correlation


    • Research-article

    Funding Sources

    • Qatar National Research Fund


    CIKM '18

    Acceptance Rates

    CIKM '18 Paper Acceptance Rate 147 of 826 submissions, 18%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%





    Article Metrics

    • Downloads (Last 12 months): 12
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 25 Dec 2024
