The effect of pooling and evaluation depth on IR metrics

Published: 01 August 2016

Abstract

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics, and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as k=20 or k=100, without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources including NewsWire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
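
To make the contrast between the two metric families concrete, the following minimal Python sketch (not taken from the paper; the run, the document identifiers, and the RBP persistence parameter p=0.8 are hypothetical) computes a utility-based metric (precision@k and rank-biased precision truncated at k) and a recall-based metric (average precision truncated at k) from a single ranked list and a set of pooled binary relevance judgments. Unjudged documents are treated as non-relevant, which is precisely the assumption that finite pooling imposes on evaluation at greater depths.

# Minimal illustrative sketch of fixed-depth evaluation with pooled judgments.
# The run, the document ids, and p=0.8 are hypothetical values for illustration.

def precision_at_k(ranking, relevant, k):
    """Utility-style metric: fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision_at_k(ranking, relevant, k):
    """Recall-based metric: AP truncated at depth k, normalised by the number
    of known relevant documents (trec_eval-style)."""
    hits, score = 0, 0.0
    for i, d in enumerate(ranking[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def rbp_at_k(ranking, relevant, k, p=0.8):
    """Top-heavy utility-based metric: rank-biased precision, truncated at
    depth k; ranks beyond k can only add a bounded residual."""
    return (1 - p) * sum(p ** i for i, d in enumerate(ranking[:k]) if d in relevant)

if __name__ == "__main__":
    # One query: a hypothetical ranked run and its pooled relevant documents.
    run = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d6", "d5", "d0"]
    pooled_relevant = {"d3", "d9", "d5"}   # only pooled documents are judged

    for k in (5, 10):
        print(f"k={k:2d}"
              f"  P@k={precision_at_k(run, pooled_relevant, k):.3f}"
              f"  AP@k={average_precision_at_k(run, pooled_relevant, k):.3f}"
              f"  RBP@k={rbp_at_k(run, pooled_relevant, k):.3f}")

Re-running a comparison such as this with a different evaluation depth k, or with judgments pooled to a different depth, is the kind of perturbation whose effect on system-versus-system orderings the paper quantifies.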

      Published In

      Information Retrieval, Volume 19, Issue 4
      Aug 2016
      95 pages

      Publisher

      Kluwer Academic Publishers

      United States

      Publication History

      Published: 01 August 2016
      Accepted: 06 June 2016
      Received: 15 February 2016

      Author Tags

      1. Evaluation metrics comparison
      2. Pooling and evaluation depth
      3. Experimentation

      Qualifiers

      • Research-article

      Funding Sources

      • Australian Research Council (AU)

      Cited By

      • (2024) Leveraging LLMs for Unsupervised Dense Retriever Ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1307–1317. DOI: 10.1145/3626772.3657798. Online publication date: 10-Jul-2024.
      • (2023) A Versatile Framework for Evaluating Ranked Lists in Terms of Group Fairness and Relevance. ACM Transactions on Information Systems, 42(1), pp. 1–36. DOI: 10.1145/3589763. Online publication date: 18-Aug-2023.
      • (2023) When Measurement Misleads. ACM SIGIR Forum, 56(1), pp. 1–20. DOI: 10.1145/3582524.3582540. Online publication date: 27-Jan-2023.
      • (2023) The Infinite Index: Information Retrieval on Generative Text-To-Image Models. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pp. 172–186. DOI: 10.1145/3576840.3578327. Online publication date: 19-Mar-2023.
      • (2023) Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. Advances in Information Retrieval, pp. 313–329. DOI: 10.1007/978-3-031-28244-7_20. Online publication date: 2-Apr-2023.
      • (2022) Introducing Neural Bag of Whole-Words with ColBERTer. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 737–747. DOI: 10.1145/3511808.3557367. Online publication date: 17-Oct-2022.
      • (2022) WANDS: Dataset for Product Search Relevance Assessment. Advances in Information Retrieval, pp. 128–141. DOI: 10.1007/978-3-030-99736-6_9. Online publication date: 10-Apr-2022.
      • (2021) Truncated Models for Probabilistic Weighted Retrieval. ACM Transactions on Information Systems, 40(3), pp. 1–24. DOI: 10.1145/3476837. Online publication date: 8-Dec-2021.
      • (2021) On the Instability of Diminishing Return IR Measures. Advances in Information Retrieval, pp. 572–586. DOI: 10.1007/978-3-030-72113-8_38. Online publication date: 28-Mar-2021.
      • (2020) Feature Extraction for Large-Scale Text Collections. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3015–3022. DOI: 10.1145/3340531.3412773. Online publication date: 19-Oct-2020.