The effect of pooling and evaluation depth on IR metrics

Published: 01 August 2016

Abstract

Batch IR evaluations are usually performed in a framework that consists of a document collection, a set of queries, a set of relevance judgments, and one or more effectiveness metrics. A large number of evaluation metrics have been proposed, with two primary families having emerged: recall-based metrics, and utility-based metrics. In both families, the pragmatics of forming judgments mean that it is usual to evaluate the metric to some chosen depth such as k=20 or k=100, without necessarily fully considering the ramifications associated with that choice. Our aim in this paper is to explore the relative risks arising with fixed-depth evaluation in the two families, and document the complex interplay between metric evaluation depth and judgment pooling depth. Using a range of TREC resources including NewsWire data and the ClueWeb collection, we: (1) examine the implications of finite pooling on the subsequent usefulness of different test collections, including specifying options for truncated evaluation; and (2) determine the extent to which various metrics correlate with themselves when computed to different evaluation depths using those judgments. We demonstrate that the judgment pools constructed for the ClueWeb collections lack resilience, and are suited primarily to the application of top-heavy utility-based metrics rather than recall-based metrics; and that on the majority of the established test collections, and across a range of evaluation depths, recall-based metrics tend to be more volatile in the system rankings they generate than are utility-based metrics. That is, experimentation using utility-based metrics is more robust to choices such as the evaluation depth employed than is experimentation using recall-based metrics. This distinction should be noted by researchers as they plan and execute system-versus-system retrieval experiments.
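
To make the contrast between the two metric families concrete, the following minimal Python sketch (not taken from the paper; the run, the document identifiers, and the RBP persistence parameter p=0.8 are hypothetical) computes a utility-based metric (precision@k and rank-biased precision truncated at k) and a recall-based metric (average precision truncated at k) from a single ranked list and a set of pooled binary relevance judgments. Unjudged documents are treated as non-relevant, which is precisely the assumption that finite pooling imposes on evaluation at greater depths.

# Minimal illustrative sketch of fixed-depth evaluation with pooled judgments.
# The run, the document ids, and p=0.8 are hypothetical values for illustration.

def precision_at_k(ranking, relevant, k):
    """Utility-style metric: fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision_at_k(ranking, relevant, k):
    """Recall-based metric: AP truncated at depth k, normalised by the number
    of known relevant documents (trec_eval-style)."""
    hits, score = 0, 0.0
    for i, d in enumerate(ranking[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def rbp_at_k(ranking, relevant, k, p=0.8):
    """Top-heavy utility-based metric: rank-biased precision, truncated at
    depth k; ranks beyond k can only add a bounded residual."""
    return (1 - p) * sum(p ** i for i, d in enumerate(ranking[:k]) if d in relevant)

if __name__ == "__main__":
    # One query: a hypothetical ranked run and its pooled relevant documents.
    run = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d6", "d5", "d0"]
    pooled_relevant = {"d3", "d9", "d5"}   # only pooled documents are judged

    for k in (5, 10):
        print(f"k={k:2d}"
              f"  P@k={precision_at_k(run, pooled_relevant, k):.3f}"
              f"  AP@k={average_precision_at_k(run, pooled_relevant, k):.3f}"
              f"  RBP@k={rbp_at_k(run, pooled_relevant, k):.3f}")

Re-running a comparison such as this with a different evaluation depth k, or with judgments pooled to a different depth, is the kind of perturbation whose effect on system-versus-system orderings the paper quantifies.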

      Published In

      Information Retrieval, Volume 19, Issue 4
      Aug 2016
      95 pages

      Publisher

      Kluwer Academic Publishers

      United States

      Publication History

      Published: 01 August 2016
      Accepted: 06 June 2016
      Received: 15 February 2016

      Author Tags

      1. Evaluation metrics comparison
      2. Pooling and evaluation depth
      3. Experimentation

      Qualifiers

      • Research-article

      Funding Sources

      • Australian Research Council (AU)

      Cited By

      • (2024) Leveraging LLMs for Unsupervised Dense Retriever Ranking. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1307–1317. DOI: 10.1145/3626772.3657798. Online publication date: 10-Jul-2024.
      • (2023) A Versatile Framework for Evaluating Ranked Lists in Terms of Group Fairness and Relevance. ACM Transactions on Information Systems, 42(1), pp. 1–36. DOI: 10.1145/3589763. Online publication date: 18-Aug-2023.
      • (2023) When Measurement Misleads. ACM SIGIR Forum, 56(1), pp. 1–20. DOI: 10.1145/3582524.3582540. Online publication date: 27-Jan-2023.
      • (2023) The Infinite Index: Information Retrieval on Generative Text-To-Image Models. Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, pp. 172–186. DOI: 10.1145/3576840.3578327. Online publication date: 19-Mar-2023.
      • (2023) Bootstrapped nDCG Estimation in the Presence of Unjudged Documents. Advances in Information Retrieval, pp. 313–329. DOI: 10.1007/978-3-031-28244-7_20. Online publication date: 2-Apr-2023.
      • (2022) Introducing Neural Bag of Whole-Words with ColBERTer. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 737–747. DOI: 10.1145/3511808.3557367. Online publication date: 17-Oct-2022.
      • (2022) WANDS: Dataset for Product Search Relevance Assessment. Advances in Information Retrieval, pp. 128–141. DOI: 10.1007/978-3-030-99736-6_9. Online publication date: 10-Apr-2022.
      • (2021) Truncated Models for Probabilistic Weighted Retrieval. ACM Transactions on Information Systems, 40(3), pp. 1–24. DOI: 10.1145/3476837. Online publication date: 8-Dec-2021.
      • (2021) On the Instability of Diminishing Return IR Measures. Advances in Information Retrieval, pp. 572–586. DOI: 10.1007/978-3-030-72113-8_38. Online publication date: 28-Mar-2021.
      • (2020) Feature Extraction for Large-Scale Text Collections. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3015–3022. DOI: 10.1145/3340531.3412773. Online publication date: 19-Oct-2020.