DOI: 10.1145/1458082.1458159

Comparing metrics across TREC and NTCIR: the robustness to system bias

Published: 26 October 2008

Abstract

Test collections are growing larger, and relevance data constructed through pooling are suspected of becoming more and more incomplete and biased. Several studies have used evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under incomplete but unbiased conditions, using random samples of the original relevance data. This paper examines nine metrics in a more realistic setting, by reducing the number of pooled systems. Even though previous work has shown that metrics based on a condensed list, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold in the presence of system bias. In our experiments using TREC and NTCIR data, we first show that condensed-list metrics overestimate new systems while traditional metrics underestimate them, and that the overestimation tends to be larger than the underestimation. We then show that, when relevance data are heavily biased towards a single team or a few teams, the condensed-list versions of Average Precision (AP), Q-measure (Q) and normalised Discounted Cumulative Gain (nDCG), which we call AP', Q' and nDCG', are not necessarily superior to the original metrics in terms of discriminative power, i.e., the overall ability to detect pairwise statistical significance. Nevertheless, even under system bias, AP' and Q' are generally more discriminative than bpref and the condensed-list version of Rank-Biased Precision (RBP), which we call RBP'.
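To make the over/underestimation contrast concrete, the following is a minimal sketch (not taken from the paper, assuming simple binary relevance judgments) of standard Average Precision, which treats unjudged documents as non-relevant, versus its condensed-list counterpart AP', which removes unjudged documents from the ranking before scoring. The document IDs, toy data and function names are illustrative only.

    # Hypothetical illustration of the condensed-list idea; not the paper's code.

    def average_precision(ranking, judgments):
        """Standard AP: unjudged documents are treated as non-relevant."""
        num_relevant = sum(1 for rel in judgments.values() if rel)
        if num_relevant == 0:
            return 0.0
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if judgments.get(doc, 0):      # unjudged docs default to 0
                hits += 1
                total += hits / rank
        return total / num_relevant

    def condensed_average_precision(ranking, judgments):
        """AP': drop every unjudged document, then score the condensed list."""
        condensed = [doc for doc in ranking if doc in judgments]
        return average_precision(condensed, judgments)

    # Toy run: d3 and d5 are unjudged, e.g. retrieved only by a new, unpooled system.
    judgments = {"d1": 1, "d2": 0, "d4": 1}   # 1 = relevant, 0 = judged non-relevant
    ranking = ["d3", "d1", "d5", "d2", "d4"]

    print(average_precision(ranking, judgments))            # 0.45
    print(condensed_average_precision(ranking, judgments))  # ~0.83

On this toy ranking the unjudged documents pull standard AP down (underestimating the new system), while AP' simply skips them and scores the run much higher (overestimating it); this is the asymmetry described in the abstract.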




    Published In

    CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management
    October 2008
    1562 pages
    ISBN:9781595939913
    DOI:10.1145/1458082
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. evaluation metrics
    2. graded relevance
    3. test collection

    Qualifiers

    • Research-article

    Conference

    CIKM '08: ACM Conference on Information and Knowledge Management
    October 26-30, 2008
    Napa Valley, California, USA

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%


