Relevance assessment: are judges exchangeable and does it matter

Published: 20 July 2008
DOI: 10.1145/1390334.1390447

Abstract

We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who are topic originators and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who are those who did not define topics and are not experts in the task.
Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, due to changes in the relative performance of assessed systems, and gold standard judges are preferred.
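
The abstract describes two kinds of comparison: agreement between assessor groups on per-document relevance labels, and the stability of system scores and rankings when one group's judgements are substituted for another's. As a minimal illustration of how such comparisons are commonly made (this is not the authors' code; the labels, system names, and rankings below are invented), the sketch computes Cohen's kappa between two hypothetical assessors and Kendall's tau between two hypothetical system orderings.

```python
# Illustrative sketch only -- not the paper's code or data. It shows the two
# comparisons discussed in the abstract: inter-assessor agreement on relevance
# labels (Cohen's kappa) and the stability of a system ranking across two sets
# of judgements (Kendall's tau). All inputs are invented.

from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Agreement between two assessors' labels on the same documents, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)


def kendall_tau(order_a, order_b):
    """Rank correlation between two strict orderings of the same set of systems."""
    pos_a = {s: i for i, s in enumerate(order_a)}
    pos_b = {s: i for i, s in enumerate(order_b)}
    concordant = discordant = 0
    for s, t in combinations(order_a, 2):
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical binary relevance labels from a "gold" and a "bronze" assessor
# for the same ten pooled documents (1 = relevant, 0 = not relevant).
gold_labels   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
bronze_labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
print("kappa(gold, bronze) =", round(cohens_kappa(gold_labels, bronze_labels), 3))

# Hypothetical system rankings induced by scoring the same runs against the
# gold and bronze judgement sets.
gold_ranking   = ["sysA", "sysB", "sysC", "sysD", "sysE"]
bronze_ranking = ["sysA", "sysC", "sysB", "sysD", "sysE"]
print("tau(gold, bronze)   =", round(kendall_tau(gold_ranking, bronze_ranking), 3))
```

In practice one would likely use an existing implementation (for example, scipy.stats.kendalltau) rather than hand-rolled versions; they are written out here only to keep the sketch self-contained.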

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. inter-rater agreement
  2. test collection relevance judgements

Qualifiers

  • Research-article

Conference

SIGIR '08

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Cited By

  • Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2024), 2307-2317. DOI: 10.1145/3637528.3671883
  • Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (2024), 340-345. DOI: 10.1145/3627508.3638322
  • What Matters in a Measure? A Perspective from Large-Scale Search Evaluation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 282-292. DOI: 10.1145/3626772.3657845
  • Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 1930-1940. DOI: 10.1145/3626772.3657707
  • Comparison of Tools and Methods for Technology-Assisted Review. In Information Management (2024), 106-126. DOI: 10.1007/978-3-031-64359-0_9
  • Relevance Judgment Convergence Degree – A Measure of Inconsistency among Assessors for Information Retrieval. In Proceedings of the 30th International Conference on Information Systems Development (2023). DOI: 10.62036/ISD.2022.38
  • Chuweb21D: A Deduped English Document Collection for Web Search Tasks. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (2023), 63-72. DOI: 10.1145/3624918.3625317
  • On the Ordering of Pooled Web Pages, Gold Assessments, and Bronze Assessments. ACM Transactions on Information Systems 42(1) (2023), 1-31. DOI: 10.1145/3600227
  • The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures. ACM Transactions on Information Systems 42(1) (2023), 1-31. DOI: 10.1145/3596511
  • On the Reliability of User Feedback for Evaluating the Quality of Conversational Agents. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023), 4185-4189. DOI: 10.1145/3583780.3615286
