Relevance assessment: are judges exchangeable and does it matter

Published: 20 July 2008
DOI: 10.1145/1390334.1390447

Abstract

We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who are topic originators and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who are those who did not define topics and are not experts in the task.
Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when these judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, due to changes in the relative performance of assessed systems, and gold standard judges are preferred.
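
The abstract describes two kinds of comparison: agreement between assessor groups on per-document relevance labels, and the stability of system scores and rankings when one group's judgements are substituted for another's. As a minimal illustration of how such comparisons are commonly made (this is not the authors' code; the labels, system names, and rankings below are invented), the sketch computes Cohen's kappa between two hypothetical assessors and Kendall's tau between two hypothetical system orderings.

```python
# Illustrative sketch only -- not the paper's code or data. It shows the two
# comparisons discussed in the abstract: inter-assessor agreement on relevance
# labels (Cohen's kappa) and the stability of a system ranking across two sets
# of judgements (Kendall's tau). All inputs are invented.

from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Agreement between two assessors' labels on the same documents, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    return (observed - expected) / (1 - expected)


def kendall_tau(order_a, order_b):
    """Rank correlation between two strict orderings of the same set of systems."""
    pos_a = {s: i for i, s in enumerate(order_a)}
    pos_b = {s: i for i, s in enumerate(order_b)}
    concordant = discordant = 0
    for s, t in combinations(order_a, 2):
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical binary relevance labels from a "gold" and a "bronze" assessor
# for the same ten pooled documents (1 = relevant, 0 = not relevant).
gold_labels   = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
bronze_labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
print("kappa(gold, bronze) =", round(cohens_kappa(gold_labels, bronze_labels), 3))

# Hypothetical system rankings induced by scoring the same runs against the
# gold and bronze judgement sets.
gold_ranking   = ["sysA", "sysB", "sysC", "sysD", "sysE"]
bronze_ranking = ["sysA", "sysC", "sysB", "sysD", "sysE"]
print("tau(gold, bronze)   =", round(kendall_tau(gold_ranking, bronze_ranking), 3))
```

In practice one would likely use an existing implementation (for example, scipy.stats.kendalltau) rather than hand-rolled versions; they are written out here only to keep the sketch self-contained.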

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. inter-rater agreement
  2. test collection relevance judgements

Qualifiers

  • Research-article

Conference

SIGIR '08

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Cited By

  • Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2024), 2307-2317. DOI: 10.1145/3637528.3671883
  • Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (2024), 340-345. DOI: 10.1145/3627508.3638322
  • What Matters in a Measure? A Perspective from Large-Scale Search Evaluation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 282-292. DOI: 10.1145/3626772.3657845
  • Large Language Models can Accurately Predict Searcher Preferences. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 1930-1940. DOI: 10.1145/3626772.3657707
  • Comparison of Tools and Methods for Technology-Assisted Review. In Information Management (2024), 106-126. DOI: 10.1007/978-3-031-64359-0_9
  • Relevance Judgment Convergence Degree – A Measure of Inconsistency among Assessors for Information Retrieval. In Proceedings of the 30th International Conference on Information Systems Development (2023). DOI: 10.62036/ISD.2022.38
  • Chuweb21D: A Deduped English Document Collection for Web Search Tasks. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (2023), 63-72. DOI: 10.1145/3624918.3625317
  • On the Ordering of Pooled Web Pages, Gold Assessments, and Bronze Assessments. ACM Transactions on Information Systems 42(1) (2023), 1-31. DOI: 10.1145/3600227
  • The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures. ACM Transactions on Information Systems 42(1) (2023), 1-31. DOI: 10.1145/3596511
  • On the Reliability of User Feedback for Evaluating the Quality of Conversational Agents. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023), 4185-4189. DOI: 10.1145/3583780.3615286
