DOI: 10.1145/3209978.3210052
Research Article

On Fine-Grained Relevance Scales

Published: 27 June 2018

Abstract

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics that leverage such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors, who are used to judging items on bounded scales on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments on a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales, as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently from the very first judging task.
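To make the relationship between scales concrete, the following is a minimal illustrative sketch (not taken from the paper): it bins fine-grained S100 judgments into coarser binary and 4-level scales and plugs them into a gain-based metric such as nDCG. The binning thresholds and the use of raw scores as gains are assumptions for illustration, not the paper's actual mapping or evaluation procedure.

```python
# Illustrative sketch only (not from the paper): map fine-grained S100 judgments
# onto coarser scales and plug them into a gain-based metric such as nDCG.
# The binning thresholds and the use of raw scores as gains are assumptions.
import math

def s100_to_4level(score: int) -> int:
    """Bin a 0-100 judgment into a 4-level scale (0 = non-relevant ... 3 = highly relevant)."""
    return min(score // 25, 3)

def s100_to_binary(score: int, threshold: int = 50) -> int:
    """Binarize a 0-100 judgment with an assumed relevance threshold."""
    return 1 if score >= threshold else 0

def dcg(gains):
    """Discounted cumulative gain over a ranked list of gain values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """Normalized DCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Example: S100 judgments for five documents in ranked order.
s100 = [90, 40, 75, 10, 0]
print([s100_to_4level(s) for s in s100])   # -> [3, 1, 3, 0, 0]
print([s100_to_binary(s) for s in s100])   # -> [1, 0, 1, 0, 0]
print(round(ndcg(s100), 3))                # nDCG computed directly on the fine-grained scale
```

Any such down-mapping is only one of many possible choices; the paper's actual comparison of binary, 4-level, ME, and S100 judgments is carried out empirically via the crowdsourcing experiment described in the abstract.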



Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN:9781450356572
DOI:10.1145/3209978
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2018


Author Tags

  1. ir evaluation
  2. relevance scales

Qualifiers

  • Research-article

Funding Sources

  • European Union's Horizon 2020 research and innovation programme

Conference

SIGIR '18

Acceptance Rates

SIGIR '18 Paper Acceptance Rate: 86 of 409 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

