DOI: 10.1145/3209978.3210052
Research Article

On Fine-Grained Relevance Scales

Published: 27 June 2018

Abstract

In Information Retrieval evaluation, the classical approach of adopting binary relevance judgments has been replaced by multi-level relevance judgments and by gain-based metrics that leverage such multi-level judgment scales. Recent work has also proposed and evaluated unbounded relevance scales by means of Magnitude Estimation (ME) and compared them with multi-level scales. While ME brings advantages, such as the ability for assessors to always judge the next document as having higher or lower relevance than any of the documents they have judged so far, it also comes with some drawbacks. For example, it is not a natural approach for human assessors, who are used to judging items on bounded scales on the Web (e.g., 5-star ratings). In this work, we propose and experimentally evaluate a bounded and fine-grained relevance scale that retains many of the advantages of ME while addressing some of its issues. We collect relevance judgments on a 100-level relevance scale (S100) by means of a large-scale crowdsourcing experiment and compare the results with other relevance scales (binary, 4-level, and ME), showing the benefit of fine-grained scales over both coarse-grained and unbounded scales, as well as highlighting some new results on ME. Our results show that S100 maintains the flexibility of unbounded scales like ME in providing assessors with ample choice when judging document relevance (i.e., assessors can fit relevance judgments in between previously given judgments). It also allows assessors to judge on a more familiar scale (e.g., on 10 levels) and to perform efficiently from the very first judging task.
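To make the relationship between scales concrete, the following is a minimal illustrative sketch (not taken from the paper): it bins fine-grained S100 judgments into coarser binary and 4-level scales and plugs them into a gain-based metric such as nDCG. The binning thresholds and the use of raw scores as gains are assumptions for illustration, not the paper's actual mapping or evaluation procedure.

```python
# Illustrative sketch only (not from the paper): map fine-grained S100 judgments
# onto coarser scales and plug them into a gain-based metric such as nDCG.
# The binning thresholds and the use of raw scores as gains are assumptions.
import math

def s100_to_4level(score: int) -> int:
    """Bin a 0-100 judgment into a 4-level scale (0 = non-relevant ... 3 = highly relevant)."""
    return min(score // 25, 3)

def s100_to_binary(score: int, threshold: int = 50) -> int:
    """Binarize a 0-100 judgment with an assumed relevance threshold."""
    return 1 if score >= threshold else 0

def dcg(gains):
    """Discounted cumulative gain over a ranked list of gain values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """Normalized DCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Example: S100 judgments for five documents in ranked order.
s100 = [90, 40, 75, 10, 0]
print([s100_to_4level(s) for s in s100])   # -> [3, 1, 3, 0, 0]
print([s100_to_binary(s) for s in s100])   # -> [1, 0, 1, 0, 0]
print(round(ndcg(s100), 3))                # nDCG computed directly on the fine-grained scale
```

Any such down-mapping is only one of many possible choices; the paper's actual comparison of binary, 4-level, ME, and S100 judgments is carried out empirically via the crowdsourcing experiment described in the abstract.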



Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN:9781450356572
DOI:10.1145/3209978
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2018


Author Tags

  1. ir evaluation
  2. relevance scales

Qualifiers

  • Research-article

Funding Sources

  • European Union's Horizon 2020 research and innovation programme

Conference

SIGIR '18

Acceptance Rates

SIGIR '18 Paper Acceptance Rate: 86 of 409 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

