DOI: 10.1145/3485447.3511960
research-article

Preferences on a Budget: Prioritizing Document Pairs when Crowdsourcing Relevance Judgments

Published: 25 April 2022

Abstract

In Information Retrieval (IR) evaluation, preference judgments are collected by presenting assessors with a pair of documents and asking them to select which of the two, if any, is the more relevant. This is an alternative to the classic relevance judgment approach, in which human assessors judge the relevance of a single document on a scale; it allows relative rather than absolute judgments of relevance. While preference judgments are easier for human assessors to perform, the number of possible document pairs is usually so large that judging them all is infeasible. Thus, following an idea similar to pooling strategies for single-document relevance judgments, where the goal is to sample the most useful documents to judge, in this work we analyze alternative ways to sample document pairs so as to maximize the value of a fixed number of preference judgments that can feasibly be collected. This value is defined as how well we can evaluate IR systems given a budget, that is, a fixed number of human preference judgments that may be collected. Relying on several datasets featuring relevance judgments gathered from experts and through crowdsourcing, we experimentally compare alternative strategies for selecting document pairs and show how different strategies lead to different levels of IR evaluation quality. Our results show that, with an appropriate procedure, it is possible to achieve good IR evaluation results with a limited number of preference judgments, confirming the feasibility of using preference judgments to build IR evaluation collections.
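
The workflow the abstract describes (fix a judgment budget, choose which document pairs to send to assessors, then aggregate the collected preferences) can be made concrete with a small example. The following Python sketch is illustrative only, not the paper's actual prioritization strategies: the sample_pairs and rank_by_wins helpers, the "top_first" heuristic, and the toy assessor oracle are all assumptions introduced here.

    # Illustrative sketch only (assumption, not the authors' procedure): pick a
    # budget-limited set of document pairs to judge, then aggregate the collected
    # preferences into a per-topic document ordering by counting wins.
    import itertools
    import random

    def sample_pairs(doc_ids, budget, strategy="random", priority=None):
        """Return at most `budget` unordered document pairs to send to assessors."""
        pairs = list(itertools.combinations(doc_ids, 2))
        if strategy == "random":            # uniform sample of all possible pairs
            random.shuffle(pairs)
        elif strategy == "top_first":       # prefer pairs of highly ranked documents,
            rank = {d: i for i, d in enumerate(priority)}  # e.g. from pooled system runs
            pairs.sort(key=lambda p: rank[p[0]] + rank[p[1]])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        return pairs[:budget]

    def rank_by_wins(preferences):
        """Aggregate (winner, loser) preference pairs into a document ranking."""
        wins = {}
        for winner, loser in preferences:
            wins[winner] = wins.get(winner, 0) + 1
            wins.setdefault(loser, 0)       # losers still appear in the ranking
        return sorted(wins, key=wins.get, reverse=True)

    if __name__ == "__main__":
        docs = [f"d{i}" for i in range(10)]
        pairs = sample_pairs(docs, budget=15, strategy="top_first", priority=docs)
        # Toy assessor oracle: always prefers the document with the lower index.
        prefs = [(a, b) if int(a[1:]) < int(b[1:]) else (b, a) for a, b in pairs]
        print(rank_by_wins(prefs))

Under a sketch like this, the quality of a pair-selection strategy would be judged by how closely the system ranking induced by the aggregated preferences at a given budget agrees with the ranking obtained from a full set of judgments.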


Cited By

  • (2024) Reliable Information Retrieval Systems Performance Evaluation: A Review. IEEE Access 12, 51740–51751. https://doi.org/10.1109/ACCESS.2024.3377239
  • (2023) A Preference Judgment Tool for Authoritative Assessment. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3100–3104. https://doi.org/10.1145/3539618.3591801
  • (2023) Extending Label Aggregation Models with a Gaussian Process to Denoise Crowdsourcing Labels. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 729–738. https://doi.org/10.1145/3539618.3591685
  • (2023) Implications and New Directions for IR Research and Practices. In A Behavioral Economics Approach to Interactive Information Retrieval, 181–201. https://doi.org/10.1007/978-3-031-23229-9_7
  • (2022) Human Preferences as Dueling Bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 567–577. https://doi.org/10.1145/3477495.3531991

      Published In

      WWW '22: Proceedings of the ACM Web Conference 2022
      April 2022
      3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 April 2022

      Author Tags

      1. Crowdsourcing
      2. Preference Judgments
      3. Relevance Assessment

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • ARC Discovery Project
      • ARC Training Centre for Information Resilience

      Conference

      WWW '22
      Sponsor:
      WWW '22: The ACM Web Conference 2022
      April 25 - 29, 2022
      Virtual Event, Lyon, France

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

