DOI: 10.1145/3578337.3605136
Research Article · Open Access

Perspectives on Large Language Models for Relevance Judgment

Published: 09 August 2023

Abstract

When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with the concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
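To make the 'fully automated judgments' end of the spectrum concrete, the sketch below shows one way an LLM could be prompted for graded relevance labels and compared against human assessments. This is a minimal illustration, not the protocol used in the paper's pilot experiment: the prompt wording, the gpt-4 model choice, and the toy query-passage data are assumptions for the sake of the example, and label-level agreement is measured here with Cohen's kappa via scikit-learn.

```python
# Illustrative sketch (not the paper's protocol): elicit graded relevance
# labels from an LLM and measure agreement with human assessors.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the paper's actual prompt may differ.
PROMPT = (
    "You are a relevance assessor. Given a query and a passage, answer with "
    "a single digit: 0 (not relevant), 1 (partially relevant), or "
    "2 (highly relevant).\n\n"
    "Query: {query}\nPassage: {passage}\nJudgment:"
)

def judge(query: str, passage: str) -> int:
    """Ask the LLM for one graded relevance label for a query-passage pair."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # near-deterministic labels for evaluation
    )
    # Parse the leading digit of the model's reply as the label.
    return int(response.choices[0].message.content.strip()[0])

# Toy (query, passage, human label) triples standing in for real qrels.
pairs = [
    ("effects of caffeine on sleep",
     "Caffeine blocks adenosine receptors, delaying sleep onset ...", 2),
    ("effects of caffeine on sleep",
     "The espresso machine was patented in Turin in 1884 ...", 0),
]
llm_labels = [judge(query, passage) for query, passage, _ in pairs]
human_labels = [label for _, _, label in pairs]
print("Cohen's kappa (LLM vs. human):",
      cohen_kappa_score(human_labels, llm_labels))
```

In practice such a loop would run over the full judgment pool of a test collection, and label-level agreement is only one lens: studies of this kind typically also check whether the system rankings induced by LLM-based qrels match those induced by human qrels (e.g., via rank correlation).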




    Published In

    ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval
    August 2023, 300 pages
    ISBN: 9798400700736
    DOI: 10.1145/3578337

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. automatic test collections
    2. human-machine collaboration
    3. large language models
    4. relevance judgments

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation

    Conference

    ICTIR '23

    Acceptance Rates

    ICTIR '23 Paper Acceptance Rate: 30 of 73 submissions, 41%
    Overall Acceptance Rate: 235 of 527 submissions, 45%

    Article Metrics

    • Downloads (last 12 months): 1,466
    • Downloads (last 6 weeks): 126

    Reflects downloads up to 01 Sep 2024.

    Cited By
    • (2024) Report on the Search Futures Workshop at ECIR 2024. ACM SIGIR Forum 58:1 (1-41). https://doi.org/10.1145/3687273.3687288. Online publication date: 1-Jun-2024.
    • (2024) Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval (175-184). https://doi.org/10.1145/3664190.3672511. Online publication date: 2-Aug-2024.
    • (2024) Advancing the Search Frontier with AI Agents. Communications of the ACM. https://doi.org/10.1145/3655615. Online publication date: 20-Aug-2024.
    • (2024) Report on the 9th ACM SIGIR / the 13th International Conference on the Theory of Information Retrieval (ICTIR 2023). ACM SIGIR Forum 57:2 (1-5). https://doi.org/10.1145/3642979.3642992. Online publication date: 22-Jan-2024.
    • (2024) Tasks, Copilots, and the Future of Search: A Keynote at SIGIR 2023. ACM SIGIR Forum 57:2 (1-8). https://doi.org/10.1145/3642979.3642985. Online publication date: 22-Jan-2024.
    • (2024) Evaluating Parrots and Sociopathic Liars: A Keynote at ICTIR 2023. ACM SIGIR Forum 57:2 (1-7). https://doi.org/10.1145/3642979.3642984. Online publication date: 22-Jan-2024.
    • (2024) Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2307-2317). https://doi.org/10.1145/3637528.3671883. Online publication date: 25-Aug-2024.
    • (2024) Generative Information Systems Are Great If You Can Read. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (165-177). https://doi.org/10.1145/3627508.3638345. Online publication date: 10-Mar-2024.
    • (2024) Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (340-345). https://doi.org/10.1145/3627508.3638322. Online publication date: 10-Mar-2024.
    • (2024) UnExplored FrontCHIIRs: A Workshop Exploring Future Directions for Information Access. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (436-437). https://doi.org/10.1145/3627508.3638302. Online publication date: 10-Mar-2024.
