DOI: 10.1145/3578337.3605136
Research Article · Open Access

Perspectives on Large Language Models for Relevance Judgment

Published: 09 August 2023

Abstract

When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with the concerns and issues that arise. We devise a human-machine collaboration spectrum that allows us to categorize different relevance judgment strategies, based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, as well as a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.
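To make the 'fully automated judgments' end of the spectrum concrete, the sketch below shows one way an LLM could be prompted for graded relevance labels and compared against human assessments. This is a minimal illustration, not the protocol used in the paper's pilot experiment: the prompt wording, the gpt-4 model choice, and the toy query-passage data are assumptions for the sake of the example, and label-level agreement is measured here with Cohen's kappa via scikit-learn.

```python
# Illustrative sketch (not the paper's protocol): elicit graded relevance
# labels from an LLM and measure agreement with human assessors.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the paper's actual prompt may differ.
PROMPT = (
    "You are a relevance assessor. Given a query and a passage, answer with "
    "a single digit: 0 (not relevant), 1 (partially relevant), or "
    "2 (highly relevant).\n\n"
    "Query: {query}\nPassage: {passage}\nJudgment:"
)

def judge(query: str, passage: str) -> int:
    """Ask the LLM for one graded relevance label for a query-passage pair."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # near-deterministic labels for evaluation
    )
    # Parse the leading digit of the model's reply as the label.
    return int(response.choices[0].message.content.strip()[0])

# Toy (query, passage, human label) triples standing in for real qrels.
pairs = [
    ("effects of caffeine on sleep",
     "Caffeine blocks adenosine receptors, delaying sleep onset ...", 2),
    ("effects of caffeine on sleep",
     "The espresso machine was patented in Turin in 1884 ...", 0),
]
llm_labels = [judge(query, passage) for query, passage, _ in pairs]
human_labels = [label for _, _, label in pairs]
print("Cohen's kappa (LLM vs. human):",
      cohen_kappa_score(human_labels, llm_labels))
```

In practice such a loop would run over the full judgment pool of a test collection, and label-level agreement is only one lens: studies of this kind typically also check whether the system rankings induced by LLM-based qrels match those induced by human qrels (e.g., via rank correlation).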




    Published In

    ICTIR '23: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval
    August 2023, 300 pages
    ISBN: 9798400700736
    DOI: 10.1145/3578337

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. automatic test collections
    2. human-machine collaboration
    3. large language models
    4. relevance judgments

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation

    Conference

    ICTIR '23

    Acceptance Rates

    ICTIR '23 Paper Acceptance Rate: 30 of 73 submissions, 41%
    Overall Acceptance Rate: 235 of 527 submissions, 45%

    Article Metrics

    • Downloads (last 12 months): 1,466
    • Downloads (last 6 weeks): 126

    Reflects downloads up to 01 Sep 2024.

    Cited By
    • (2024) Report on the Search Futures Workshop at ECIR 2024. ACM SIGIR Forum 58:1 (1-41). https://doi.org/10.1145/3687273.3687288. Online publication date: 1-Jun-2024.
    • (2024) Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval (175-184). https://doi.org/10.1145/3664190.3672511. Online publication date: 2-Aug-2024.
    • (2024) Advancing the Search Frontier with AI Agents. Communications of the ACM. https://doi.org/10.1145/3655615. Online publication date: 20-Aug-2024.
    • (2024) Report on the 9th ACM SIGIR / the 13th International Conference on the Theory of Information Retrieval (ICTIR 2023). ACM SIGIR Forum 57:2 (1-5). https://doi.org/10.1145/3642979.3642992. Online publication date: 22-Jan-2024.
    • (2024) Tasks, Copilots, and the Future of Search: A Keynote at SIGIR 2023. ACM SIGIR Forum 57:2 (1-8). https://doi.org/10.1145/3642979.3642985. Online publication date: 22-Jan-2024.
    • (2024) Evaluating Parrots and Sociopathic Liars: A Keynote at ICTIR 2023. ACM SIGIR Forum 57:2 (1-7). https://doi.org/10.1145/3642979.3642984. Online publication date: 22-Jan-2024.
    • (2024) Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2307-2317). https://doi.org/10.1145/3637528.3671883. Online publication date: 25-Aug-2024.
    • (2024) Generative Information Systems Are Great If You Can Read. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (165-177). https://doi.org/10.1145/3627508.3638345. Online publication date: 10-Mar-2024.
    • (2024) Enhancing Human Annotation: Leveraging Large Language Models and Efficient Batch Processing. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (340-345). https://doi.org/10.1145/3627508.3638322. Online publication date: 10-Mar-2024.
    • (2024) UnExplored FrontCHIIRs: A Workshop Exploring Future Directions for Information Access. Proceedings of the 2024 Conference on Human Information Interaction and Retrieval (436-437). https://doi.org/10.1145/3627508.3638302. Online publication date: 10-Mar-2024.
