DOI: 10.1145/3626772.3657784
Research article · Open access

Are Large Language Models Good at Utility Judgments?

Published: 11 July 2024

Abstract

Retrieval-augmented generation (RAG) is considered a promising approach to alleviating the hallucination issue of large language models (LLMs), and it has recently received widespread attention from researchers. Because retrieval models are limited in their semantic understanding, the success of RAG depends heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering.
In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain question answering (QA). Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our experiments reveal that: (i) well-instructed LLMs can distinguish between relevance and utility, although they are highly receptive to newly generated counterfactual passages; (ii) key factors in the instruction design affect utility judgments, and we scrutinize them; (iii) to verify the efficacy of utility judgments in practical retrieval-augmentation applications, we examine LLMs' QA capabilities using both evidence selected via utility judgments and direct dense retrieval results; and (iv) we propose a k-sampling, listwise approach that reduces the dependency of LLMs on the order of input passages, thereby facilitating subsequent answer generation. We believe that the way we formalize and study the problem, along with our findings, contributes to a critical assessment of retrieval-augmented LLMs. Our code and benchmark can be found at https://github.com/ict-bigdatalab/utility_judgments.
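To make item (iv) concrete, the sketch below illustrates one way a k-sampling, listwise utility judgment could work: sample k random orderings of the candidate passages, ask the LLM for a listwise utility judgment on each ordering, and keep only passages that are selected consistently across orderings. This is a minimal illustration under our own assumptions, not the authors' implementation; the judge callback and all parameter names are hypothetical placeholders.

import random
from collections import Counter
from typing import Callable, List

def k_sampling_listwise(
    question: str,
    passages: List[str],
    judge: Callable[[str, List[str]], List[int]],
    k: int = 5,
    min_votes: int = 3,
    seed: int = 0,
) -> List[str]:
    """Vote over k shuffled orderings of the candidate passages.

    `judge` stands in for an LLM call: given a question and an ordered
    list of passages, it returns the positions (within that ordering)
    of the passages it judges to have utility for answering the question.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(k):
        order = list(range(len(passages)))
        rng.shuffle(order)  # a fresh random input ordering per sample
        shuffled = [passages[i] for i in order]
        for pos in judge(question, shuffled):
            votes[order[pos]] += 1  # map position back to original index
    # Keep passages judged useful in at least `min_votes` of the k runs,
    # so no single input ordering determines the final evidence set.
    return [passages[i] for i in sorted(votes) if votes[i] >= min_votes]

The aggregation step is the point: because each sample presents the passages in a different order, position-sensitive behavior in the LLM is averaged out before the selected evidence is passed to answer generation.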


Cited By

  • Robust Information Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 3009–3012. DOI: 10.1145/3626772.3661380. Online publication date: 10 July 2024.


Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2024
3164 pages
ISBN:9798400704314
DOI:10.1145/3626772
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 11 July 2024

Author Tags

  1. large language models
  2. open-domain qa
  3. utility judgments

Qualifiers

  • Research-article

Funding Sources

  • Youth Innovation Promotion Association CAS
  • Strategic Priority Research Program of the CAS
  • LESSEN
  • FINDHR
  • Lenovo-CAS Joint Lab Youth Scientist Project
  • National Key Research and Development Program of China
  • National Natural Science Foundation of China (NSFC)

Conference

SIGIR 2024

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Article Metrics

  • Downloads (last 12 months): 590
  • Downloads (last 6 weeks): 84
Reflects downloads up to 25 Feb 2025

