Research article (Open access)
DOI: 10.1145/3664190.3672511

Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems

Published: 05 August 2024

Abstract

Current IR evaluation paradigms are challenged by large language models (LLMs) and retrieval-augmented generation (RAG) methods. Furthermore, evaluation either resorts to expensive human judgments or leads to an over-reliance on LLMs.
To remedy this situation, we introduce the RUBRIC metric, which puts information retrieval systems to the proverbial test. This metric leverages a bank of query-related test questions to quantify the relevant information content contained in a system's response. The process involves (1) decomposing the query into detailed questions, and (2) checking each question for answerability using passages from the system response. Using three TREC benchmarks, we demonstrate that our LLM-based RUBRIC approach is effective. Unlike previous LLM-based evaluation measures, our paradigm lends itself to incorporating a human in the loop while avoiding some pitfalls of over-reliance on AI and the expense of manual passage-level judgments. Moreover, our evaluation is repeatable and extensible, and can be scored with existing evaluation tools. Data and code are available at https://github.com/TREMA-UNH/rubric-evaluation/
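To make the two-step process described above concrete, the following is a minimal sketch of how such a rubric-based coverage score could be computed. It is illustrative only and not the authors' implementation (see the linked repository for that); the callables generate_questions and is_answerable are hypothetical stand-ins for the LLM prompts that decompose the query and grade answerability.

from typing import Callable, List

def rubric_score(
    query: str,
    response_passages: List[str],
    generate_questions: Callable[[str], List[str]],   # step 1: query -> bank of test questions (LLM-backed)
    is_answerable: Callable[[str, str], bool],         # step 2: (question, passage) -> answerability verdict (LLM-backed)
) -> float:
    """Fraction of query-derived test questions that are answerable from the system's response."""
    questions = generate_questions(query)
    if not questions:
        return 0.0

    answered = 0
    for question in questions:
        # A question counts as covered if any passage in the system's
        # response allows the grader to answer it.
        if any(is_answerable(question, passage) for passage in response_passages):
            answered += 1

    return answered / len(questions)

In this sketch the final score is simply the fraction of covered test questions; because the verdicts are per-question and per-passage, a human can inspect or overrule individual grading decisions, which is what makes the paradigm amenable to a human-in-the-loop setup.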


Published In

ICTIR '24: Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval
August 2024, 267 pages
ISBN: 9798400706813
DOI: 10.1145/3664190
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 05 August 2024

Author Tags

1. information retrieval evaluation
2. large language models

Acceptance Rates

ICTIR '24 paper acceptance rate: 26 of 45 submissions, 58%
Overall acceptance rate: 235 of 527 submissions, 45%