Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

Liu, Zijun; Kou, Boqun; Li, Peng; Yan, Ming; Zhang, Ji; Huang, Fei; Liu, Yang

arXiv:2402.12146 (cs)

[Submitted on 19 Feb 2024 (v1), last revised 31 May 2024 (this version, v3)]

Abstract: Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel at evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR). Unlike previous few-shot methods that rely solely on the in-context learning capabilities of LLMs, MR assesses reliability by ranking the target query-response pair pairwise against multiple reference query-response pairs. We find that MR is highly effective in error detection for LLM responses, where weak LLMs, such as Phi-2, can surpass strong baselines like GPT-3.5-turbo while requiring only five reference samples and significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo at lower cost. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass Alpaca-13B with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness.
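
The cross-query comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the pairwise rankings are aggregated, so the voting rule below is an assumption made for illustration, not the authors' procedure. All names (`meta_rank_score`, `is_reliable`, `toy_compare`, `TOY_QUALITY`, `REFERENCES`) are hypothetical, and `toy_compare` is a deterministic stand-in for the weak LLM that would perform each pairwise judgment in practice.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                   # (query, response)
Compare = Callable[[Pair, Pair], int]    # +1: first pair better, -1: worse, 0: tie


def meta_rank_score(target: Pair,
                    references: List[Tuple[Pair, bool]],
                    compare: Compare) -> float:
    """Aggregate pairwise comparisons into a score in [-1, 1].

    Assumed voting rule (not taken from the paper): matching or beating a
    correct reference counts for the target; failing to beat an incorrect
    reference counts against it.
    """
    if not references:
        raise ValueError("at least one reference pair is required")
    votes = 0
    for ref_pair, ref_is_correct in references:
        verdict = compare(target, ref_pair)
        if ref_is_correct:
            votes += 1 if verdict >= 0 else -1
        else:
            votes += 1 if verdict > 0 else -1
    return votes / len(references)


def is_reliable(target: Pair,
                references: List[Tuple[Pair, bool]],
                compare: Compare,
                threshold: float = 0.0) -> bool:
    """Flag the target as reliable when its score exceeds the threshold."""
    return meta_rank_score(target, references, compare) > threshold


# Hypothetical quality table standing in for a weak LLM's judgment.
TOY_QUALITY = {"4": 1.0, "7": 0.0, "Paris": 1.0, "6": 1.0, "5": 0.0}


def toy_compare(a: Pair, b: Pair) -> int:
    """Stub comparator: ranks two pairs by the toy quality of their responses."""
    qa, qb = TOY_QUALITY[a[1]], TOY_QUALITY[b[1]]
    return (qa > qb) - (qa < qb)


# Reference pairs with known correctness labels (illustrative data only).
REFERENCES = [
    (("2+2?", "4"), True),
    (("3+3?", "7"), False),
    (("Capital of France?", "Paris"), True),
]
```

Under these assumptions, `is_reliable(("3*2?", "6"), REFERENCES, toy_compare)` returns `True` and `is_reliable(("2+2?", "5"), REFERENCES, toy_compare)` returns `False`. A model cascade along the lines the abstract mentions could forward a query to a stronger model whenever this check fails for the weaker model's response.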

Comments:	Preprint, under review. 28 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.12146 [cs.CL]
	(or arXiv:2402.12146v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.12146

Submission history

From: Zijun Liu
[v1] Mon, 19 Feb 2024 13:57:55 UTC (8,907 KB)
[v2] Sun, 26 May 2024 17:46:42 UTC (9,415 KB)
[v3] Fri, 31 May 2024 03:25:42 UTC (9,415 KB)
