Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

Mohankumar, Akash Kumar; Khapra, Mitesh M.


arXiv:2203.06063v2 (cs)

[Submitted on 11 Mar 2022 (v1), last revised 17 Apr 2022 (this version, v2)]




Abstract: Recent studies have shown the advantages of evaluating NLG systems using pairwise comparisons as opposed to direct assessment. Given $k$ systems, a naive approach for identifying the top-ranked system would be to uniformly obtain pairwise comparisons from all ${k \choose 2}$ pairs of systems. However, this can be very expensive, as the number of human annotations required would grow quadratically with $k$. In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms. We perform extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation datasets spanning 5 tasks and show that the number of human annotations can be reduced by 80%. To further reduce the number of human annotations, we propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations. Specifically, we eliminate sub-optimal systems even before the human annotation process and perform human evaluations only on test examples where the automatic metric is highly uncertain. This further reduces the number of human annotations required by 89%. In effect, we show that identifying the top-ranked system requires only a few hundred human annotations, a number that grows linearly with $k$. Lastly, we provide practical recommendations and best practices to identify the top-ranked system efficiently. Our code has been made publicly available at this https URL
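To make the idea of actively choosing system pairs concrete, the following is a minimal sketch of a dueling bandit loop, loosely modeled on relative-upper-confidence-bound style algorithms. It is not the authors' implementation. The function names (`simulate_duel`, `active_top_system`), the `strengths` vector, the `alpha` constant, and the Bradley-Terry model that stands in for human annotators are all assumptions made for illustration.

```python
import math
import random


def simulate_duel(i, j, strengths, rng):
    """Simulated annotator: True if system i is preferred over system j.

    Assumes a Bradley-Terry preference model, purely for illustration.
    """
    return rng.random() < strengths[i] / (strengths[i] + strengths[j])


def active_top_system(strengths, budget, alpha=0.51, seed=0):
    """Spend `budget` pairwise comparisons adaptively; return (best, wins)."""
    rng = random.Random(seed)
    k = len(strengths)
    wins = [[0] * k for _ in range(k)]  # wins[i][j]: times i beat j

    def ucb(i, j, t):
        # Optimistic estimate of P(i beats j); unseen pairs are explored first.
        n = wins[i][j] + wins[j][i]
        if n == 0:
            return math.inf
        return wins[i][j] / n + math.sqrt(alpha * math.log(t) / n)

    for t in range(1, budget + 1):
        # Candidates: systems that could still plausibly beat every other one.
        candidates = [
            i for i in range(k)
            if all(ucb(i, j, t) >= 0.5 for j in range(k) if j != i)
        ]
        c = rng.choice(candidates or list(range(k)))
        # Challenger: the system with the best optimistic chance against c.
        d = max((j for j in range(k) if j != c), key=lambda j: ucb(j, c, t))
        if simulate_duel(c, d, strengths, rng):
            wins[c][d] += 1
        else:
            wins[d][c] += 1

    def copeland(i):
        # Number of opponents that i beat in more than half of their duels.
        return sum(wins[i][j] > wins[j][i] for j in range(k) if j != i)

    return max(range(k), key=copeland), wins


if __name__ == "__main__":
    k = 5
    best, wins = active_top_system([1, 1, 1, 1, 9], budget=3000)
    print("pairs a uniform scheme must cover:", math.comb(k, 2))
    print("estimated top system:", best)
```

In this sketch the comparison budget concentrates on pairs that involve plausible winners, whereas a uniform scheme would spread annotations evenly over all $\binom{k}{2}$ pairs. In a real evaluation, `simulate_duel` would be replaced by a human judgment on a sampled test example.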

Comments:	Accepted at ACL 2022; 21 pages and 12 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2203.06063 [cs.CL]
	(or arXiv:2203.06063v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.06063

Submission history

From: Akash Kumar Mohankumar
[v1] Fri, 11 Mar 2022 16:39:15 UTC (1,729 KB)
[v2] Sun, 17 Apr 2022 15:17:00 UTC (1,729 KB)
