CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Abstract
The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, including auditory scene understanding, sound-object inference, temporal coherence, and the environmental context of the scene. While current methods focus on specific aspects, they often fail to provide an overall score that aligns well with human judgment. In this work, we propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models (LLMs) to evaluate candidate audio captions by directly asking LLMs for a semantic distance score. In our evaluations, CLAIR-A better predicts human judgments of quality than traditional metrics, with a 5.8% relative accuracy improvement over the domain-specific FENSE metric and up to 11% over the best general-purpose measure on the Clotho-Eval dataset. Moreover, CLAIR-A offers more transparency by allowing the language model to explain the reasoning behind its scores, with these explanations rated up to 30% better by human evaluators than those provided by baseline methods. CLAIR-A is made publicly available at https://github.com/DavidMChan/clair-a.
Index Terms:
Audio Captioning, Evaluation Metrics, Language Models, Auditory Scene Understanding
I Introduction & Background
Audio captioning, the task of generating a textual description for a sound, remains an ongoing and complex challenge in audio processing. Strong models designed for audio captioning must understand the sound and the context in which that sound occurs while expressing that information in natural language. A separate challenge, however, lies in evaluating the quality of these models. While the gold standard for evaluation is a human evaluation of caption quality [1], human evaluations are expensive and time-consuming. This expense creates a pressing need for high-quality automated measures of caption quality that can be used to compare the semantic distance between human-written ground truth captions and model-generated candidate captions.
Often, approaches to audio captioning are evaluated with traditional natural language generation measures based on N-gram matching, such as BLEU [2], which measures the N-gram precision of the candidate sentence against a set of ground truth references, and ROUGE [3], which measures N-gram recall. A key issue with N-gram evaluation alone is that such measures cannot easily account for candidate sentences that have the same semantic content as the references but share few (if any) N-grams with them. Some metrics were designed specifically to handle this issue: METEOR [4] attempts to solve this problem with synonym matching and stemming, and CIDEr [5] focuses the N-gram matching on “rare” N-grams (using TF-IDF), as they are more likely to contain relevant semantic information.
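The weakness of pure N-gram matching can be illustrated with a small sketch. The code below is not the BLEU or ROUGE implementation (it omits the brevity penalty, multi-reference handling, and averaging across N-gram orders); it only computes clipped N-gram precision and recall against a single reference, and the example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Clipped n-gram precision and recall of a candidate against one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum((cand & ref).values())  # counts clipped to the reference
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


# Illustrative (made-up) captions.
reference = "a dog barks loudly in the distance"
paraphrase = "far away a hound is howling"        # same meaning, few shared words
copycat = "a dog barks loudly in the kitchen"     # different scene, many shared words

print(ngram_overlap(paraphrase, reference))
print(ngram_overlap(copycat, reference))
```

In this toy case the paraphrase scores far lower than the caption that copies most of the reference wording but changes the location, which is the failure mode that synonym-aware and embedding-based measures try to address.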
A key and prevailing idea among automated measures is that it is necessary to understand the “relationships” between objects in the scene (either objects in images or sound sources in audio captions). SPICE [6] used the idea that image captions should parallel visual content by constructing “object-graphs” from parses of the captions, and comparing the ground truth object graphs with the candidate object graphs. SPIDEr [7], a linear combination of SPICE and CIDEr, further aims to improve the robustness of these measures.
On the other hand, some measures have followed the thesis that such semantic similarity is inherent in the structure of language models. BERTScore [8] and Sentence-BERT [9] encode candidate and reference sentences as vectors using large language models, and compute distances between these vectors to produce a final semantic similarity. The most prevalent current audio captioning measure, FENSE [10], extends this idea with an auxiliary score for local fluency detection to improve the robustness of the measure to non-fluent but semantically similar generated captions.
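As a rough sketch of the embedding-based idea, the snippet below scores a candidate by its cosine similarity to each reference embedding. The three-dimensional vectors are hypothetical stand-ins for the output of a real sentence encoder, and averaging over references is an assumption made here for illustration, not a description of how BERTScore, Sentence-BERT, or FENSE aggregate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embedding_score(candidate_vec, reference_vecs):
    """Mean cosine similarity between a candidate and each reference (assumed aggregation)."""
    sims = [cosine_similarity(candidate_vec, ref) for ref in reference_vecs]
    return sum(sims) / len(sims)


# Hypothetical sentence embeddings; a real encoder would produce these.
reference_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
close_candidate = [0.85, 0.15, 0.05]
far_candidate = [0.0, 0.2, 0.9]

print(embedding_score(close_candidate, reference_vecs))
print(embedding_score(far_candidate, reference_vecs))
```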
Some methods have aimed to combine the two approaches in a two-stage framework: SPICE+ [11] and ACES [12] are both audio-captioning-specific measures that first use a parser to extract either a parse graph (SPICE+) or explicit sound descriptors (ACES), and then use sentence-embedding methods to compare the resulting parses. With large language models (LLMs) such as GPT-4 [13] showing promising results in the parsing space, the recently introduced X-ACE [14] replaces many of the fixed components in SPICE with LLM-based parsers, and shows that the flexibility of LLMs can help relax some of the rigidity introduced by traditional domain-specific measures.
In this work, we go beyond such two-stage methods and present CLAIR-A, a novel, single-stage approach that takes a highly simplified view of combining parsing and similarity. Inspired by recent work in image captioning [15] and visual question answering [16, 17, 18], instead of explicitly parsing the sentences and then using semantic measures on the resulting parse, CLAIR-A asks an LLM to score the semantic similarity between a candidate caption and a reference set directly. By simply asking LLMs to produce a numeric score using in-context learning [19], CLAIR-A aims to leverage the already strong correlations with human judgment present in the base language models to solve semantic tasks without significant structural oversight. In addition to providing a score, we further ask the LLM to justify its answer in natural language. This justification is a unique benefit of CLAIR-A, which allows the numeric score to be introspectable, leading to a measure that is directly human-interpretable.
Our key contributions are summarized as follows:
- We introduce the CLAIR-A measure, a simple and interpretable measure for audio captioning evaluation.
- We demonstrate that CLAIR-A correlates better with human judgment than existing measures (both general and domain-specific), achieving up to 5.8% relative accuracy improvement over the domain-specific FENSE metric and up to 11% improvement over the best general-purpose measure on the Clotho-Eval dataset.
- We show that CLAIR-A is interpretable in human judgment: humans rate the justifications generated by CLAIR-A to be up to 30% higher quality than naïve baselines.
II CLAIR-A: LLMs as a Judge for Audio Captions
Given a candidate audio caption $c$ and a set of ground truth audio captions $R$, we would like to develop a score that accurately predicts the semantic distance between $c$ and $R$. CLAIR-A is inspired by CLAIR [15] (Criterion using LAnguage models for Image caption Rating), and similarly leverages in-context learning [19] to convert audio caption evaluation into a text-completion task, which is solved using an off-the-shelf large language model (LLM), here GPT-4o [13]. The prompt, given in Figure 2, encourages the LLM to produce a JSON output containing both (1) a numeric score between 1 and 100 and (2) a reason justifying that score, to provide interpretability. The numeric output of the LLM is used to generate the normalized LLM score:
$$\text{CLAIR-A}_{\text{LLM}}(c, R) = \frac{s_{\text{LLM}}(c, R)}{100} \tag{1}$$
where $s_{\text{LLM}}(c, R)$ is the numeric score returned by the LLM.
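The scoring step can be sketched as follows. This is a minimal illustration, not the reference implementation: the prompt text is a hypothetical placeholder (the actual prompt is the one in Figure 2), the JSON keys `score` and `reason` are assumed names, normalization by dividing by 100 is an assumption based on the 1 to 100 score range, and `fake_llm` is a stub standing in for a real model call.

```python
import json

# Hypothetical prompt; the paper's actual prompt is given in its Figure 2.
PROMPT_TEMPLATE = (
    "Candidate caption:\n{candidate}\n\n"
    "Reference captions:\n{references}\n\n"
    "Return JSON with an integer 'score' from 1 to 100 and a 'reason'."
)


def build_prompt(candidate, references):
    """Fill the (hypothetical) template with the candidate and references."""
    refs = "\n".join(f"- {r}" for r in references)
    return PROMPT_TEMPLATE.format(candidate=candidate, references=refs)


def llm_score(candidate, references, llm):
    """Query `llm` (any callable str -> str) and return (normalized score, reason)."""
    parsed = json.loads(llm(build_prompt(candidate, references)))
    score = float(parsed["score"])
    if not 1 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return score / 100.0, parsed["reason"]  # assumed normalization to (0, 1]


def fake_llm(prompt):
    """Stub standing in for a real LLM call; always returns the same judgment."""
    return json.dumps({"score": 80, "reason": "Both describe a barking dog."})


score, reason = llm_score("a dog barks", ["a dog is barking outside"], fake_llm)
print(score, reason)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client library.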
To ensure that the LLM produces a valid JSON output, we leverage the efficient guided generation introduced by Willard and Louf [20], which reformulates the text generation process of a standard LLM (usually temperature sampling from the likelihood distribution) by using a context-free grammar (CFG) to constrain the sampling process and ensure that sampled tokens conform to a valid JSON specification. A naive approach, which checks each generated token for conformance to the CFG and re-samples with that token masked if it is invalid, is prohibitively expensive because of LLMs’ large vocabulary size and the repeated evaluation of invalid tokens. To fix this, Willard and Louf [20] first construct a pushdown automaton parser for the grammar and, for every potential stack state of the parser, pre-compute the valid next tokens. These pre-computed token masks can then be efficiently queried (using a trie) at sampling time, with only one query needed per new token generated, guaranteeing that the next token generated by the LLM is a valid continuation under the CFG.
Unlike CLAIR, which re-samples when the model generates invalid output, this approach, which we implement using the Outlines library [20], guarantees valid parsing and is significantly more efficient than CLAIR when handling invalid JSON generations. Another benefit over re-sampling is that CLAIR-A is fully deterministic (given a fixed LLM) when the sampling process is constrained by the underlying CFG and sampled with temperature zero, a key property for an automated measure.
Compared to recent measures such as X-ACE [14], SPICE+ [11], and ACES [12], which require a multi-step process that leverages LLMs or fixed parsers to transform captions into audio graphs that are then used for graph matching (across sound events, sources, attributes, relationships, etc., either with LLMs or semantic vectors), CLAIR-A is a simple, highly interpretable, zero-shot approach that is easily transferable between languages (see Table III).
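The pre-computation idea can be sketched on a toy scale. The example below is a deliberate simplification and not the Outlines implementation: it uses a small finite-state machine instead of a pushdown automaton, a plain dictionary instead of a trie, a ten-token vocabulary, and a fixed preference table in place of model logits. The target language is limited to strings of the form `{"score":<digits>}`.

```python
import json

PREFIX = '{"score":'
START = 0
NEED_DIGIT = len(PREFIX)   # prefix consumed, at least one digit required
IN_DIGITS = NEED_DIGIT + 1  # one or more digits consumed
DONE = IN_DIGITS + 1        # closing brace consumed


def step(state, ch):
    """Advance the character-level automaton; return None if `ch` is invalid."""
    if state < NEED_DIGIT:
        return state + 1 if ch == PREFIX[state] else None
    if state in (NEED_DIGIT, IN_DIGITS):
        if ch in "0123456789":
            return IN_DIGITS
        if ch == "}" and state == IN_DIGITS:
            return DONE
    return None


def walk(state, token):
    """Run a multi-character token through the automaton."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def build_index(vocab):
    """Pre-compute, for every state, the valid tokens and the state each leads to."""
    index = {}
    for state in range(DONE + 1):
        index[state] = {}
        for tok in vocab:
            nxt = walk(state, tok)
            if nxt is not None:
                index[state][tok] = nxt
    return index


def constrained_greedy(index, preference, max_steps=20):
    """Greedy decoding restricted to the pre-computed valid tokens (one lookup per step)."""
    state, out = START, []
    for _ in range(max_steps):
        allowed = index[state]
        if not allowed:
            break
        tok = max(allowed, key=lambda t: preference[t])
        out.append(tok)
        state = allowed[tok]
    return "".join(out), state


# Toy vocabulary and made-up "model preferences"; the model likes "hello" best.
PREFERENCE = {"hello": 9.0, "}": 5.0, "42": 4.0, "7": 3.0, '{"': 1.0,
              "score": 1.0, '":': 1.0, "{": 0.5, '"': 0.5, " ": 0.0}
index = build_index(list(PREFERENCE))
text, final_state = constrained_greedy(index, PREFERENCE)
print(text, json.loads(text))
```

Even though the unconstrained preference table favors the token `hello`, the mask lookup never offers it, so decoding can only follow valid continuations and no re-sampling is needed.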
While the LLM score alone can be powerful for distinguishing semantically varied captions (Table I, Table II), we found that in practice many correct human captions are nuanced and similar to one another, while many machine-generated audio captions are of poor quality, so captions within each group often receive identical scores when assessed independently by the LLM. While this is not a problem for evaluating methods, it can be a problem when developing methods, as tied scores cannot tell a researcher which approaches are incremental improvements over others. To avoid this, we augment the base LLM score with an additional tie-breaking measure to get the final score:
$$\text{CLAIR-A}(c, R) = \text{CLAIR-A}_{\text{LLM}}(c, R) + \epsilon \, \Gamma(c, R) \tag{2}$$
where $\text{CLAIR-A}_{\text{LLM}}(c, R)$ is the normalized LLM score of candidate $c$ against references $R$, $\Gamma$ is a normalized tie-breaking method, and $\epsilon$ is its weight. In Section III, we discuss several choices for $\Gamma$, including random, Sentence-BERT, and FENSE, and show that tie-breaking significantly improves performance for samples that are either equally good or equally bad. Following the experiments in Table IV, we choose FENSE as the tie-breaking method for the reference implementation.
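A minimal sketch of the tie-breaking combination, assuming the final score is the LLM score plus a weighted tie-breaking score. The weight name `epsilon`, its value of 0.05, and all numeric scores below are illustrative assumptions rather than values from the paper.

```python
def final_score(llm_score, tie_break_score, epsilon):
    """LLM score plus a weighted, normalized tie-breaking score (assumed form)."""
    return llm_score + epsilon * tie_break_score


EPSILON = 0.05  # illustrative weight, not the paper's setting

# Two captions that the LLM scored identically (made-up numbers).
caption_a = final_score(0.60, tie_break_score=0.70, epsilon=EPSILON)
caption_b = final_score(0.60, tie_break_score=0.40, epsilon=EPSILON)

# A caption with a clearly higher LLM score but the worst tie-breaking score.
caption_c = final_score(0.70, tie_break_score=0.00, epsilon=EPSILON)

print(caption_a, caption_b, caption_c)
```

In this example the tie between the first two captions is resolved by the secondary measure, while the third caption still ranks highest because the weight is smaller than the gap between distinct LLM scores.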
Similar to Chan et al. [15], we also consider a variant, CLAIR-A$_E$, which averages across several LLMs to generate a mean LLM score, which is then summed with the weighted tie-breaking term. This simple ensemble approach takes into account several LLM choices, which can often encode different aspects of human judgment.
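The ensemble variant can be sketched in the same way, assuming an unweighted mean over per-model normalized scores followed by the same weighted tie-breaking term. The model names, scores, and weight below are illustrative placeholders.

```python
from statistics import mean


def ensemble_score(llm_scores, tie_break_score, epsilon):
    """Mean of per-model normalized LLM scores plus the weighted tie-breaker (assumed form)."""
    return mean(llm_scores.values()) + epsilon * tie_break_score


# Made-up normalized scores from three hypothetical judge models.
per_model = {"model_a": 0.80, "model_b": 0.60, "model_c": 0.70}
print(ensemble_score(per_model, tie_break_score=0.5, epsilon=0.1))
```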
III Results & Discussion
To validate the performance of CLAIR-A, we perform several experiments targeting different aspects of the measure, including the correlation of the measure with human judgment, the performance on multilingual data, and the quality of the interpretable reasoning behind each of the caption scores. We benchmark against both standard measures of text similarity (BLEU [2], METEOR [4], CIDEr [5], SPICE [6], and CLAIR [15]) and specialized measures for audio captioning (SPIDEr [7], SPICE+ [11], FENSE [10], ACES [12], and X-ACE [14]).
Human Judgment: Following Zhou et al. [10], we evaluate our measure on two datasets of pairwise human annotations for caption evaluation: the Clotho dataset [21] and the AudioCaps dataset [22]. These datasets, created by Zhou et al. [10], consist of 1,671 and 1,750 pairs of audio captions on Clotho and AudioCaps respectively, with each pair of candidate captions annotated with ground truth reference captions and human judgments of which caption better fits the ground truths. On this benchmark, the goal of a metric is to indicate reliably which caption is preferred by human raters, and we report the pair accuracy (a pair is correct under a metric if the preferred caption is assigned a higher score).
Mirroring the design of Vedantam et al. [5], tests are split into four categories: HC, which contains two correct human captions describing the source audio, HI, which contains one correct, and one known incorrect human-generated caption for the source audio, HM, which contains one correct human-generated caption, and one machine-generated caption for the source audio, and MM which contains two machine-generated captions for the source audio. Note in the HM and MM cases, it is not known if the machine-generated captions are correct or incorrect, rather, they were generated by a system to match the corresponding source audio.
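The pair-accuracy protocol can be sketched as below. The data layout (dictionaries with `category`, `references`, `preferred`, and `other` keys), the toy word-overlap metric, and the three example pairs are all made up for illustration; the overall figure is computed over all pairs pooled together, which is one reading of an aggregate accuracy.

```python
from collections import defaultdict


def pair_accuracy(pairs, metric):
    """Per-category and pooled accuracy; a pair counts only if preferred scores strictly higher."""
    correct, total = defaultdict(int), defaultdict(int)
    for pair in pairs:
        s_pref = metric(pair["preferred"], pair["references"])
        s_other = metric(pair["other"], pair["references"])
        total[pair["category"]] += 1
        correct[pair["category"]] += int(s_pref > s_other)  # ties count as errors
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_category, overall


def word_overlap(candidate, references):
    """Toy metric: fraction of candidate words found in any reference."""
    ref_words = {w for ref in references for w in ref.split()}
    words = candidate.split()
    return sum(w in ref_words for w in words) / len(words)


# Made-up annotated pairs; "preferred" is the caption the human raters chose.
pairs = [
    {"category": "HI", "references": ["a dog barks"],
     "preferred": "a dog barks loudly", "other": "a car engine starts"},
    {"category": "MM", "references": ["rain falls on a roof"],
     "preferred": "rain hitting a roof", "other": "rain on a roof"},
    {"category": "MM", "references": ["wind blows"],
     "preferred": "wind howls", "other": "wind whistles"},
]

per_category, overall = pair_accuracy(pairs, word_overlap)
print(per_category, overall)
```

The third pair shows why ties matter under this protocol: both captions receive the same score, so the metric cannot indicate the preferred one and the pair is counted as incorrect.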
The accuracy of the metrics on each of the categories (HC, HI, HM, and MM), along with a total aggregate accuracy (mean micro-average), are shown for Clotho in Table I and AudioCaps in Table II. We can see that CLAIR-A outperforms other measures in all categories, with dramatic improvements in the HM and MM categories. It is worth noting that even though X-ACE leverages audio similarity in addition to the text content, CLAIR-A still outperforms X-ACE overall, and significantly outperforms X-ACE without the cross-modal component. It is also clear that domain specialization for the measure is necessary: CLAIR alone, which is designed for image captioning, achieves only a 62.3% total accuracy.
Measure | HC | HI | HM | MM | All |
---|---|---|---|---|---|
BLEU@1 [2] | 51.0 | 90.6 | 65.5 | 50.3 | 59.0 |
BLEU@4 [2] | 52.9 | 88.9 | 65.1 | 53.2 | 60.5 |
METEOR [4] | 54.8 | 93.0 | 74.6 | 57.8 | 65.4 |
ROUGEL [3] | 56.2 | 90.6 | 69.4 | 50.7 | 60.5 |
CIDEr [5] | 51.4 | 91.8 | 70.3 | 56.0 | 63.2 |
SPICE [6] | 44.3 | 84.4 | 65.5 | 48.9 | 56.3 |
BERTScore [8] | 57.1 | 95.5 | 70.3 | 61.3 | 67.5 |
Sentence-BERT [9] | 60.0 | 95.5 | 75.9 | 66.9 | 71.8 |
CLAIR [15] | 42.9 | 95.9 | 72.8 | 54.8 | 62.3 |
SPICE+ [11] | 46.7 | 88.1 | 70.3 | 48.7 | 57.8 |
ACES [12] | 56.7 | 95.5 | 82.8 | 69.9 | 74.0 |
SPIDEr [7] | 53.3 | 93.4 | 70.3 | 57.0 | 64.2 |
FENSE [10] | 60.5 | 94.7 | 80.2 | 72.8 | 75.7 |
CLAIR-A + GPT-4o [13] | 62.4 | 97.1 | 83.6 | 77.9 | 79.7 |
CLAIR-A + Gemini v1.5 (pro) [23] | 59.0 | 95.9 | 83.2 | 75.1 | 77.4 |
CLAIR-A + Phi Mini (3.5B) [24] | 61.4 | 95.1 | 82.3 | 75.0 | 77.4 |
CLAIR-A$_E$ | 61.9 | 97.1 | 81.9 | 77.1 | 78.9 |
Measure | HC | HI | HM | MM | All |
---|---|---|---|---|---|
BLEU@1 [2] | 58.6 | 90.3 | 77.4 | 50.3 | 62.4 |
BLEU@4 [2] | 54.7 | 85.8 | 78.7 | 50.6 | 61.6 |
METEOR [4] | 66.0 | 96.4 | 90.0 | 60.1 | 71.7 |
ROUGEL [3] | 61.1 | 91.5 | 82.8 | 52.1 | 64.9 |
CIDEr [5] | 56.2 | 96.0 | 90.4 | 61.2 | 71.0 |
SPICE [6] | 50.2 | 83.8 | 77.8 | 49.1 | 59.7 |
BERTScore [8] | 60.6 | 97.6 | 92.9 | 65.0 | 74.3 |
Sentence-BERT [9] | 64.0 | 99.2 | 92.5 | 73.6 | 79.6 |
CLAIR [15] | 44.8 | 99.2 | 90.0 | 56.4 | 67.4 |
SPICE+ [11] | 59.1 | 85.4 | 83.7 | 49.0 | 62.0 |
ACES [12] | 64.5 | 95.1 | 89.5 | 82.0 | 83.0 |
SPIDEr [7] | 56.7 | 93.4 | 70.3 | 57.0 | 64.2 |
FENSE [10] | 64.5 | 98.4 | 91.6 | 84.6 | 85.3 |
X-ACE [14] | 69.7 | 99.6 | 93.7 | 76.8 | 81.8 |
X-ACE w/o. CM [14] | 64.7 | 94.3 | 91.6 | 72.6 | 78.2 |
CLAIR-A + GPT-4o [13] | 70.9 | 99.2 | 93.3 | 84.6 | 86.6 |
CLAIR-A + Gemini v1.5 (pro) [23] | 70.4 | 99.2 | 93.7 | 81.5 | 84.9 |
CLAIR-A + Phi Mini (3.5B) [24] | 70.0 | 98.0 | 94.1 | 80.7 | 84.3 |
CLAIR-A$_E$ | 72.4 | 99.6 | 93.3 | 81.5 | 85.2 |
Multilingual Evaluation: While most research in audio captioning is restricted to the English language, it is important to develop measures that transfer efficiently and effectively to multiple languages. To evaluate the performance of methods on multilingual data, we leveraged GPT-4o [13] to translate the Clotho dataset to Chinese, and we retained the human annotations from the English-language dataset. We then evaluate metrics zero-shot on the newly translated dataset and report their performance. Note that for CLAIR-A, we explore two variants: a zero-shot variant where the prompt is untranslated (remains in English), and a language-aware variant where the prompt is translated to the target language. We also leverage Sentence-BERT tie-breaking (as FENSE is incompatible with other languages). Our results are given in Table III, where we can see that CLAIR-A translates flexibly to new languages with minimal or no adaptation and with minimal loss of accuracy, specifically for the HC cases.
Measure | HC | HI | HM | MM | All |
---|---|---|---|---|---|
BLEU@1 | 50.0 | 91.0 | 70.3 | 57.1 | 63.4 |
BERTScore | 53.3 | 95.9 | 71.6 | 59.5 | 66.2 |
Sentence-BERT | 56.2 | 93.9 | 78.9 | 66.6 | 71.3 |
CLAIR-A | 61.9 | 96.3 | 77.6 | 70.8 | 74.5 |
CLAIR-A (Language Aware) | 61.9 | 95.5 | 82.3 | 75.6 | 77.9 |
Tie-Breaking: One of the primary issues with the original CLAIR measure is the propensity of the method to generate ties when faced with equally good or equally bad data (which can be seen in the HC and MM columns of Table I and Table II). Indeed, in these columns, the model assigns tied scores (a score difference of zero) over 31% of the time, leading to poor correlation. Thus, in Equation 2, we add a tie-breaking score to avoid inconclusive decisions. In Table IV we demonstrate the performance of several tie-breaking methods. We can see that any tie-breaking method (including random) significantly improves the performance of the method, with “intelligent” tie-breaking methods leading to marginal further improvements.
Measure | HC | HI | HM | MM | All |
---|---|---|---|---|---|
None | 42.4 | 96.3 | 75.9 | 64.7 | 68.3 |
Random | 58.6 | 97.1 | 82.3 | 74.7 | 77.6 |
Sentence-BERT, | 61.4 | 97.1 | 83.2 | 76.4 | 78.6 |
FENSE, | 61.9 | 97.1 | 83.2 | 77.3 | 79.2 |
FENSE, | 62.4 | 97.1 | 83.6 | 77.9 | 79.7 |
Reasoning: One of the key strengths of CLAIR-A is its ability to produce interpretable reasoning for its scores. To evaluate the quality of the reasoning, for 200 randomly sampled AudioCaps-Eval captions, we asked crowd-sourced workers to rate three aspects of the generated scores on a 5-point Likert scale: (1) how well the justification supported the score (Quality), (2) how fair the score was (Fairness), and (3) how well the score matched the justification (Match). To provide a baseline, we employed CLAIR-A with one of 36 variations of the justification “No particular reason”. The results are given in Table V, where we found that the CLAIR-A justifications both matched the score and were of significantly higher quality than the baselines. Further, we found that the justifications led humans to rate the score as more fair, with a significant improvement over no justification (but the same score).
Measure | Fairness | Match | Quality |
---|---|---|---|
FENSE | - | - | |
/No Reason | |||
Qualitative Evaluations: Some examples of the CLAIR-A measure are given in Figure 3. In the first example, CLAIR-A captures aggregate information in the set of baseline references and assigns a higher score to a caption that captures the entirety of that information, as opposed to closely matching a single caption. In the second, CLAIR-A penalizes poor grammar, whereas other measures are fooled by high N-gram overlap.
IV Conclusion
This paper introduces CLAIR-A, a simple and interpretable domain-specific LLM-based measure for audio captioning. We demonstrate that not only is our simple approach well-aligned with human judgments, but also that such a method is significantly more interpretable to downstream human users. While CLAIR-A is a first step towards LLM evaluation of audio captions, we hope that our work inspires further research into how LLMs can align with human judgment and can be used to develop simple and interpretable systems across a wide range of audio domains.
Acknowledgements: We would like to acknowledge Jeeweon Jung for his helpful contribution to the data used in the paper. GPT-4o was used to check the language of the paper (all sections) for spelling and grammar concerns.
References
- Drossos et al. [2017] K. Drossos, S. Adavanne, and T. Virtanen, “Automated audio captioning with recurrent neural networks,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 374–378.
- Papineni et al. [2002] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, July 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040
- Lin [2004] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Association for Computational Linguistics, July 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
- Agarwal and Lavie [2008] A. Agarwal and A. Lavie, “Meteor, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output,” in Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, June 2008, pp. 115–118. [Online]. Available: https://aclanthology.org/W08-0312
- Vedantam et al. [2015] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society, 2015, pp. 4566–4575.
- Anderson et al. [2016] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in European conference on computer vision. Springer, 2016, pp. 382–398.
- Liu et al. [2017] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 873–881.
- Zhang et al. [2020] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with BERT,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
- Reimers and Gurevych [2019] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, November 2019, pp. 3982–3992. [Online]. Available: https://aclanthology.org/D19-1410
- Zhou et al. [2022] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, “Can audio captions be evaluated with image caption metrics?” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 981–985.
- Gontier et al. [2023] F. Gontier, R. Serizel, and C. Cerisara, “Spice+: Evaluation of automatic audio captioning systems with pre-trained language models,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- Wijngaard et al. [2023] G. Wijngaard, E. Formisano, B. L. Giordano, and M. Dumontier, “Aces: Evaluating automated audio captioning models on the semantics of sounds,” in 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023, pp. 770–774.
- OpenAI [2024] OpenAI, “Hello gpt-4o,” 2024, accessed: 2024-09-12. [Online]. Available: https://openai.com/index/hello-gpt-4o/
- Wang et al. [2024] Q. Wang, J.-C. Gu, and Z.-H. Ling, “X-ace: Explainable and multi-factor audio captioning evaluation,” in Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 12 273–12 287.
- Chan et al. [2023] D. Chan, S. Petryk, J. Gonzalez, T. Darrell, and J. Canny, “Clair: Evaluating image captions with large language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 13 638–13 646.
- Bubeck et al. [2023] S. Bubeck et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” ArXiv preprint, vol. abs/2303.12712, 2023.
- Dettmers et al. [2023] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” ArXiv preprint, vol. abs/2305.14314, 2023.
- Chiang et al. [2023] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” 2023.
- Brown et al. [2020] T. B. Brown, B. Mann et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
- Willard and Louf [2023] B. T. Willard and R. Louf, “Efficient guided generation for llms,” arXiv preprint arXiv:2307.09702, 2023.
- Drossos et al. [2020] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 736–740.
- Kim et al. [2019] C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 119–132.
- Team et al. [2024a] G. Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024. [Online]. Available: https://arxiv.org/abs/2403.05530
- Team et al. [2024b] P. Team et al., “Phi-3 technical report: A highly capable language model locally on your phone,” 2024. [Online]. Available: https://arxiv.org/abs/2404.14219