Speech Translation with Speech Foundation Models and
Large Language Models: What is There and What is Missing?
Abstract
The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
Marco Gaido and Sara Papi and Matteo Negri and Luisa Bentivogli Fondazione Bruno Kessler, Trento, Italy {mgaido,spapi,negri,bentivo}@fbk.eu
1 Introduction
The natural language processing (NLP) landscape has recently undergone a paradigm shift with the emergence of foundation models (Bommasani et al., 2021). Among them, Large Language Models (LLMs) have revolutionized text-based NLP, showcasing remarkable capabilities across a wide range of NLP tasks (Radford et al., 2019). This unprecedented success has spurred research into creating foundation models for other modalities, including speech processing (Latif et al., 2023).
Building on the translation abilities of LLMs (Hendy et al., 2023; Jiao et al., 2023; Raunak et al., 2023; Zhu et al., 2023a; Xu et al., 2023) and the remarkable speech recognition and understanding capabilities achieved by Speech Foundation Models (SFMs) (Radford et al., 2023; Pratap et al., 2023; Communication et al., 2023), researchers are now actively exploring their combination. The resulting large multimodal models leverage, on the one hand, the SFM ability to encode speech content into rich and high-level representations and, on the other, the extensive linguistic knowledge of the LLM to generate fluent outputs and address a wide range of tasks (Chen et al., 2023b; Yu et al., 2023; Wang et al., 2023b; Rubenstein et al., 2023; Zhang et al., 2023a). Focusing on the speech-to-text translation (ST) task – the scope of this paper – the rapid pace of the advancements has led to multiple parallel endeavors, resulting in a variety of solutions. While all these efforts have the merit of demonstrating the viability and effectiveness of this line of work, their contemporaneity, along with methodological inconsistencies, hinders a fair comparison. For this reason, we provide a systematic analysis of the proposed SFM+LLM solutions for ST with the multiple goals of identifying their similarities and differences, organizing the lessons learned, and suggesting future research directions, along with best practices for insightful evaluations. At its core, this paper addresses two key questions:
- What is There? We survey the publicly available works that propose an SFM+LLM solution for ST, resulting in 9 papers (henceforth referred to as 1, …, 9), and analyze them (§2) focusing on two orthogonal aspects:
  - Architectural Building Blocks (§2.1): We delve into the SFM+LLM architectures, identifying a common abstraction made of 5 building blocks and underscoring similarities and differences in the SFM and LLM choices, along with the strategies adopted for combining them;
  - Training and Evaluation (§2.2): We inspect the training data, tasks, and strategies employed in the studies, as well as evaluation data and supported language pairs, gathering insights about promising solutions, and highlighting the sparsity of the current landscape;
- What is Missing? We conclude by underscoring the importance of establishing a standard training setting based on open data to ease direct comparability across works, and by identifying aspects that need further investigation to better understand the potential of SFM+LLM combination for ST (§3).