Speech Translation with Speech Foundation Models and
Large Language Models: What is There and What is Missing?

Marco Gaido    Sara Papi    Matteo Negri    Luisa Bentivogli
Fondazione Bruno Kessler, Trento, Italy

The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.

1 Introduction

The natural language processing (NLP) landscape has recently undergone a paradigm shift with the emergence of foundation models (Bommasani et al., 2021). Among them, Large Language Models (LLMs) have revolutionized text-based NLP, showcasing remarkable capabilities across a wide range of NLP tasks (Radford et al., 2019). This unprecedented success has spurred research into creating foundation models for other modalities, including speech processing (Latif et al., 2023).

Building on the translation abilities of LLMs (Hendy et al., 2023; Jiao et al., 2023; Raunak et al., 2023; Zhu et al., 2023a; Xu et al., 2023) and the remarkable speech recognition and understanding capabilities achieved by Speech Foundation Models (SFMs) (Radford et al., 2023; Pratap et al., 2023; Seamless Communication et al., 2023), researchers are now actively exploring their combination. The resulting large multimodal models leverage, on the one hand, the SFM ability to encode speech content into rich and high-level representations and, on the other, the extensive linguistic knowledge of the LLM to generate fluent outputs and address a wide range of tasks (Chen et al., 2023b; Yu et al., 2023; Wang et al., 2023b; Rubenstein et al., 2023; Zhang et al., 2023a). Focusing on the speech-to-text translation (ST) task – the scope of this paper – the rapid pace of the advancements has led to multiple parallel endeavors, resulting in a variety of solutions. While all these efforts have the merit of demonstrating the viability and effectiveness of this line of work, their contemporaneity, along with methodological inconsistencies, hinders a fair comparison. For this reason, we provide a systematic analysis of the proposed SFM+LLM solutions for ST with the multiple goals of identifying their similarities and differences, organizing the lessons learned, and suggesting future research directions, along with best practices for insightful evaluations. At its core, this paper addresses two key questions:

    What is There? We survey the publicly available works that propose an SFM+LLM solution for ST, resulting in 9 papers (henceforth referred to as [1], …, [9]), and analyze them (§2) focusing on two orthogonal aspects:

      Architectural Building Blocks (§2.1): We delve into the SFM+LLM architectures, identifying a common abstraction made of 5 building blocks and underscoring similarities and differences in the SFM and LLM choices, along with the strategies adopted for combining them;

      Training and Evaluation (§2.2): We inspect the training data, tasks, and strategies employed in the studies, as well as evaluation data and supported language pairs, gathering insights about promising solutions, and highlighting the sparsity of the current landscape;

    What is Missing? We conclude by underscoring the importance of establishing a standard training setting based on open data to ease direct comparability across works, and by identifying aspects that need further investigation to better understand the potential of SFM+LLM combination for ST (§3).


| # | Model | SFM | LA | MA | LLM | Prompt | PSMix |
|---|---|---|---|---|---|---|---|
| [1] | LST (Zhang et al., 2023b) | wav2vec 2.0 (Baevski et al., 2020) | 2× Conv1D | 1 FFN | LLaMa2 13B (Touvron et al., 2023) | None | Speech Only |
| [2] | SALM (Chen et al., 2023d) | NeMo STT Fast Conformer (NVIDIA, 2023) | 2× Conformer layers with 4× downsample | (fused with LA) | Megatron-LM 2B (Shoeybi et al., 2020) | Fixed Template | Speech Prepended |
| [3] | Speech-LLaMa (Wu et al., 2023) | in-house Transformer | CTC compression (Gaido et al., 2021) | 4 Transformer Layers + 1 FFN | LLaMa2 7B (Touvron et al., 2023) | Sampled from List of Templates | Speech Appended |
| [4] | COSMIC (Pan et al., 2023) | in-house Transformer | CTC compression (Gaido et al., 2021) | 4 Transformer Layers + 1 FFN | LLaMa2 7B (Touvron et al., 2023) | Fixed for ASR/ST, Open for SQA | Speech Prepended |
| [5] | SLM (Wang et al., 2023a) | USM (Zhang et al., 2023d) | Randomly discarded 75% vectors | 2 Transformer Layers | mT0-MT XXL 13B (Muennighoff et al., 2023) | Fixed for ASR/ST, Open for SIT | Speech Appended |
| [6] | SALMONN (Tang et al., 2024) | Whisper-large-v2 (Radford et al., 2023) + BEATs (Chen et al., 2023c) | Window-level Q-Former (Li et al., 2023) | (fused with LA) | Vicuna 13B (Chiang et al., 2023) | Fixed for ASR/ST, Open for Other Tasks | Speech Prepended |
| [7] | LLM-ST (Huang et al., 2023b) | Whisper-large-v3 (Radford et al., 2023) | NA | NA | GPT 13B (Brown et al., 2020) trained from scratch | Fixed | Speech Prepended |
| [8] | Qwen-Audio (Chu et al., 2023) | Whisper-large-v2 (Radford et al., 2023) | NA | NA | Qwen 7B (Bai et al., 2023) | Learned Tokens | Speech Prepended |
| [9] | Conformer LLaMa (Fathullah et al., 2023) | Custom Conformer trained on ASR data | Stacking 4 consecutive vectors | 1 FFN | LLaMa2 Chat 7B (Touvron et al., 2023) | LLaMa’s Default Structure | Speech Placed within the Prompt |

Table 1: Architectural components of SFM+LLM comprising speech foundation model (SFM), length adapter (LA), modality adapter (MA), large language model (LLM), prompt, and prompt-speech mixer (PSMix).

2 What is There?

In this section, we explore two key aspects of SFM+LLM research in ST: first, we delve into the architectural components of SFM+LLM models (§2.1); second, we examine the training and evaluation settings utilized in these studies (§2.2).

2.1 Architectural Building Blocks

The combination of SFMs and LLMs has so far been addressed with different architectures, which nonetheless share a common structure. Specifically, we identify 5 building blocks (see Figure 1): i) the SFM, ii) the length adapter, iii) the modality adapter, iv) the prompt-speech mixer that merges the textual prompt with the adapted speech representation, and v) the LLM. In Table 1, we summarize how the 9 analyzed papers have designed each component.
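The five-block abstraction can be read as a plain function composition. The following is a minimal sketch of that data flow, not the implementation of any surveyed system: the class name `SfmLlmPipeline`, its field names, and the toy stand-in components are all hypothetical and serve only to make the order of the blocks explicit.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Embeddings = List[Vector]  # one vector per time step or token


@dataclass
class SfmLlmPipeline:
    """Illustrative container for the five building blocks (hypothetical names)."""

    sfm: Callable[[List[float]], Embeddings]               # audio samples -> speech features
    length_adapter: Callable[[Embeddings], Embeddings]     # shortens the time axis
    modality_adapter: Callable[[Embeddings], Embeddings]   # maps features to the LLM space
    mixer: Callable[[Embeddings, Embeddings], Embeddings]  # merges prompt and speech
    llm: Callable[[Embeddings], str]                       # generates the translation

    def translate(self, audio: List[float], prompt: Embeddings) -> str:
        features = self.sfm(audio)
        shortened = self.length_adapter(features)
        adapted = self.modality_adapter(shortened)
        return self.llm(self.mixer(prompt, adapted))


# Toy stand-ins (assumptions for illustration only, not real models).
pipeline = SfmLlmPipeline(
    sfm=lambda audio: [[sample] for sample in audio],  # one 1-d "feature" per sample
    length_adapter=lambda feats: feats[::2],           # keep every second vector
    modality_adapter=lambda feats: [[2.0 * x for x in vec] for vec in feats],
    mixer=lambda prompt, speech: speech + prompt,      # speech prepended to the prompt
    llm=lambda inputs: f"<translation from {len(inputs)} input vectors>",
)

output = pipeline.translate(audio=[0.1, 0.2, 0.3, 0.4], prompt=[[1.0], [1.0]])
print(output)
```

Keeping each block behind its own callable mirrors the kind of controlled comparison discussed in this section: one component can be swapped while the others stay fixed.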


Speech Foundation Model (SFM).

The SFM is in charge of extracting rich semantic representations from the audio signal, which then have to be projected onto the LLM input semantic space to successfully connect the audio modality with the LLM. Looking at Table 1, we immediately notice that there is no consensus on the best SFM to choose. With the only exception of [3] and [4], which are from the same authors/research group, each work relies on a different SFM. Also, no work has compared different SFMs under controlled conditions within the same framework. Differences among SFMs encompass multiple aspects. First, their architectural backbone predominantly relies on either a Transformer (Vaswani et al., 2017) or Conformer (Gulati et al., 2020) encoder. Second, the diversity extends to the training data, which are not public for most SFMs, except for wav2vec 2.0 and NeMo STT Fast Conformer. Third, distinctions emerge in the supported languages, as most SFMs are limited to English, while the Whisper encoder supports 99 languages (Radford et al., 2023). Lastly, SFMs vary in the tasks they undertake, with some focusing solely on ASR, while Whisper extends its capabilities to ST and timestamp prediction. In addition, it is noteworthy that the majority of the SFMs used are not publicly available: none of the four works that trained a custom speech model released it, and USM, employed by [5], is not openly accessible. From these observations, it is evident that the works are not directly comparable, and it is often impossible for future research to make fair comparisons with existing solutions. The absence of a comparative analysis among SFMs also hinders our understanding of their impact on downstream performance, as well as the identification of the most suitable choice to guide future research.

Length Adapter (LA).

This module is designed to reduce the number of embeddings representing an audio sequence over the time axis. This operation serves a dual purpose. On the one hand, compressing the length of audio sequences – typically longer than the corresponding textual ones – contributes to reducing the difference between the two modalities, hence limiting the modality mismatch for the LLM, which is trained on textual inputs. On the other, as current LLMs exploit the Transformer architecture, whose self-attention has quadratic complexity with respect to the input sequence length, this compression prevents the already demanding memory and computational costs from becoming unaffordable. As already noted for the SFM, Table 1 highlights that a wide range of methods have been adopted for the LA. Also in this case, a comparison between different solutions in the same settings is missing, with one exception. In fact, [3] evaluates two LA methods based on a CTC module (Graves et al., 2006): i) the CTC compression, which averages vectors corresponding to the same CTC predictions, and ii) the CTC blank filtering (Wang et al., 2023e), which discards all the vectors corresponding to predictions of the <blank> token (a special token used by the CTC loss to denote the absence of speech content in the signal, e.g., silence). Their results indicate that the former leads to better ST quality. The only other existing comparison of LAs has been conducted in the scope of the related ASR task, where Yu et al. (2023) introduce a window-level Q-Former (Li et al., 2023) encoder, named Seg-QF, demonstrating its superiority over a plain Q-Former, a 1D convolutional layer, and the stacking of consecutive vectors followed by a feed-forward network (as done in [9]).
Seg-QF is very similar to the Window-level Q-Former used in [6]: it divides the speech sequence into chunks of a predetermined size (a hyperparameter $n_s$) that are independently processed by the Q-Former, which controls the length of the output sequence with the number of learned query vectors used (another hyperparameter $n_q$). As a result, this approach reduces the input length by a factor of $n_s/n_q$. It is important to notice that this finding was obtained by keeping both the SFM and the LLM frozen and without introducing any other module (e.g., without any modality adapter). Hence, its validity should be confirmed in different conditions where the LA does not have to learn the modality mapping as well. To sum up, although the literature offers insights into the most promising approaches for LAs, a comparative analysis covering all the proposed methods is missing. Moreover, as the LA controls the length of the LLM input and this is a critical factor for the computational costs of the resulting SFM+LLM models, their analysis should not be limited to the downstream (ST) performance but it should also consider the model efficiency, which has been disregarded so far.
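The length-reduction strategies described above can be illustrated on toy data. The sketch below is a simplified reading of them under stated assumptions, not the code of the cited works: it assumes the CTC prediction for each frame is already available as an integer, that index 0 is the <blank> token, that CTC compression merges runs of consecutive identical predictions (blank runs included), and that a window-level Q-Former also emits its query outputs for a final partial window. All function names and the numeric values are hypothetical.

```python
import math
from itertools import groupby
from typing import List

Vector = List[float]
BLANK = 0  # assumed index of the CTC <blank> token


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def ctc_compression(features: List[Vector], predictions: List[int]) -> List[Vector]:
    """Average each run of consecutive frames sharing the same CTC prediction."""
    compressed, position = [], 0
    for _, run in groupby(predictions):
        run_length = len(list(run))
        compressed.append(mean(features[position:position + run_length]))
        position += run_length
    return compressed


def ctc_blank_filtering(features: List[Vector], predictions: List[int]) -> List[Vector]:
    """Discard every frame whose CTC prediction is <blank>."""
    return [vec for vec, pred in zip(features, predictions) if pred != BLANK]


def stack_frames(features: List[Vector], n: int) -> List[Vector]:
    """Concatenate every n consecutive frames (leftover frames are dropped)."""
    usable = len(features) - len(features) % n
    return [sum(features[i:i + n], []) for i in range(0, usable, n)]


def window_qformer_output_length(num_frames: int, n_s: int, n_q: int) -> int:
    """Output length when each window of n_s frames yields n_q query vectors."""
    return math.ceil(num_frames / n_s) * n_q


features = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0], [9.0, 9.0], [11.0, 11.0]]
predictions = [0, 4, 4, 0, 7, 7]  # per-frame CTC argmax (toy values)

print(ctc_compression(features, predictions))      # 4 vectors: one per run
print(ctc_blank_filtering(features, predictions))  # 4 vectors: blanks removed
print(stack_frames(features, 4))                   # 1 vector of dimension 8
print(window_qformer_output_length(1500, 50, 5))   # 150, i.e. a 10x reduction
```

Note that the first two methods produce a content-dependent output length, whereas stacking and the window-level Q-Former shrink the sequence by a fixed factor, which matters when budgeting the LLM input length.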

Modality Adapter (MA).

The MA is a small trained network (compared to SFMs and LLMs) that maps the LA output into an embedding space compatible with the LLM. Compared to LA, its design has seen fewer variations: in some instances, the MA is a simple FFN ([1] and [9]) or is composed of a variable number of Transformer layers ([3], [4], and [5]). In other cases, it is fused with the LA ([2] and [6]) or even absent ([7] and [8]). The necessity and design of the MA depend on the training strategy adopted: if the LLM is finetuned, the MA can indeed be avoided (see [7] and [8]) as the LLM can learn to use a new embedding space (the one produced by the SFM and LA). In contrast, if the LLM and SFM are not adapted, the MA is necessary to enable their communication (as in [5]). Similarly, the complexity and size of the MA can vary depending on the training strategy: if a simple MA is adopted, the introduction of trainable adapters in the LLM or its finetuning may be required ([1] and [2]). However, the role and necessity of the MA have not been systematically investigated in existing works, which introduced it without conducting ablation studies or analyses on its size. This calls for a dedicated contrastive evaluation accounting for crucial factors like the training strategy and the quantity of finetuning paired data used.

Prompt-Speech Mixer (PSMix).

The goal of the PSMix is to merge the speech representation with the textual prompt that is to be fed to the LLM. Regarding the type of textual prompt, the analyzed works show little variability, with most of them relying on a fixed template to fill with the source and target language (e.g., “Translate the audio from <SOURCE LANGUAGE> to <TARGET LANGUAGE>”). In [3], the authors experimented with a list of templates to enhance system robustness, but they did not investigate its impact on performance. In [5], the authors demonstrated that a wider range of prompts enables the system to support unseen ones at inference time; however, in their setting, this corresponds to a broader set of tasks, making it challenging to isolate the contribution of different prompts and tasks to this ability. Regarding the PSMix strategies, most works rely on three concatenation solutions: prepending the speech representation to the prompt embeddings ([2], [4], [6], [7], and [8]), appending it to the prompt embeddings ([3] and [5]), or interleaving the speech representation with a prompt prefix and suffix ([9]). Only one work ([1]) completely omits the prompt and the PSMix module by directly feeding the LLM with the speech representations. To sum up, it is unclear whether using a fixed template for the prompt is the best choice, despite its prominent adoption, and which of the PSMix options (if any) leads to the best results. As these aspects have not yet been thoroughly studied, such interesting questions remain to be addressed in future works.
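The three concatenation strategies, plus the prompt-free variant, differ only in where the speech vectors sit in the LLM input sequence. The sketch below makes this explicit with string placeholders standing in for embedding vectors; the function names and the placeholder labels are hypothetical and only illustrate the ordering, not how any surveyed model implements it.

```python
from typing import List, TypeVar

T = TypeVar("T")  # stands in for an embedding vector


def prepend_speech(prompt: List[T], speech: List[T]) -> List[T]:
    """Speech representation first, then the prompt embeddings."""
    return speech + prompt


def append_speech(prompt: List[T], speech: List[T]) -> List[T]:
    """Prompt embeddings first, then the speech representation."""
    return prompt + speech


def place_within_prompt(prefix: List[T], speech: List[T], suffix: List[T]) -> List[T]:
    """Speech representation between a prompt prefix and a prompt suffix."""
    return prefix + speech + suffix


def speech_only(speech: List[T]) -> List[T]:
    """No prompt at all: the LLM receives only the speech representation."""
    return list(speech)


# Placeholders: P* are prompt embeddings, S* are adapted speech vectors.
prompt = ["P1", "P2"]
speech = ["S1", "S2", "S3"]

print(prepend_speech(prompt, speech))               # ['S1', 'S2', 'S3', 'P1', 'P2']
print(append_speech(prompt, speech))                # ['P1', 'P2', 'S1', 'S2', 'S3']
print(place_within_prompt(["P1"], speech, ["P2"]))  # ['P1', 'S1', 'S2', 'S3', 'P2']
print(speech_only(speech))                          # ['S1', 'S2', 'S3']
```

Since all four variants are one-line changes to the input layout, a controlled comparison among them would be cheap to set up relative to the other building blocks.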


Large Language Model (LLM).

The last component is the LLM, which takes the mixed prompt and speech representations as input to generate the final (textual) translation. In [5], Wang et al. (2023a) claim that “the pretrained LLM plays a crucial role in both training efficiency and model quality”, and that a stronger model on a given task leads to better performance. However, with the only exception of works by the same authors ([3] and [4]) that leverage LLaMa2 7B, all the SFM+LLM combinations exploit different LLMs without motivating the choice (e.g., through comparisons across models): [1] uses a larger LLaMa2 (i.e., the 13B version), [6] and [9] use a finetuned version (Vicuna 13B and LLaMa2 Chat 7B, respectively), while [2], [5], [7], and [8] use completely different models. The dominance of the LLaMa family is probably motivated by its openness and support for multiple languages. On the other hand, LLMs specifically built for the translation task are emerging (Xu et al., 2023) and represent a natural option to be considered in future works. In light of the high computing costs of these large models and the significant performance variations they can exhibit, establishing the best option for the ST task through systematic comparisons represents a priority for future research.


| # | Model | Train. Data | Train. Tasks | SFM fn | LLM fn | Eval. Data | Supported Lang. Pairs |
|---|---|---|---|---|---|---|---|
| [1] | LST | MuST-C, LibriSpeech | ASR, ST | No | Yes | MuST-C | en→{de, fr, es} |
| [2] | SALM | IWSLT 2023 | ASR, ST | No | LoRA | MuST-C | en→{de, ja} |
| [3] | Speech-LLaMa | in-house | ASR, ST | Yes | LoRA | CoVoST2 | {de, zh, ar, es, fr, it, nl, ja, ru, pt, et, sv, sl}→en |
| [4] | COSMIC | TEDLIUM3 | ASR, SQA | Yes | LoRA | TEDLIUM3, FLEURS | en→{es, fr, de, zh} |
| [5] | SLM | Alpaca, CoVoST2, YouTube (in-house) | ASR, ST, SIT | No | No | CoVoST2 | {fr, de, es, ca, it, ru, pt, fa, et, mn, nl, tr, ar, sv, lv, sl, ta, ja, id, cy, zh}→en |
| [6] | SALMONN | AudioCaps, Clotho, CoVoST2, GigaSpeech, IEMOCAP, LibriMix, LibriSpeech, MillionSong, MusicCaps, MusicNet, VoxCeleb1, WavCaps | ASR, ST, AAC, PR, ER, MC, OSR, SV, GR, SQA, AQA, MQA, ABST | No | LoRA | CoVoST2 | en→{de, ja, zh} |
| [7] | LLM-ST | CoVoST2, GigaST, MuST-C v2, WeNetSpeech + in-house | ASR, ST, MT, PT, ITN, TE, SRST, STST | Yes | Yes | CoVoST2, GigaST, MuST-C v2, in-house | en↔zh |
| [8] | Qwen-Audio | in-house | ASR, ST, OSR, DASR, SRWT, DID, LID, GR, ER, SV, SD, SER, KS, IC, SF, SAP, VSC, AAC, SEC, ASC, SED, AQA, SID, MC, MIC, MNA, MR, MQA | Yes | No | CoVoST2 | en→{de, zh}, {de, zh, es, fr, it}→en |
| [9] | Conformer LLaMa | MLS | ASR | Yes | No | N/A | N/A |

Table 2: Experimental settings adopted for finetuning the SFM+LLM models. "fn" stands for finetuning; the supported language pairs (Supported Lang. Pairs) are those on which the models have been evaluated.

2.2 Training and Evaluation

In this section, we describe the experimental settings of the analyzed papers by focusing on the datasets used for training and evaluation, the supported tasks and language pairs, and the techniques used for SFM and LLM finetuning. A summary is provided in Table 2. The task acronyms are defined in Appendix A, the training and evaluation datasets are reported in Appendix B, while the language codes follow the ISO 639 notation (https://www.iso.org/standard/74575.html).

Training Data.

The training datasets used in the 9 analyzed papers are different both in terms of type and quantity. Approximately half of the works (5 out of 9) leverage publicly available data only, both within and outside the ST domain. Despite this, none of them utilize similar data settings for finetuning their proposed SFM+LLM architecture: while LST [1], COSMIC [4], and Conformer LLaMa [9] rely on 1 or 2 datasets, SALM [2] uses all the 11 speech corpora available for the IWSLT 2023 Offline Speech Translation Shared Task (https://iwslt.org/2023/offline), and SALMONN [6] employs 12 different datasets during training. For SFM+LLM models trained on non-publicly available data, we observe that SLM [5] and LLM-ST [7] adopt a combination of in-house and open data, while Speech-LLaMa [3] and Qwen-Audio [8] exclusively use proprietary data. In addition to the lack of uniform training settings, none of the existing works has analyzed scaling laws and the effect of increasing the data size on the performance, rendering a fair comparison among the diverse approaches impractical.

Training Tasks.

Regarding training tasks, almost half of the SFM+LLM models (5 out of 9) extend their scope beyond pure ASR and ST applications. Among them, SLM [5] integrates a single additional task – instruction tuning – while LLM-ST [7] is trained with 4 translation-related tasks (e.g., translation explanation) and 2 speech-related tasks (e.g., timestamp estimation). In contrast, SALMONN [6] supports a diverse array of up to 10 additional tasks, spanning various domains such as SQA and emotion recognition. Qwen-Audio [8] takes this a step further by incorporating 26 more tasks, encompassing a comprehensive collection of speech, audio, and music-related tasks. In contrast, COSMIC [4] is exclusively trained on ASR and SQA but is also tested on ST. Similarly, Conformer LLaMa [9] is trained solely on the ASR task but demonstrates emergent capabilities in ST, although its translation quality is not systematically assessed (the ST ability of the model is only anecdotally reported). Interestingly, only three models – LST [1], SALM [2], and Speech-LLaMa [3] – are trained on the same tasks (ASR and ST). The effect of adding more tasks on the resulting model capabilities and ST performance is (partly) studied only in SALMONN [6], where tasks are progressively introduced. Specifically, its training strategy involves three stages: i) the first stage (pre-training) includes ASR and AAC, ii) the second stage (instruction tuning) includes 12 tasks, and iii) the third stage (activation tuning) finetunes the model on tasks with longer and more diverse responses, such as AQA and ABST. The last step is shown to increase the generalization and emergent abilities while impacting translation quality in a limited yet unclear way, as it improves in one direction (en-ja) but degrades in two other directions (en-de and en-zh).
All in all, the lack of uniformity in the training task selection hinders the comparability of the solutions, and the benefits of knowledge transfer across tasks (Hampton et al., 2017; Ke et al., 2021; Kubo et al., 2022) have yet to be studied in depth.

SFM and LLM finetuning.

As SFMs and LLMs are huge in terms of parameters, their training/finetuning is computationally expensive. This raises the question of whether they can be used without expensive adaptation. Regarding the SFM, more than half of the examined papers (5 out of 9) finetune it, while this component is kept frozen in the others. The LLM, instead, is adapted by 6 of the 9 analyzed papers, but only 2 ([1] and [7]) finetune the whole model. The others rely on Low-Rank Adaptation (LoRA; Hu et al., 2022), a widely employed technique for adapting LLMs to new datasets or tasks (Hu et al., 2023; Kwon et al., 2024). LoRA introduces trainable rank decomposition matrices into each layer of the architecture while keeping the original weights frozen, so as to significantly reduce the number of trainable parameters (by up to a factor of 10,000). Notably, only one study ([5]) presents results with both the SFM and LLM frozen, and also shows that LLM finetuning yields substantial performance gains. However, since finetuning is conducted on data from the same domain as the test set, the observed benefits may be partially attributed to domain adaptation, making it challenging to quantify the improvement solely attributable to finetuning. Similarly, Wu et al. (2023) ([3]) show that LoRA leads to improvements of ~1.5 BLEU, averaged over 13 CoVoST2 language pairs. We can conclude that, while LLM adaptation brings significant improvements, it is unclear whether the need for finetuning depends on the type of LLM used (e.g., would it be needed when using an LLM built for the translation task?), on the design of other modules (e.g., the MA), or on other training choices (e.g., adapting the SFM or not). Moreover, similar studies should be conducted for the even less explored SFM adaptation.
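To make the LoRA mechanism concrete, the following is a minimal sketch of a single adapted linear layer, where the frozen weight W is complemented by a low-rank update BA so that h = Wx + (alpha/r)·BAx. It is an illustration under stated assumptions rather than the setup of any surveyed paper: the class and function names are hypothetical, the 0.01 standard deviation for initializing A is an arbitrary choice, and the 4096-dimensional example size is only a plausible hidden dimension picked for the arithmetic.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(matrix: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one adapted weight matrix."""
    return d_out * d_in, rank * (d_in + d_out)


class LoRALinear:
    """h = W x + (alpha / r) * B A x, with W frozen and only A and B trainable."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weights, kept frozen
        # A (r x d_in): small random init; the 0.01 std is an assumption.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B (d_out x r): zero init, so the layer initially matches the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], rank=1)
print(layer.forward([1.0, 1.0]))         # [3.0, 7.0, 11.0]: unchanged before training
print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536): 256x fewer per matrix
```

The per-matrix saving shown here is smaller than the whole-model figure quoted above, since the overall reduction also depends on how many weight matrices receive an adapter and on the chosen rank.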

Evaluation Data.

The selection of consistent evaluation benchmarks is crucial for facilitating meaningful comparisons among different SFM+LLM models. However, our survey reveals disparate choices regarding the test sets employed. The main dichotomy regards the evaluation within English-to-many or many-to-English settings, as four papers focus on the former, two on the latter, and two on both (although [7] investigates only zh), while one ([9]) does not report evaluation results (its ST ability is only anecdotally reported). CoVoST2 emerges as the most widespread benchmark (used in 5 papers), thanks to its broad coverage of translation directions (15 in the English-to-many case, and 21 in the many-to-English one). For the English-to-many scenario, MuST-C is also frequently used (in 3 cases), while COSMIC [4] is the only one tested on TEDLIUM and FLEURS, and LLM-ST [7] complements CoVoST2 and MuST-C with GigaST and private in-house test sets. The tendency not to report scores computed on a common set of benchmarks and language pairs (as discussed below) makes comparisons in future works nearly impossible without an expensive re-implementation of existing methods, slowing down the progress in the area.

Supported Translation Languages.

Concerning the languages supported for translation, all the examined papers analyze different pairs but share the characteristic of being English-centric. They investigate either many-to-English directions ([3], [5], and [8]) or English-to-many directions ([1], [2], [4], [6], and [8]). In the context of many-to-English pairs, Qwen-Audio [8] encompasses 5 source languages, Speech-LLaMa [3] covers more than half of the CoVoST2 languages (13 out of 21), while SLM [5] includes all 21 CoVoST2 pairs. Conversely, all papers focusing on English-to-many directions cover 2 to 4 target languages, constituting a consistently smaller set compared to the many-to-English case. Lastly, LLM-ST [7] exclusively addresses a single translation pair (en↔zh). Interestingly, the majority of the works mainly report results for either de→en or en→de, which represents one of the most extensively analyzed language pairs in ST (Anastasopoulos et al., 2021, 2022; Agarwal et al., 2023), with [7] being the only work addressing neither of them. en↔zh emerges as the second most reported language setting (each direction being used by 4 papers). Despite these commonalities, it is evident that the choice of supported languages varies significantly between the works. Also, the impact on performance of supporting multiple languages – which can interfere or enable transfer learning between linguistically similar languages (Ruder et al., 2019; Durrani et al., 2021) – remains uncertain.
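The overlap between the evaluated directions can be tallied directly from the language pairs listed in Table 2. The snippet below is a small bookkeeping sketch: the dictionary layout and variable names are our own, while the language lists are transcribed from the table (paper 9 is omitted because it reports no ST evaluation).

```python
from collections import Counter

# Evaluated directions per paper, transcribed from Table 2 (keys are paper numbers).
EN_TO_X = {
    1: ["de", "fr", "es"],
    2: ["de", "ja"],
    4: ["es", "fr", "de", "zh"],
    6: ["de", "ja", "zh"],
    7: ["zh"],
    8: ["de", "zh"],
}
X_TO_EN = {
    3: ["de", "zh", "ar", "es", "fr", "it", "nl", "ja", "ru", "pt", "et", "sv", "sl"],
    5: ["fr", "de", "es", "ca", "it", "ru", "pt", "fa", "et", "mn", "nl", "tr",
        "ar", "sv", "lv", "sl", "ta", "ja", "id", "cy", "zh"],
    7: ["zh"],
    8: ["de", "zh", "es", "fr", "it"],
}

# Count how many papers evaluate each translation direction.
direction_counts = Counter()
for targets in EN_TO_X.values():
    direction_counts.update(f"en->{lang}" for lang in targets)
for sources in X_TO_EN.values():
    direction_counts.update(f"{lang}->en" for lang in sources)

# Papers evaluating German in at least one direction.
papers_with_german = {p for p, langs in EN_TO_X.items() if "de" in langs}
papers_with_german |= {p for p, langs in X_TO_EN.items() if "de" in langs}

print(direction_counts["en->de"], direction_counts["de->en"])  # 5 3
print(direction_counts["en->zh"], direction_counts["zh->en"])  # 4 4
print(sorted(papers_with_german))                              # [1, 2, 3, 4, 5, 6, 8]
```

Even for the most shared pair, the papers split across two directions and several test sets, so the number of works that are actually comparable on the same benchmark and direction is smaller than these counts suggest.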

3 What is Missing?

Alongside the need for focused and thorough analyses devoted to identifying the best-performing option for each architectural building block highlighted in §2.1 and the effects of the training choices discussed in §2.2, in the following we identify blind spots that need to be addressed for more grounded and insightful progress in research on SFM+LLM solutions for ST.

Open Standard Training Settings.

As highlighted throughout the previous section, the lack of common experimental settings prevents the fair and direct comparison of different works. The adoption of public and standard training settings holds paramount importance in advancing research and fostering progress within the scientific community (Koch et al., 2021). On the one hand, it enables the comparison among various works, thus providing actionable insights on the most promising architectural choices. On the other, it fosters inclusivity and accessibility, allowing researchers without access to large proprietary corpora to contribute to the field (Scandura and Iammarino, 2020; Dusdal and Powell, 2021), and thus supporting AI democratization in the development process (Seger et al., 2023). Therefore, we advocate for future research to adhere to established data-setting standards, paving the way for cumulative progress and shared understanding in the field. However, as experimenting with different data sizes is also an interesting topic and findings may vary depending on the datasets and the tasks used in the training stage (see §2.2), it is debatable which would be the most appropriate training set. In the English-to-many scenario, researchers commonly adhere to the IWSLT offline constrained data condition (https://iwslt.org/2023/offline), comprising ~4.5K hours of English audio, while, for smaller-scale experiments, MuST-C (~500 hours) is a widespread option. For many-to-English settings, CoVoST2, mTEDX (Salesky et al., 2021), and Europarl-ST (Iranzo-Sánchez et al., 2020) are open datasets with ST references and can be complemented with larger ASR resources such as CommonVoice (Ardila et al., 2020) and VoxPopuli. Notice that, by advocating for standardized and public training data, we do not imply that researchers should not investigate the effects of training in different data conditions.
Rather, we suggest that, for works primarily focused on defining new architectural solutions, reporting results for (at least) a standard setting would ease comparisons with other alternatives and reduce the overall computational costs.

Standard and Reliable Evaluation.

The comparison between different methods is currently hindered not only by different training conditions but also by the fact that practitioners do not systematically present results on a common open benchmark. Furthermore, all works rely solely on the BLEU metric (Papineni et al., 2002), except for [7], which additionally reports COMET (Rei et al., 2022). Although we acknowledge that BLEU is still widespread (Marie et al., 2021) despite the wide consensus on its limited dependability and correlation with human judgments (Freitag et al., 2022), we argue that this specific scenario exacerbates the need for adopting alternative metrics to assess translation quality. The main reason behind this argument is the well-known tendency of n-gram-based metrics to penalize translations generated by LLMs that are, in general, less literal (Zhao et al., 2023a; Liu et al., 2023). As a suggestion for future works, we recommend reporting at least one semantic metric (e.g., COMET), and, preferably, multiple metrics. We also believe that reporting scores on open and multilingual benchmarks, such as CoVoST2, would improve comparability across studies without the need for re-running costly experiments, thereby promoting faster, cost-effective progress within the research community.
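The penalty that n-gram matching imposes on less literal outputs can be seen on a toy example. The sketch below computes only the clipped n-gram precision that underlies BLEU; it is not the full metric (no brevity penalty, no geometric mean over n-gram orders, whitespace tokenization only), and the three English sentences are invented for illustration rather than taken from any benchmark.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of the n-grams occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int) -> float:
    """Share of hypothesis n-grams also found in the reference (counts clipped)."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Invented example: both hypotheses convey the meaning of the reference.
reference = "the meeting was postponed until next week"
literal = "the meeting was postponed to next week"
free = "they pushed the meeting back by a week"

for name, hypothesis in [("literal", literal), ("free", free)]:
    p1 = clipped_precision(hypothesis, reference, 1)
    p2 = clipped_precision(hypothesis, reference, 2)
    print(f"{name}: unigram precision {p1:.3f}, bigram precision {p2:.3f}")
```

The freer paraphrase keeps the meaning but shares far fewer n-grams with the reference, which is the kind of mismatch that a learned semantic metric is meant to reduce.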

Comparison with Standard ST Approaches.

In analogy to studies (Sperber and Paulik, 2020; Bentivogli et al., 2021) and initiatives (Agarwal et al., 2023) dedicated to assessing the strengths and weaknesses of the two established end-to-end and cascade ST paradigms, the emergence of the SFM+LLM solution calls for thorough and fine-grained evaluations to investigate its peculiarities compared to other, more traditional methods. This need is also motivated by a recent analysis in the context of text-to-text translation (Pang et al., 2024), which showed that LLMs are partly affected by long-standing problems of neural approaches (e.g., the translation of rare entities and out-of-domain settings), while they do not face others (e.g., the translation of long sentences) and suffer from new ones (e.g., pre-training data imbalance across domains and languages). Among the new problems, a noteworthy element is inference efficiency: the comparison with the standard methods – which typically rely on models of limited size (100-300M parameters) – should account for this aspect, which is critical for social, economic, and environmental reasons (Strubell et al., 2019). Along this line, important research directions include i) pruning the LLM (and possibly the SFM) in a task-aware manner (Ma et al., 2023; Zhu et al., 2023b; Dery et al., 2024), ii) dynamic layer selection during decoding (Xin et al., 2020; Geva et al., 2022; Xia et al., 2024), and iii) efficient decoding strategies (Stern et al., 2018; Chen et al., 2023a; Leviathan et al., 2023; Santilli et al., 2023). In addition, the speech source contains a wide range of information that can be exploited depending on the paradigm used (e.g., prosody is not handled by cascade systems; Zhou et al., 2024). As such, the ability of SFM+LLM models to leverage this information has to be investigated.
The fine-grained evaluation of these aspects calls for the comparison of SFM+LLM models with other paradigms on tailored test suites (King and Falkedal, 1990; Ribeiro et al., 2020), similar to those used in MT (Kocmi et al., 2023).
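
As a rough illustration of what a tailored test suite could look like in practice, the sketch below scores the outputs of a system per phenomenon rather than with a single corpus-level number. Everything in it is an assumption made for the example: the `SuiteItem` structure, the phenomenon labels, the substring-based pass criterion, and the toy data are hypothetical and are not taken from the cited test suites.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class SuiteItem(NamedTuple):
    phenomenon: str    # hypothetical label, e.g. "rare_entity"
    utterance_id: str  # identifier of the source audio segment
    must_contain: str  # string expected in a correct translation


def pass_rates(items: List[SuiteItem], hypotheses: Dict[str, str]) -> Dict[str, float]:
    """Fraction of items passed for each phenomenon (case-insensitive substring check)."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.phenomenon] += 1
        hypothesis = hypotheses.get(item.utterance_id, "")
        if item.must_contain.lower() in hypothesis.lower():
            passed[item.phenomenon] += 1
    return {phenomenon: passed[phenomenon] / total[phenomenon] for phenomenon in total}


# Invented items and system outputs, for illustration only.
items = [
    SuiteItem("rare_entity", "utt1", "Bolzano"),
    SuiteItem("rare_entity", "utt2", "Kessler"),
    SuiteItem("long_sentence", "utt3", "therefore"),
]
system_output = {
    "utt1": "The train to Bolzano is late.",
    "utt2": "The Kesler foundation is in Trento.",
    "utt3": "It rained all day and therefore the match was cancelled.",
}
print(pass_rates(items, system_output))
```

Running the same items through an SFM+LLM model, a cascade, and a standard end-to-end system would yield one pass rate per phenomenon and per paradigm, which is the type of fine-grained comparison advocated above.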

In-Context Learning Assessment.

One of the most interesting emergent abilities of LLMs (Wei et al., 2022) is their ability to exploit a few demonstrations or examples to perform a task or enhance their performance on it (Dong et al., 2023). This ability – referred to as in-context learning (ICL) (Brown et al., 2020) – is one of the main motivations for integrating an SFM and an LLM into a single ST model. However, the transfer of ICL capabilities of LLMs to the speech modality, and, even more so, to the SFM+LLM approach to ST, cannot be taken for granted. In fact, while the ICL ability of SFM+LLM has been successfully assessed in ASR and SLU (Gao et al., 2022; Hsu et al., 2023; Chen et al., 2023d) – also within the retrieval-augmented framework (Wang et al., 2023b), where the relevant context is retrieved from a knowledge base (Ram et al., 2023) – the only attempt in ST has not been similarly successful (Chen et al., 2023d). Moreover, SFMs like Whisper feature similar (yet limited) ICL capabilities (Peng et al., 2023; Wang et al., 2023c), which might make the SFM+LLM integration not even necessary. For this reason, investigating whether and to what extent the integration of SFMs with LLMs transfers the ICL ability of the latter to the ST task is an important and interesting avenue for future studies.
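
For readers less familiar with ICL, the sketch below shows one way a few-shot ST prompt could be laid out, with each demonstration pairing a speech segment with its translation and the final segment left open for the model to complete. It is only a hypothetical layout: the `<speech_i>` placeholders stand for positions where adapted speech features would be spliced in, and the template wording, the function `build_icl_prompt`, and the example translations are assumptions rather than the format used by any of the cited works.

```python
from typing import List, Tuple

SPEECH_SLOT = "<speech_{idx}>"  # hypothetical placeholder for adapted speech features


def build_icl_prompt(demo_translations: List[str], instruction: str) -> Tuple[str, int]:
    """Return a few-shot prompt and the number of speech slots it contains.

    Each demonstration pairs one speech slot with its reference translation;
    the last slot is the utterance to translate and its translation is left open.
    """
    lines = [instruction]
    for idx, translation in enumerate(demo_translations):
        lines.append(f"Audio: {SPEECH_SLOT.format(idx=idx)}")
        lines.append(f"Translation: {translation}")
    query_idx = len(demo_translations)
    lines.append(f"Audio: {SPEECH_SLOT.format(idx=query_idx)}")
    lines.append("Translation:")
    return "\n".join(lines), query_idx + 1


prompt, num_slots = build_icl_prompt(
    ["Guten Morgen.", "Vielen Dank."],
    "Translate the English audio into German.",
)
print(prompt)
print("speech slots:", num_slots)
```

Comparing translation quality with zero, one, and several demonstrations in such a layout is one simple way to measure whether the ICL ability of the LLM carries over to the speech input.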

4 Conclusions

The ST landscape has recently witnessed the emergence of a new paradigm, which is the combination of SFMs and LLMs into single ST models. To summarize the lessons learned and establish a unified framework, we surveyed the existing works on the topic, analyzing their architectural and training choices. As a result, we identified a common abstraction of the surveyed SFM+LLM architectures, which consists of five building blocks: i) the SFM extracting high-level speech representations, ii) the Length Adapter compressing such sequence of features, iii) the Modality Adapter mapping them to an embedding space more suitable for the LLM, iv) the Prompt-Speech Merger combining the speech information with an adequate prompt for the LLM, and v) the LLM generating the output translation. Subsequently, we highlighted how the current lack of standardized training recipes and evaluations hinders the direct comparison of the proposed approaches, limiting the possibility of extracting precise and unified indications. Lastly, we pointed out the need for thorough comparisons with standard ST approaches and in-depth investigations of the inherent capabilities of SFM+LLM solutions in order to shed light on their real potential for ST.
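
The five-block abstraction can be summarized as a simple data flow. The sketch below is a toy rendition in plain Python under explicit assumptions: mean pooling as the Length Adapter, a single linear projection as the Modality Adapter, and prepending the prompt embeddings as the Prompt-Speech Merger are just one possible choice for each block, and the `sfm` and `llm` callables are hypothetical stand-ins rather than real models.

```python
from typing import Callable, List

Vector = List[float]
Frames = List[Vector]


def length_adapter(features: Frames, stride: int) -> Frames:
    """Compress the time axis by averaging every `stride` consecutive frames."""
    pooled = []
    for start in range(0, len(features), stride):
        window = features[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(frame[d] for frame in window) / len(window) for d in range(dim)])
    return pooled


def modality_adapter(features: Frames, weights: List[Vector]) -> Frames:
    """Linearly project each frame; `weights` has shape (llm_dim, sfm_dim)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights] for frame in features]


def prompt_speech_merger(prompt_embeddings: Frames, speech_embeddings: Frames) -> Frames:
    """Prepend the prompt embeddings to the speech embeddings."""
    return prompt_embeddings + speech_embeddings


def translate(audio, sfm: Callable, llm: Callable, prompt_embeddings: Frames,
              weights: List[Vector], stride: int = 2):
    features = sfm(audio)                                         # i) SFM
    compressed = length_adapter(features, stride)                 # ii) Length Adapter
    mapped = modality_adapter(compressed, weights)                # iii) Modality Adapter
    llm_input = prompt_speech_merger(prompt_embeddings, mapped)   # iv) Prompt-Speech Merger
    return llm(llm_input)                                         # v) LLM


# Toy stand-ins: the "SFM" emits 2-dimensional frames, the "LLM" reports its input length.
toy_sfm = lambda audio: [[float(s), 2.0 * float(s)] for s in audio]
toy_llm = lambda vectors: len(vectors)
toy_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps 2 dimensions to 3
toy_prompt = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]     # two prompt embeddings

print(translate([1, 2, 3, 4], toy_sfm, toy_llm, toy_prompt, toy_weights))
```

Here four input frames are pooled into two, projected to the dimensionality of the toy LLM, and concatenated with two prompt vectors, so the toy LLM receives four vectors in total.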


Limitations

Our survey of the existing studies on the integration of an SFM and an LLM has been limited to the context of the speech-to-text translation task. We did not target the more generic case of the SFM+LLM integration, as it is already covered by existing surveys (Latif et al., 2023) and would have prevented us from going more in depth on the specific works within the page limit. For the same reason, we have not included works that target different tasks, such as ASR (Chen et al., 2023b; Hono et al., 2023; Lakomkin et al., 2023; Radhakrishnan et al., 2023; Yu et al., 2023), nor models that focus on audio phenomena different from human speech, e.g., sound classification/captioning or music processing (Deshmukh et al., 2023; Han et al., 2023; Shu et al., 2023; Zhang et al., 2023c; Zhao et al., 2023b). While they can inspire effective solutions for the ST case as well, assessing their portability to the ST field may be the focus of dedicated works.

Moreover, we only discussed the solutions in terms of their ST performance, without considering their generalization capability and/or capacity to perform different downstream tasks, as this would have added complexity to an analysis that targeted a specific task of spoken language processing. However, we believe that applying foundation models to a specific task does not necessarily imply that they need to retain generic capabilities, although this is a desirable property. Similarly, we have not delved into the ethical considerations and implications of such solutions (Manvi et al., 2024; Schramowski et al., 2022), as we believe that they should be the topic of tailored and dedicated evaluations, also in comparison with traditional ST approaches, as mentioned in §3.

Lastly, the study did not include models that can perform the ST task as part of a cascade approach, where audio is converted into text or other units (Wang et al., 2023d; Zhang et al., 2023a), nor those that use the LLM only to understand user requests and forward their actual processing to SFMs (Huang et al., 2023a). While these represent viable solutions, we argue that their progress and analysis are directly linked to the ASR quality of SFMs and the MT quality of LLMs, which are extensively studied in specific works (Radford et al., 2023; Hendy et al., 2023; Communication et al., 2023; Xu et al., 2023; Pang et al., 2024).


Acknowledgements

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper has received funding from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People).


