Investigating Decoder-only Large Language Models for Speech-to-text Translation
Chao-Wei Huang1,*, Hui Lu2,*, Hongyu Gong3, Hirofumi Inaguma3, Ilia Kulikov3, Ruslan Mavlyutov3, Sravya Popuri3

1 National Taiwan University, 2 The Chinese University of Hong Kong, 3 AI at Meta

f07922069@csie.ntu.edu.tw
Abstract

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into speech-to-text translation (S2TT).

…without relying on a large amount of proprietary data. Furthermore, we analyze the design choices of each aspect of our experimental pipeline. Our contributions can be summarized as follows:

• We propose a decoder-only architecture for integrating LLMs into S2TT.
[Figure 1: Decoder-only LLM]
…tokens. Such a method has two drawbacks, as shown in the original paper: 1) its performance is highly dependent on the quality of the speech encoder, and 2) the discretization makes fine-tuning the speech encoder hard, requiring the speech encoder to be fine-tuned with ASR first [23]. Our paper demonstrates that using continuous speech representations mitigates these issues, achieving better performance while being simpler. Speech-LLaMA and SALM both proposed bridging LLMs and speech encoders with a modality adapter and fine-tuning the LLMs via LoRA [26]. Additionally, Speech-LLaMA introduced a CTC compressor to shorten the speech input. Our paper adopts a simpler length adapter in our architecture and applies LNA fine-tuning [3], demonstrating that it outperforms LoRA significantly.
3. Our Method

In this section, we introduce the task formulations (§3.1), the architectural design of our model (§3.2), how the model is trained (§3.3), and the parameter-efficient fine-tuning techniques (§3.4).
3.1. Task Formulations

The task of speech-to-text translation is to translate the source speech input S into the corresponding target translation Y = {y_1, · · · , y_M} in the target language. Following prior work [23], we define two formulations of our S2TT model: 1) the standard formulation, where the model generates the target sequence directly, f : S → Y; and 2) the chained formulation, where the model first generates the transcription in the source language and then the translation in the target language, f_chain : S → {Y_ASR, Y}, where Y_ASR denotes the transcription of the source speech. It is also common to include ASR during training as an auxiliary task, formulated as f_ASR : S → Y_ASR. Therefore, we include f, f_chain, and f_ASR during training for multi-task training, and perform either f or f_chain during inference.
3.2. Architecture

Our model consists of a speech encoder and a text decoder, both using the Transformer architecture [27]. An illustration of the overall architecture is shown in Figure 1.

Our speech encoder is based on W2v-BERT [15], a self-supervised pre-trained speech encoder. For a given speech input S, we first convert the speech signal to fbank features with 80 mel banks, a context window of 25 ms, and a stride of 10 ms. The speech encoder E_s encodes the fbank features F = {F_1, · · · , F_n} into their corresponding hidden representations E_s(F), where n denotes the sequence length of the fbank features. Speech frames are typically much more granular than text tokens; therefore, we employ a length adapter on top of the speech encoder to reduce the length of the speech representations. The length adapter consists of a single 1-dimensional convolutional layer with a filter size and stride of k, which reduces the length of the speech representations by a factor of k.
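For concreteness, here is a minimal sketch of such a length adapter in PyTorch; the hidden dimension and the value of k below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    """Downsample speech representations k-fold with a single 1-D convolution."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        # Filter size and stride are both k, so the output length is roughly n / k.
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=k, stride=k)

    def forward(self, speech_repr: torch.Tensor) -> torch.Tensor:
        # speech_repr: (batch, n, hidden_dim); Conv1d expects (batch, channels, length).
        x = self.conv(speech_repr.transpose(1, 2))
        return x.transpose(1, 2)  # (batch, ~n/k, hidden_dim)

# Example: 100 encoder frames reduced with k = 8 to 12 positions.
adapter = LengthAdapter(hidden_dim=1024, k=8)
print(adapter(torch.randn(2, 100, 1024)).shape)  # torch.Size([2, 12, 1024])
```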
The text decoder is based on LLaMA-2 [9], a decoder-only large language model pre-trained on 2 trillion text tokens with a language modeling objective. The speech inputs and text inputs are encoded with their corresponding encoders, i.e., the speech encoder for speech inputs and the text embedding layer for text inputs. Subsequently, the encoded representations are concatenated and fed to the transformer decoder. In other words, we treat the encoded speech representations the same as the text embeddings, without discretizing them as done in prior work [23]. A triangular mask is applied to the self-attention layers to restrict tokens from attending to later positions. More formally, consider an interleaving sequence of text and speech sequences X = {X^1, F, X^2}, where X^i = {x^i_1, · · · , x^i_{|X^i|}} denotes a text sequence, X^1 denotes the prefix text, and X^2 denotes the suffix text. After encoding, the input sequence to the transformer decoder becomes X = {Emb(X^1), E_s(F), Emb(X^2)}, where Emb denotes the text embedding layer. Note that we flatten the sequences in X before processing them with the decoder. Finally, we apply a linear transformation to the decoder outputs to obtain the logits for predicting the next token, O = W^T D(X), where D denotes the transformer decoder and W ∈ R^{h×|V|} is a trainable matrix, with |V| denoting the vocabulary size.
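A sketch of how this input assembly could look in PyTorch; `embed`, `decoder`, and `W` are stand-ins for the LLM's embedding layer, the causal transformer decoder, and the output projection, not the authors' actual modules.

```python
import torch

def build_decoder_logits(prefix_ids, speech_repr, suffix_ids, embed, decoder, W):
    """Assemble X = {Emb(X^1), E_s(F), Emb(X^2)} and project to next-token logits.

    prefix_ids, suffix_ids: (batch, len) token ids for the prefix/suffix text
    speech_repr:            (batch, m, h) continuous speech encoder + length adapter output
    embed:                  text embedding layer Emb of the LLM
    decoder:                causal transformer decoder D (triangular mask applied internally)
    W:                      (h, |V|) output projection matrix
    """
    x = torch.cat([embed(prefix_ids), speech_repr, embed(suffix_ids)], dim=1)
    return decoder(x) @ W  # O = W^T D(X): logits over the vocabulary at each position
```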
3.3. Training

As described above, we include three formulations, i.e., f, f_chain, and f_ASR, for multi-task training. To let our model distinguish among tasks, we provide different instructions in natural language for each task t. The instructions include a description of the task, the source language, and the target language. We format the instruction I and the source speech S into the input sequence X with a template. The target sequence for training is formatted as:

Y' = \begin{cases}
  \text{Translation: } Y & \text{if } t = f \\
  \text{Transcription: } Y_{\text{ASR}} & \text{if } t = f_{\text{ASR}} \\
  \text{Transcription: } Y_{\text{ASR}} \text{ Translation: } Y & \text{if } t = f_{\text{chain}}
\end{cases}
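The exact template wording is not given in this excerpt; the sketch below assumes a simple instruction format purely to illustrate how the instruction and the task-dependent target Y' could be constructed.

```python
def format_example(task, src_lang, tgt_lang, transcription=None, translation=None):
    """Build (instruction, target) strings for the three training formulations."""
    if task == "s2tt":        # f: direct speech-to-text translation
        instruction = f"Translate the {src_lang} speech into {tgt_lang}."
        target = f"Translation: {translation}"
    elif task == "asr":       # f_ASR: auxiliary speech recognition
        instruction = f"Transcribe the {src_lang} speech."
        target = f"Transcription: {transcription}"
    else:                     # f_chain: transcribe first, then translate
        instruction = f"Transcribe the {src_lang} speech, then translate it into {tgt_lang}."
        target = f"Transcription: {transcription} Translation: {translation}"
    return instruction, target
```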
Ar Ca Cy De Es Et Fa Fr Id It Ja Lv Mn Nl Pt Ru Sl Sv Ta Tr Zh Avg
Trained with Proprietary Data
Whisper-large [28] 39.7 31.8 18.0 21.5 36.3 15.0 36.4 48.1 30.9 26.1 0.1 13.9 41.2 19.3 51.6 43.3 21.6 40.1 42.9 4.2 28.3 29.1
USM-M [29] - - - - - - - - - - - - - - - - - - - - - 30.7
Speech-LLaMA [24] 28.2 - - 27.1 27.9 18.7 - 25.2 - 25.9 19.9 - - 36.5 32.0 36.8 22.7 29.0 - - 12.3 -
AudioPaLM [23] 48.7 38.4 25.5 13.7 43.4 30.0 44.8 56.2 44.3 25.9 7.6 35.0 48.3 29.4 57.3 55.6 42.6 44.2 53.3 9.0 41.0 37.8
Trained with Public Data Only
XLS-R [16] 17.1 33.8 9.4 14.0 33.6 11.1 37.6 16.5 34.9 3.5 1.6 19.5 31.7 12.9 41.8 39.5 19.6 39.2 29.6 0.5 16.7 22.1
ComSL-large [30] - - - - - - - - - - - - - - - - - - - - - 31.5
AudioPaLM† [23] - - - - - - - - - - - - - - - - - - - - - 33.1
W2vBERT+NLLB 42.0 38.4 18.3 52.0 39.3 23.6 41.3 47.3 39.4 18.1 3.5 18.4 43.0 27.2 50.8 51.7 36.9 42.2 40.6 6.2 33.2 34.0
Ours 45.8 39.5 22.4 56.9 41.2 20.4 44.5 54.5 42.9 24.4 0.9 21.9 46.8 26.3 56.1 53.3 42.7 45.1 53.7 5.3 34.4 37.1
Table 1: Main results on the X-En test sets of CoVoST 2 (%). We report corpus BLEU scores computed with SacreBLEU. The best
results among models trained with public data are bolded. † The result reported in the AudioPaLM paper [23] when trained on only
public datasets.
Given a source speech S, an instruction I, and the formatted target sequence Y', the training objective is to minimize the S2TT loss:

L(S, Y') = -\frac{1}{M'} \sum_{i=1}^{M'} \log P(y'_i \mid S, I, Y'_{<i})

where M' denotes the length of Y' and P(y'_i | S, I, Y'_{<i}) denotes the probability of y'_i predicted by the model given the source speech, the instruction, and the prior tokens Y'_{<i} in the target sequence. The predicted probability is obtained by applying the softmax function to the logits O.
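In practice this amounts to token-level cross-entropy computed only over the positions of Y', with the instruction and speech positions excluded from the sum; a minimal sketch, assuming logits and labels are already shifted for next-token prediction:

```python
import torch
import torch.nn.functional as F

def s2tt_loss(logits, labels, target_mask):
    """Cross-entropy averaged over the M' tokens of the formatted target Y'.

    logits:      (batch, seq_len, vocab) decoder outputs O
    labels:      (batch, seq_len) next-token ids
    target_mask: (batch, seq_len) 1 for positions belonging to Y', 0 for prompt/speech
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_nll * target_mask).sum() / target_mask.sum()
```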
3.4. Parameter-efficient Fine-tuning

Large language models have billions of parameters, making it computationally expensive and inefficient to fine-tune all of the parameters during training. It is common to apply parameter-efficient fine-tuning techniques when fine-tuning LLMs on downstream tasks to improve efficiency and mitigate catastrophic forgetting. To this end, we employ and compare two parameter-efficient fine-tuning techniques in this paper: LNA fine-tuning [3] and Low Rank Adaptation (LoRA) [26].

3.4.1. LNA Fine-tuning

LayerNorm and Attention (LNA) fine-tuning adapts pretrained language and speech models to S2TT by fine-tuning only the layer normalization and the multi-head attention layers [3]. This method greatly reduces the number of trainable parameters during fine-tuning and avoids catastrophic forgetting, thus improving the downstream performance for multilingual speech-to-text translation. Since the pretrained language model we use is a decoder-only transformer model, we apply LNA fine-tuning and fine-tune only the layer normalization and the self-attention layers in the transformer decoder.
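A sketch of LNA-style parameter selection; the name substrings used to match layer-norm and self-attention modules are assumptions about a Hugging Face-style LLaMA naming scheme, not the authors' code.

```python
def apply_lna(model):
    """Freeze everything except layer-norm and self-attention parameters."""
    trainable_keys = ("layernorm", "norm", "self_attn")  # assumed name substrings
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name.lower() for key in trainable_keys)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_train}")
```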
3.4.2. Low Rank Adaptation (LoRA)

LoRA injects trainable rank-decomposition matrices into the projection layers of a transformer model, serving as a residual path alongside each projection layer. During fine-tuning, only the decomposition matrices are updated, while all of the pretrained parameters are frozen. Thus, the number of trainable parameters is significantly reduced. The decomposition matrices can be merged into the original projection matrix after fine-tuning. Therefore, there is no additional computation and no additional parameters compared to the pretrained transformer model during inference, making LoRA a common technique for adapting large language models efficiently.
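A minimal sketch of a LoRA-augmented projection and the post-training merge; the rank r, scaling alpha, and initialization are illustrative defaults, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen projection W plus a trainable low-rank residual B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self) -> nn.Linear:
        """Fold B @ A into the original weight so inference has no extra cost."""
        self.base.weight.data += self.scale * (self.B @ self.A)
        return self.base
```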
4. Experiments

4.1. Experimental Setup

We train and evaluate our models on publicly available datasets. For training, we use the CoVoST 2 [11], Common Voice 11 [12], and VoxPopuli [13] datasets. CoVoST 2 is a speech-to-text translation dataset consisting of 21 languages. The dataset includes human-labeled translation pairs from 21 languages to English (X-En) and from English to 15 languages (En-X). Common Voice is a collection of speech-text pairs where the speech was recorded by annotators given the text transcription. VoxPopuli consists of speech from the European Parliament with the corresponding transcriptions and interpretations in 15 languages.

We conduct in-domain evaluation on the test sets of CoVoST 2. Additionally, we perform zero-shot evaluation on FLEURS [4], a dataset that aims to evaluate the out-of-domain generalizability of speech translation models. Note that for all datasets, we only use the directions that are present in CoVoST 2. We report BLEU scores from SacreBLEU and, additionally, the model-based COMET score with the wmt22-comet-da model [31].
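Scoring a single direction could look roughly like this; the SacreBLEU call is standard, while the COMET usage is sketched in comments following the unbabel-comet package's documented interface (treat it as an assumption, not the authors' evaluation script).

```python
import sacrebleu

def evaluate_direction(hypotheses, references):
    """Corpus BLEU with SacreBLEU for one X-En direction."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score

# COMET (wmt22-comet-da) via the unbabel-comet package, per its documentation:
# from comet import download_model, load_from_checkpoint
# model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# scores = model.predict([{"src": s, "mt": h, "ref": r} for s, h, r in data], gpus=1)
```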
4.2. Implementation Details

We employ as the speech encoder a pretrained W2v-BERT [15] model released in [32], which has 600M parameters and is pretrained on 4 million hours of speech data with a self-supervised objective. The text decoder is initialized with LLaMA2-7B-chat [9].

We implement our models, training, and evaluation procedures with the Fairseq2 library¹. During training, the effective batch size is set to 800K speech frames, or 8,000 seconds of speech input. We optimize the model with the AdamW optimizer and set the learning rate to 1e-4. The learning rate is warmed up for 5,000 steps and linearly decayed until the maximum number of steps is reached, which is set to 60,000. We fine-tune all parameters of the speech encoder and apply parameter-efficient fine-tuning methods to the text decoder. All experiments are conducted on 32 NVIDIA A100 GPUs.

¹ https://github.com/facebookresearch/fairseq2
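The described schedule can be expressed as a simple function of the training step; whether the decay reaches exactly zero at the final step is an assumption of this sketch.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=5000, max_steps=60000):
    """Linear warmup to peak_lr, then linear decay toward zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

# learning_rate(2500) == 5e-5, learning_rate(32500) == 5e-5, learning_rate(60000) == 0.0
```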
Model                 CoVoST 2          FLEURS
                      BLEU   COMET      BLEU   COMET
Encoder-Decoder
  W2vBERT+NLLB        33.8   80.1       22.7   76.4
  W2vBERT+LLaMA2      33.3   79.7       18.2   74.2
Decoder-only
  Ours                36.3   81.4       23.4   77.6

Table 2: Results of different architectures (%). We report the average BLEU score and COMET score on the 21 X-En directions on CoVoST 2 and FLEURS.

Model                 CoVoST 2 (BLEU)   FLEURS (BLEU)
  Ours                37.1              23.4
  - f_chain           35.4              22.4
  - f_ASR             36.4              22.7
  - f_ASR & f_chain   35.8              22.5

Table 4: Results of formulation ablation (%).

5. Discussion

In this section, we conduct various experiments to analyze and discuss the details of our proposed method.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT 2019, 2019, pp. 4171–4186.
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[7] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv:2109.01652, 2021.
[8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[9] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[10] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, "MuST-C: a Multilingual Speech Translation Corpus," in Proceedings of NAACL-HLT 2019, 2019, pp. 2012–2017.
[11] C. Wang, A. Wu, and J. Pino, "CoVoST 2 and massively multilingual speech-to-text translation," arXiv:2007.10310, 2020.
[12] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of LREC 2020, 2020, pp. 4218–4222.
[13] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of ACL-IJCNLP 2021, 2021, pp. 993–1003.
[14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[15] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, "W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
[16] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, "XLS-R: Self-supervised cross-lingual speech representation learning at scale," in Proc. Interspeech 2022, 2022, pp. 2278–2282.
[22] W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "Connecting speech encoder and large language model for ASR," arXiv:2309.13963, 2023.
[23] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov et al., "AudioPaLM: A large language model that can speak and listen," arXiv:2306.12925, 2023.
[24] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., "On decoder-only architecture for speech-to-text and large language model integration," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[25] Z. Chen, H. Huang, A. Andrusenko, O. Hrinchuk, K. C. Puvvada, J. Li, S. Ghosh, J. Balam, and B. Ginsburg, "SALM: Speech-augmented language model with in-context learning for speech recognition and translation," arXiv:2310.09424, 2023.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[29] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang et al., "Google USM: Scaling automatic speech recognition beyond 100 languages," arXiv:2303.01037, 2023.
[30] C. Le, Y. Qian, L. Zhou, S. Liu, Y. Qian, M. Zeng, and X. Huang, "ComSL: A composite speech-language model for end-to-end speech-to-text translation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[31] R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins, "COMET-22: Unbabel-IST 2022 submission for the metrics shared task," in Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 578–585.
[32] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman et al., "SeamlessM4T: Massively multilingual & multimodal machine translation," arXiv:2308.11596, 2023.
[33] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., "No language left behind: Scaling human-centered machine translation," arXiv:2207.04672, 2022.