Investigating Decoder-only Large Language Models for Speech-to-text Translation
Chao-Wei Huang1,*, Hui Lu2,*, Hongyu Gong3, Hirofumi Inaguma3, Ilia Kulikov3, Ruslan Mavlyutov3, Sravya Popuri3

1 National Taiwan University, 2 The Chinese University of Hong Kong, 3 AI at Meta

f07922069@csie.ntu.edu.tw
Abstract

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs into speech-to-text translation (S2TT).

…without relying on a large amount of proprietary data. Furthermore, we analyze the design choices of each aspect of our experimental pipeline. Our contributions can be summarized as follows:

• We propose a decoder-only architecture for integrating LLMs into S2TT.
[Figure 1: Decoder-only LLM]
…tokens. Such a method has two drawbacks, as shown in the original paper: 1) its performance is highly dependent on the quality of the speech encoder, and 2) the discretization makes fine-tuning the speech encoder hard, requiring the speech encoder to be fine-tuned with ASR first [23]. Our paper demonstrates that using continuous speech representations mitigates these issues, achieving better performance while being simpler. Speech-LLaMA and SALM both proposed bridging LLMs and speech encoders with a modality adapter and fine-tuning the LLMs via LoRA [26]. Additionally, Speech-LLaMA introduced a CTC compressor to shorten the speech input. Our paper adopts a simpler length adapter in our architecture and applies LNA fine-tuning [3], demonstrating that it outperforms LoRA significantly.
3. Our Method

In this section, we introduce the task formulations (§3.1), the architectural design of our model (§3.2), how the model is trained (§3.3), and the parameter-efficient fine-tuning techniques (§3.4).
3.1. Task Formulations

The task of speech-to-text translation is to translate the source speech input S into the corresponding target translation Y = {y_1, · · · , y_M} in the target language. Following prior work [23], we define two formulations of our S2TT model: 1) the standard formulation, where the model generates the target sequence directly, f : S → Y; and 2) the chained formulation, where the model first generates the transcription in the source language and then the translation in the target language, f_chain : S → {Y_ASR, Y}, where Y_ASR denotes the transcription of the source speech. It is also common to include ASR during training as an auxiliary task, formulated as f_ASR : S → Y_ASR. Therefore, we include f, f_chain, and f_ASR during training for multi-task training, and perform either f or f_chain during inference.
3.2. Architecture

Our model consists of a speech encoder and a text decoder, both using the Transformer architecture [27]. An illustration of the overall architecture is shown in Figure 1.

Our speech encoder is based on W2v-BERT [15], a self-supervised pre-trained speech encoder. For a given speech input S, we first convert the speech signal to fbank features with 80 mel banks, a context window of 25 ms, and a stride of 10 ms. The speech encoder E_s encodes the fbank features F = {F_1, · · · , F_n} into their corresponding hidden representations E_s(F), where n denotes the sequence length of the fbank features. Speech frames are typically much more granular than text tokens; therefore, we employ a length adapter on top of the speech encoder to reduce the length of the speech representations. The length adapter consists of a single 1-dimensional convolutional layer with a filter size and stride of k, which reduces the length of the speech representations by a factor of k.
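For concreteness, here is a minimal sketch of such a length adapter in PyTorch; the hidden dimension and the value of k below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    """Downsample speech representations k-fold with a single 1-D convolution."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        # Filter size and stride are both k, so the output length is roughly n / k.
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=k, stride=k)

    def forward(self, speech_repr: torch.Tensor) -> torch.Tensor:
        # speech_repr: (batch, n, hidden_dim); Conv1d expects (batch, channels, length).
        x = self.conv(speech_repr.transpose(1, 2))
        return x.transpose(1, 2)  # (batch, ~n/k, hidden_dim)

# Example: 100 encoder frames reduced with k = 8 to 12 positions.
adapter = LengthAdapter(hidden_dim=1024, k=8)
print(adapter(torch.randn(2, 100, 1024)).shape)  # torch.Size([2, 12, 1024])
```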
The text decoder is based on LLaMA-2 [9], a decoder-only large language model pre-trained on 2 trillion text tokens with a language modeling objective. The speech inputs and text inputs are encoded with their corresponding encoders, i.e., the speech encoder for speech inputs and the text embedding layer for text inputs. Subsequently, the encoded representations are concatenated and fed to the transformer decoder. In other words, we treat the encoded speech representations the same as the text embeddings, without discretizing them as done in prior work [23]. A triangular mask is applied to the self-attention layers to restrict tokens from attending to later positions. More formally, consider an interleaving sequence of text and speech sequences X = {X^1, F, X^2}, where X^i = {x^i_1, · · · , x^i_{|X^i|}} denotes a text sequence, X^1 denotes the prefix text, and X^2 denotes the suffix text. After encoding, the input sequence to the transformer decoder becomes X = {Emb(X^1), E_s(F), Emb(X^2)}, where Emb denotes the text embedding layer. Note that we flatten the sequences in X before processing them with the decoder. Finally, we apply a linear transformation to the decoder outputs to obtain the logits for predicting the next token, O = W^T D(X), where D denotes the transformer decoder and W ∈ R^{h×|V|} is a trainable matrix, with |V| denoting the vocabulary size.
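A sketch of how this input assembly could look in PyTorch; `embed`, `decoder`, and `W` are stand-ins for the LLM's embedding layer, the causal transformer decoder, and the output projection, not the authors' actual modules.

```python
import torch

def build_decoder_logits(prefix_ids, speech_repr, suffix_ids, embed, decoder, W):
    """Assemble X = {Emb(X^1), E_s(F), Emb(X^2)} and project to next-token logits.

    prefix_ids, suffix_ids: (batch, len) token ids for the prefix/suffix text
    speech_repr:            (batch, m, h) continuous speech encoder + length adapter output
    embed:                  text embedding layer Emb of the LLM
    decoder:                causal transformer decoder D (triangular mask applied internally)
    W:                      (h, |V|) output projection matrix
    """
    x = torch.cat([embed(prefix_ids), speech_repr, embed(suffix_ids)], dim=1)
    return decoder(x) @ W  # O = W^T D(X): logits over the vocabulary at each position
```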
3.3. Training

As described above, we include three formulations, i.e., f, f_chain, and f_ASR, for multi-task training. To let our model distinguish among tasks, we provide different instructions in natural language for each task t. The instructions include a description of the task, the source language, and the target language. We format the instruction I and the source speech S into the input sequence X with a template. The target sequence for training is formatted as:

Y' = \begin{cases}
  \text{Translation: } Y & \text{if } t = f \\
  \text{Transcription: } Y_{\text{ASR}} & \text{if } t = f_{\text{ASR}} \\
  \text{Transcription: } Y_{\text{ASR}} \text{ Translation: } Y & \text{if } t = f_{\text{chain}}
\end{cases}
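The exact template wording is not given in this excerpt; the sketch below assumes a simple instruction format purely to illustrate how the instruction and the task-dependent target Y' could be constructed.

```python
def format_example(task, src_lang, tgt_lang, transcription=None, translation=None):
    """Build (instruction, target) strings for the three training formulations."""
    if task == "s2tt":        # f: direct speech-to-text translation
        instruction = f"Translate the {src_lang} speech into {tgt_lang}."
        target = f"Translation: {translation}"
    elif task == "asr":       # f_ASR: auxiliary speech recognition
        instruction = f"Transcribe the {src_lang} speech."
        target = f"Transcription: {transcription}"
    else:                     # f_chain: transcribe first, then translate
        instruction = f"Transcribe the {src_lang} speech, then translate it into {tgt_lang}."
        target = f"Transcription: {transcription} Translation: {translation}"
    return instruction, target
```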
Ar Ca Cy De Es Et Fa Fr Id It Ja Lv Mn Nl Pt Ru Sl Sv Ta Tr Zh Avg
Trained with Proprietary Data
Whisper-large [28] 39.7 31.8 18.0 21.5 36.3 15.0 36.4 48.1 30.9 26.1 0.1 13.9 41.2 19.3 51.6 43.3 21.6 40.1 42.9 4.2 28.3 29.1
USM-M [29] - - - - - - - - - - - - - - - - - - - - - 30.7
Speech-LLaMA [24] 28.2 - - 27.1 27.9 18.7 - 25.2 - 25.9 19.9 - - 36.5 32.0 36.8 22.7 29.0 - - 12.3 -
AudioPaLM [23] 48.7 38.4 25.5 13.7 43.4 30.0 44.8 56.2 44.3 25.9 7.6 35.0 48.3 29.4 57.3 55.6 42.6 44.2 53.3 9.0 41.0 37.8
Trained with Public Data Only
XLS-R [16] 17.1 33.8 9.4 14.0 33.6 11.1 37.6 16.5 34.9 3.5 1.6 19.5 31.7 12.9 41.8 39.5 19.6 39.2 29.6 0.5 16.7 22.1
ComSL-large [30] - - - - - - - - - - - - - - - - - - - - - 31.5
AudioPaLM† [23] - - - - - - - - - - - - - - - - - - - - - 33.1
W2vBERT+NLLB 42.0 38.4 18.3 52.0 39.3 23.6 41.3 47.3 39.4 18.1 3.5 18.4 43.0 27.2 50.8 51.7 36.9 42.2 40.6 6.2 33.2 34.0
Ours 45.8 39.5 22.4 56.9 41.2 20.4 44.5 54.5 42.9 24.4 0.9 21.9 46.8 26.3 56.1 53.3 42.7 45.1 53.7 5.3 34.4 37.1
Table 1: Main results on the X-En test sets of CoVoST 2 (%). We report corpus BLEU scores computed with SacreBLEU. The best
results among models trained with public data are bolded. † The result reported in the AudioPaLM paper [23] when trained on only
public datasets.
Given a source speech S, an instruction I, and the formatted target sequence Y', the training objective is to minimize the S2TT loss:

L(S, Y') = -\frac{1}{M'} \sum_{i=1}^{M'} \log P(y'_i \mid S, I, Y'_{<i})

where M' denotes the length of Y' and P(y'_i | S, I, Y'_{<i}) denotes the probability of y'_i predicted by the model given the source speech, the instruction, and the prior tokens Y'_{<i} in the target sequence. The predicted probability is obtained by applying the softmax function to the logits O.
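In practice this amounts to token-level cross-entropy computed only over the positions of Y', with the instruction and speech positions excluded from the sum; a minimal sketch, assuming logits and labels are already shifted for next-token prediction:

```python
import torch
import torch.nn.functional as F

def s2tt_loss(logits, labels, target_mask):
    """Cross-entropy averaged over the M' tokens of the formatted target Y'.

    logits:      (batch, seq_len, vocab) decoder outputs O
    labels:      (batch, seq_len) next-token ids
    target_mask: (batch, seq_len) 1 for positions belonging to Y', 0 for prompt/speech
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_nll * target_mask).sum() / target_mask.sum()
```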
3.4. Parameter-efficient Fine-tuning

Large language models have billions of parameters, making it computationally expensive and inefficient to fine-tune all of the parameters during training. It is common to apply parameter-efficient fine-tuning techniques when fine-tuning LLMs on downstream tasks to improve efficiency and mitigate catastrophic forgetting. To this end, we employ and compare two parameter-efficient fine-tuning techniques in this paper: LNA fine-tuning [3] and Low Rank Adaptation (LoRA) [26].

3.4.1. LNA Fine-tuning

LayerNorm and Attention (LNA) fine-tuning adapts pretrained language and speech models to S2TT by fine-tuning only the layer normalization and the multi-head attention layers [3]. This method greatly reduces the number of trainable parameters during fine-tuning and avoids catastrophic forgetting, thus improving the downstream performance for multilingual speech-to-text translation. Since the pretrained language model we use is a decoder-only transformer model, we apply LNA fine-tuning and fine-tune only the layer normalization and the self-attention layers in the transformer decoder.
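A sketch of LNA-style parameter selection; the name substrings used to match layer-norm and self-attention modules are assumptions about a Hugging Face-style LLaMA naming scheme, not the authors' code.

```python
def apply_lna(model):
    """Freeze everything except layer-norm and self-attention parameters."""
    trainable_keys = ("layernorm", "norm", "self_attn")  # assumed name substrings
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name.lower() for key in trainable_keys)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_train}")
```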
3.4.2. Low Rank Adaptation (LoRA)

LoRA injects trainable rank-decomposition matrices into the projection layers of a transformer model, serving as a residual path alongside each projection layer. During fine-tuning, only the decomposition matrices are updated, while all of the pretrained parameters are frozen. Thus, the number of trainable parameters is significantly reduced. The decomposition matrices can be merged into the original projection matrix after fine-tuning. Therefore, there is no additional computation and no additional parameters compared to the pretrained transformer model during inference, making LoRA a common technique for adapting large language models efficiently.
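A minimal sketch of a LoRA-augmented projection and the post-training merge; the rank r, scaling alpha, and initialization are illustrative defaults, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen projection W plus a trainable low-rank residual B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self) -> nn.Linear:
        """Fold B @ A into the original weight so inference has no extra cost."""
        self.base.weight.data += self.scale * (self.B @ self.A)
        return self.base
```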
4. Experiments

4.1. Experimental Setup

We train and evaluate our models on publicly available datasets. For training, we use the CoVoST 2 [11], Common Voice 11 [12], and VoxPopuli [13] datasets. CoVoST 2 is a speech-to-text translation dataset consisting of 21 languages. The dataset includes human-labeled translation pairs from 21 languages to English (X-En) and from English to 15 languages (En-X). Common Voice is a collection of speech-text pairs where the speech was recorded by annotators given the text transcription. VoxPopuli consists of speech from the European Parliament with the corresponding transcriptions and interpretations in 15 languages.

We conduct in-domain evaluation on the test sets of CoVoST 2. Additionally, we perform zero-shot evaluation on FLEURS [4], a dataset that aims to evaluate the out-of-domain generalizability of speech translation models. Note that for all datasets, we only use the directions that are present in CoVoST 2. We report BLEU scores from SacreBLEU and, additionally, the model-based COMET score with the wmt22-comet-da model [31].
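Scoring a single direction could look roughly like this; the SacreBLEU call is standard, while the COMET usage is sketched in comments following the unbabel-comet package's documented interface (treat it as an assumption, not the authors' evaluation script).

```python
import sacrebleu

def evaluate_direction(hypotheses, references):
    """Corpus BLEU with SacreBLEU for one X-En direction."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score

# COMET (wmt22-comet-da) via the unbabel-comet package, per its documentation:
# from comet import download_model, load_from_checkpoint
# model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# scores = model.predict([{"src": s, "mt": h, "ref": r} for s, h, r in data], gpus=1)
```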
4.2. Implementation Details

We employ as the speech encoder a pretrained W2v-BERT [15] model released in [32], which has 600M parameters and is pretrained on 4 million hours of speech data with a self-supervised objective. The text decoder is initialized with LLaMA2-7B-chat [9].

We implement our models, training, and evaluation procedures with the Fairseq2 library¹. During training, the effective batch size is set to 800K speech frames, or 8,000 seconds of speech input. We optimize the model with the AdamW optimizer and set the learning rate to 1e-4. The learning rate is warmed up for 5,000 steps and linearly decayed until the maximum number of steps is reached, which is set to 60,000. We fine-tune all parameters of the speech encoder and apply parameter-efficient fine-tuning methods to the text decoder. All experiments are conducted on 32 NVIDIA A100 GPUs.

¹ https://github.com/facebookresearch/fairseq2
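The described schedule can be expressed as a simple function of the training step; whether the decay reaches exactly zero at the final step is an assumption of this sketch.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=5000, max_steps=60000):
    """Linear warmup to peak_lr, then linear decay toward zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

# learning_rate(2500) == 5e-5, learning_rate(32500) == 5e-5, learning_rate(60000) == 0.0
```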
Model                 CoVoST 2          FLEURS
                      BLEU   COMET      BLEU   COMET
Encoder-Decoder
  W2vBERT+NLLB        33.8   80.1       22.7   76.4
  W2vBERT+LLaMA2      33.3   79.7       18.2   74.2
Decoder-only
  Ours                36.3   81.4       23.4   77.6

Table 2: Results of different architectures (%). We report the average BLEU score and COMET score on the 21 X-En directions on CoVoST 2 and FLEURS.

Model                 CoVoST 2 (BLEU)   FLEURS (BLEU)
  Ours                37.1              23.4
  - f_chain           35.4              22.4
  - f_ASR             36.4              22.7
  - f_ASR & f_chain   35.8              22.5

Table 4: Results of formulation ablation (%).

5. Discussion

In this section, we conduct various experiments to analyze and discuss the details of our proposed method.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT 2019, 2019, pp. 4171–4186.
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[7] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv:2109.01652, 2021.
[8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[9] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv:2307.09288, 2023.
[10] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, "MuST-C: a Multilingual Speech Translation Corpus," in Proceedings of NAACL-HLT 2019, 2019, pp. 2012–2017.
[11] C. Wang, A. Wu, and J. Pino, "CoVoST 2 and massively multilingual speech-to-text translation," arXiv:2007.10310, 2020.
[12] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of LREC 2020, 2020, pp. 4218–4222.
[13] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of ACL-IJCNLP 2021, 2021, pp. 993–1003.
[14] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[15] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, "W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250.
[16] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, "XLS-R: Self-supervised cross-lingual speech representation learning at scale," in Proc. Interspeech 2022, 2022, pp. 2278–2282.
[22] W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "Connecting speech encoder and large language model for ASR," arXiv:2309.13963, 2023.
[23] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov et al., "AudioPaLM: A large language model that can speak and listen," arXiv:2306.12925, 2023.
[24] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., "On decoder-only architecture for speech-to-text and large language model integration," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[25] Z. Chen, H. Huang, A. Andrusenko, O. Hrinchuk, K. C. Puvvada, J. Li, S. Ghosh, J. Balam, and B. Ginsburg, "SALM: Speech-augmented language model with in-context learning for speech recognition and translation," arXiv:2310.09424, 2023.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[29] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang et al., "Google USM: Scaling automatic speech recognition beyond 100 languages," arXiv:2303.01037, 2023.
[30] C. Le, Y. Qian, L. Zhou, S. Liu, Y. Qian, M. Zeng, and X. Huang, "ComSL: A composite speech-language model for end-to-end speech-to-text translation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[31] R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins, "COMET-22: Unbabel-IST 2022 submission for the metrics shared task," in Proceedings of the Seventh Conference on Machine Translation (WMT), 2022, pp. 578–585.
[32] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman et al., "SeamlessM4T: Massively multilingual & multimodal machine translation," arXiv:2308.11596, 2023.
[33] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., "No language left behind: Scaling human-centered machine translation," arXiv:2207.04672, 2022.