Boyong Wu, Chao Yan, Haoran Pu

Transferable speech-to-text large language model alignment module

Abstract

By leveraging the power of large language models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal works can tackle challenging tasks such as spoken translation (ST) and spoken question answering (SQA) together with much simpler architectures. In this paper, we utilize the capabilities of the Whisper encoder and the pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with a one-layer module and hundreds of hours of speech-text multitask corpus. We further swap Yi-6B with its human-preference-aligned version, Yi-6B-Chat, during inference, and discover that the alignment capability carries over as well. In addition, the alignment subspace revealed by singular value decomposition (SVD) implies that the linear alignment subspace is sparse, which leaves open the possibility of concatenating other features such as voice-print or video to expand modality.

keywords:
speech-text bimodal LLM, decoder-only, spoken translation, speech recognition

1 Introduction

LLMs have received much attention in recent years. The powerful capabilities of ChatGPT[1] have brought unprecedented breakthroughs in the natural language processing (NLP) field. Gradually, using a single model to solve multiple tasks has become the mainstream approach. Vision large language models have applied this principle to various vision tasks[2, 3, 4, 5, 6]. For the speech modality, some studies have shown that it is feasible to interact with an LLM through speech. AudioGPT[7] and HuggingGPT[8] have made preliminary attempts. They employ a cascade method to integrate automatic speech recognition (ASR), text-to-speech (TTS), and other recognition/generation tasks. The key concept is to use the LLM as an intermediate interface that distributes tasks by calling the appropriate models. Because the LLM is trained on text, information carried only by speech, such as emotion and tone in the human voice, is hardly recognized. By discretizing the speech signal into token sequences and expanding the LLM with them, SpeechGPT[9] enables seamless text-speech interaction, with a vocoder model for speech synthesis. However, this method requires retraining the LLM to support the additional tokens. Moreover, some works achieve similar results by concatenating speech and text features as the prompt of the LLM. LLaSM[10] uses Whisper[11] and Chinese-LLAMA2-7B (https://huggingface.co/LinkSoul/Chinese-Llama-2-7b) as the speech encoder and LLM, with two training stages. In the first stage, an ASR dataset is used for adaptor pre-training. In the second stage, the adaptor and the language model are updated for cross-modal instruction fine-tuning. Speech-LLaMA[12] does not use Whisper; instead, it trains 4 Transformer layers as the audio encoder to complete ST tasks in 13 languages with LLaMA[13]. Whisper and Qwen (https://huggingface.co/Qwen/Qwen1.5-7B) are used in Qwen-Audio[14], which is also trained in two stages. In the first stage, the Qwen LLM is frozen and multi-task audio data is used to train Whisper. In the second stage, multi-round dialogue data is used to produce an interactive chat model that accepts input from diverse audio and text sources.

Previous works have excelled in aligning the speech and text modalities, but most of them require retraining the speech encoder with a large amount of data to improve its representation ability, and then fine-tuning the LLM with instruction data to achieve better performance. This brings a large computational overhead and is difficult to implement when data resources are scarce. Besides, those training strategies are tied to particular models and must be repeated for each combination of speech and text foundation models. Should a model need to be replaced, the alignment process would have to be carried out again, leading to significant training and deployment costs.

This paper raises several questions in response to this situation. Does training a modal alignment module require retraining the speech and text models? Does aligning the speech and text modalities require only a simple alignment module, or even a single linear layer? Does training the alignment module require massive amounts of data? Is the trained alignment module transferable, so that the LLM can be replaced by one with better performance? Furthermore, what kind of knowledge do the features contain after being mapped by the alignment module? Extending existing work, we investigate the sufficiency of each component separately, namely the size of the alignment module, the amount of training data, the transferability of the alignment module across LLMs, and the information contained in the alignment module, all of which are rarely explored in current work.

Figure 1: An overview of our proposed speech-text bimodal architecture. The alignment module maps the speech features into the text feature space. The speech encoder is frozen at all times. The LLM embedding layer extracts text features from the prompt. The speech and text modal features are concatenated as the LLM's input.

We propose a linear layer after the speech encoder as the modal alignment module, using open source models and corpora to achieve ASR, ST, SQA and text question answering (QA) multitasking in Mandarin. First, we conjecture that the pre-trained speech encoder and LLM already have strong speech and text capabilities, so we explore connecting them through a single-layer alignment module. We choose the Whisper encoder to extract speech features and keep its parameters frozen in order to reduce training overhead. Yi-6B (https://huggingface.co/01-ai/Yi-6B), which has the LLaMA[13] decoder-only structure, is selected as the LLM. A linear layer is chosen as the modal alignment module to map the speech features output by Whisper into the text feature space. The LLM is frozen during the alignment module training phase. In addition, we explore the transferability of the alignment module. After the alignment module has been trained to align speech and text, both Whisper and the alignment module are frozen, and we replace the Yi-6B model with a supervised fine-tuning (SFT) version that is aligned with human preferences. This updated model is validated on ST tasks, resulting in significant performance improvements. Finally, in order to further explore the alignment subspace, we use SVD analysis, which reveals information redundancy. Our contributions can be summarized as follows:

  • We add and train only a single alignment layer between the LLM and the speech encoder to achieve ASR, ST, SQA and QA with open source models and data. The alignment module needs only a small amount of data to elicit modal alignment capabilities.

  • The trained alignment module has strong scalability. The LLM can be replaced with an SFT model from the same source, which has better instruction following and human preference capabilities, without additional training, further improving the performance on specific tasks such as ST and SQA.

  • Preliminary analysis of the features after alignment mapping revealed information redundancy. Gradually reducing the dimension of modal alignment mapping revealed that a small reduction in feature dimension has only a slight impact on model performance. This provides insights for future feature concatenation, such as voiceprint features or video features.

2 Approach and Experiment Setup

2.1 Model Architecture

Figure 1 shows the model structure, including the speech encoder, the modal alignment module and the LLM. Given the paired data $(s, x)$, where $s$ and $x$ denote the speech features mapped by the alignment module and the text features extracted by the LLM's embedding layer respectively, the training objective is to maximize the next-token probability

$$P_{\theta}(x_t \mid x_{<t}, \mathrm{Alignment}_{\phi}(s)),$$

where $\theta$ and $\phi$ denote the parameters of the large language model and the alignment module, respectively.
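As a reading aid, the objective can be written out over a whole answer sequence. The form below is a sketch that assumes a standard autoregressive log-likelihood with $\theta$ kept frozen while the alignment module is trained (Section 2.3.1); the sequence length $T$ and the symbol $\phi^{*}$ are notation introduced here for illustration and do not appear in the original formulation.

```latex
\phi^{*} = \arg\max_{\phi} \sum_{t=1}^{T} \log P_{\theta}\bigl(x_t \mid x_{<t}, \mathrm{Alignment}_{\phi}(s)\bigr)
```

Maximizing this sum is equivalent to minimizing the token-level cross-entropy loss mentioned in Section 2.3.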

Speech Encoder The speech encoder uses the encoder module of Whisper large-v3 (https://huggingface.co/openai/whisper-large-v3), which is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2[11]. The encoder module accepts a 128-dimensional mel-spectrogram as input and produces an output with a dimension of 1280.

Large Language Model The open source Yi-6B is selected as the LLM. It is a bilingual language model that supports Chinese and English, which is trained on a 3T multi-language corpus. Yi-6B is a 32-layer transformer decoder-only structure with a hidden size of 4096.

Alignment Module Modal alignment module is a linear layer that maps the features output by the Whisper encoder to the LLM text modality, with an input dimension of 1280 and an output dimension of 4096.
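To make the dimensions concrete, the following is a minimal NumPy sketch of a single linear layer that maps 1280-dimensional speech features to 4096-dimensional LLM-space features and concatenates them with text embeddings. The random weights, the bias term, the frame and token counts, and the simple "speech first, then text" ordering are all illustrative assumptions; the actual model places the speech features between special tokens as described in Section 2.2.

```python
import numpy as np

SPEECH_DIM = 1280  # Whisper large-v3 encoder output size (from the text)
LLM_DIM = 4096     # Yi-6B hidden size (from the text)

rng = np.random.default_rng(0)

# Hypothetical parameters of the single linear layer (the only trainable part).
W = rng.standard_normal((SPEECH_DIM, LLM_DIM)).astype(np.float32) * 0.02
b = np.zeros(LLM_DIM, dtype=np.float32)  # having a bias is an assumption


def align(speech_features: np.ndarray) -> np.ndarray:
    """Map (frames, 1280) speech features to (frames, 4096) LLM-space features."""
    return speech_features @ W + b


# Random stand-ins for the frozen models' outputs:
# 50 speech frames from the encoder and 12 prompt-token embeddings from the LLM.
speech = rng.standard_normal((50, SPEECH_DIM)).astype(np.float32)
prompt_embeddings = rng.standard_normal((12, LLM_DIM)).astype(np.float32)

aligned = align(speech)
llm_input = np.concatenate([aligned, prompt_embeddings], axis=0)
print(aligned.shape, llm_input.shape)
```

With these sizes, the layer holds roughly 1280 × 4096 ≈ 5.2M weights, which is small relative to the frozen encoder and LLM.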

Figure 2: Cases of speech and plain text input

2.2 Prompt design

Since a speech-text bimodal LLM must support both audio and text inputs, we design a data format inspired by Whisper[11] and Qwen-Audio[14]. In the input sequence, “<|Human|>” is a special token indicating that the content that follows is provided by a human. The next special token is “<|startofaudio|>”, after which the audio content is inserted. It is followed by the special token “<|endofaudio|>”, which marks the end of the speech content. When the input has no speech content, the region enclosed by “<|startofaudio|>” and “<|endofaudio|>” is empty. The next special token is {task}, which specifies the generation task, followed by the {prompt} for the LLM. Each task has its own {prompt}. For the ASR task, the prompt is “recognize the content in the speech”. For ST, it is “Translate audio content into English”. The prompt for spoken QA is “Answer the question in the audio”. The plain text task takes its prompt from the question presented in the description. After {prompt}, the special token “<|Assistant|>” indicates that the subsequent content generated by the model is the label content of the current sample. Figure 2 shows cases with speech and plain text input.
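The layout described above can be sketched as a small string template. This is an illustration only: the `[AUDIO]` placeholder, the task names `ASR`/`ST`/`SQA`/`QA`, and the `<|TASK|>` rendering of the {task} token are hypothetical, since the exact token strings for {task} are not given in the text.

```python
AUDIO_PLACEHOLDER = "[AUDIO]"  # hypothetical marker for where aligned speech features go

# Prompts quoted from the text; the dictionary keys are hypothetical task names.
TASK_PROMPTS = {
    "ASR": "recognize the content in the speech",
    "ST": "Translate audio content into English",
    "SQA": "Answer the question in the audio",
}


def build_input(task: str, has_audio: bool, question: str = "") -> str:
    """Assemble the input sequence up to the point where the model starts generating."""
    audio = AUDIO_PLACEHOLDER if has_audio else ""  # empty region for text-only input
    prompt = TASK_PROMPTS.get(task, question)       # plain-text QA uses the question itself
    return (
        f"<|Human|><|startofaudio|>{audio}<|endofaudio|>"
        f"<|{task}|>{prompt}<|Assistant|>"
    )


print(build_input("ST", has_audio=True))
print(build_input("QA", has_audio=False, question="What is SVD?"))
```

In the real pipeline the placeholder would be replaced by the sequence of aligned speech feature vectors rather than by text.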

2.3 Training strategy

The training process consists of the following steps. We first extract fbank features from the audio data with Whisper's default configuration and generate speech features with the speech encoder. The speech features then pass through the alignment module and are concatenated with the LLM text embeddings of the prompt and the start of the answer. Finally, the alignment module is optimized with a cross-entropy loss.
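The final step can be illustrated with a small NumPy sketch of a token-level cross-entropy loss. Restricting the loss to answer positions through `answer_mask` is an assumption made here for illustration; the text only states that a cross-entropy loss is used. The function name and toy shapes are hypothetical.

```python
import numpy as np


def answer_cross_entropy(logits: np.ndarray, targets: np.ndarray,
                         answer_mask: np.ndarray) -> float:
    """Mean negative log-likelihood over positions where answer_mask is True.

    logits: (T, V) next-token scores, targets: (T,) token ids,
    answer_mask: (T,) booleans marking answer tokens (assumed masking scheme).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * answer_mask).sum() / answer_mask.sum())


# Toy example: 4 positions, vocabulary of 3, uniform predictions.
logits = np.zeros((4, 3))
targets = np.array([0, 2, 1, 1])
mask = np.array([False, False, True, True])
loss = answer_cross_entropy(logits, targets, mask)
print(loss)  # equals ln(3) for uniform predictions over 3 tokens
```

In training, only the alignment module's parameters would receive gradients from this loss, since the encoder and LLM are frozen.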

2.3.1 Modal alignment

Whisper has excellent performance on ASR and ST tasks, and its encoder has strong semantic representation capabilities. As an LLM base model, Yi-6B demonstrates robust language capabilities thanks to extensive pre-training on vast amounts of text data. Given the rich representations of the text and speech foundation models, we explore the possibility of achieving modality alignment with a single linear layer. In this stage of training, we freeze the parameters of the speech encoder and the LLM, and use the ASR, ST, and SQA data to train the modal alignment module. We also explore how much data the modal alignment module requires to elicit modal alignment capabilities.

2.3.2 Extensibility of the alignment module

We also explore whether the alignment module is transferable. Based on 2.3.1, we keep the speech encoder and alignment module untouched, swap the original LLM with a homologous SFT model that has stronger instruction following and human preference capabilities, and test it on ST data.

2.3.3 Alignment mapping feature analysis

In order to analyze the feature content of the alignment module after mapping, we use the SVD algorithm to perform feature decomposition.

$$A_{m \times n} = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^{T}$$

$A$ is the speech feature matrix mapped by the modal alignment module; $m$ and $n$ are the time dimension and the hidden size of the LLM, respectively. $\Sigma$ is an $m \times n$ matrix whose entries are all 0 except for those on the main diagonal. Each element on the main diagonal is a singular value, and elements closer to the top are more important. We retain only the top part of the $\Sigma$ values to explore whether erasing feature information has a large impact on model performance.
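The truncation step can be sketched with NumPy as follows. This is a minimal illustration under the assumption that "retaining the top part" means zeroing all but the k largest singular values of the feature matrix and reconstructing it; the function name `keep_top_k` and the toy matrix size are hypothetical.

```python
import numpy as np


def keep_top_k(features: np.ndarray, k: int) -> np.ndarray:
    """Reconstruct a (time, hidden) matrix from its k largest singular values."""
    # np.linalg.svd returns singular values sorted in descending order.
    U, S, Vt = np.linalg.svd(features, full_matrices=False)
    S_trunc = S.copy()
    S_trunc[k:] = 0.0              # erase everything below the top-k
    return (U * S_trunc) @ Vt      # scale the columns of U, then project back


rng = np.random.default_rng(0)
A = rng.standard_normal((60, 128))  # toy stand-in for an (m, n) aligned feature matrix
A_k = keep_top_k(A, 10)
print(A_k.shape, np.linalg.matrix_rank(A_k))
```

The reconstructed matrix keeps its original shape, so it can be fed to the LLM unchanged while carrying at most rank-k information.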

2.4 Experiments setup

2.4.1 Experimental data

For the speech recognition task, we use the open source aishell[15] and WenetSpeech[16] data. For the translation task, we use the wmt19[17] Chinese-English data set, with a total of 310k items. For the QA task, we use the Alpaca-zh[18] dataset, which has a total of 48k items. Both the ST and SQA tasks use a self-developed TTS model to generate audio from the text data, with speaker information selected at random to ensure diversity of speech timbres. Many excellent open source TTS models are also available, such as Bark-TTS (https://github.com/suno-ai/bark). The synthesized data amounts to about 643 hours for wmt19 and about 60 hours for Alpaca.

We build the following multi-task dataset:

  • dataset1. 90 hours aishell, 100 hours wmt19 and 30 hours Alpaca.

  • dataset2. 178 hours aishell, 200 hours wmt19 and 60 hours Alpaca.

  • dataset3. 178 hours aishell, 200 hours WenetSpeech, 200 hours wmt19 and 60 hours Alpaca.

For the test set, ASR task uses the test set of aishell2[19]. ST task uses the test set of wmt19.

2.4.2 Parameter settings

The speech encoder uses the encoder module of Whisper large-v3, the LLM is Yi-6B, and the modal alignment module is a single linear layer.

When training the modal alignment module, we freeze both the speech encoder and the LLM parameters and update only the parameters of the modal alignment module. The learning rate is set to 1e-3, the batch size is set to 128, and A800-40G is used for training.

LoRA[20] is used when fine-tuning the LLM, with the speech encoder and alignment module frozen. The learning rate is set to 1e-4, and the batch size is set to 128. For the LoRA parameters, $r$ is set to 16 and $\alpha$ is set to 32.
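For readers unfamiliar with how $r$ and $\alpha$ enter the computation, the sketch below shows the standard LoRA forward pass from [20], in which a frozen weight is augmented with a low-rank update scaled by $\alpha / r$. The toy layer sizes and variable names are hypothetical, and which LLM weight matrices receive LoRA adapters is not stated in the text.

```python
import numpy as np

R, ALPHA = 16, 32     # values from the text
SCALE = ALPHA / R     # LoRA scales the low-rank update by alpha / r


def lora_linear(x, W0, A, B):
    """y = x W0^T + (alpha / r) * x A^T B^T, with W0 frozen and A, B trainable."""
    return x @ W0.T + SCALE * (x @ A.T) @ B.T


rng = np.random.default_rng(0)
d_in, d_out = 64, 48                         # toy sizes, not the Yi-6B dimensions
W0 = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((R, d_in)) * 0.01    # Gaussian initialisation
B = np.zeros((d_out, R))                     # zero initialisation: no change at start

x = rng.standard_normal((5, d_in))
y_init = lora_linear(x, W0, A, B)

# After training, B is non-zero; the weight update still has rank at most r.
B_trained = rng.standard_normal((d_out, R))
delta_W = SCALE * B_trained @ A
print(SCALE, np.linalg.matrix_rank(delta_W))
```

With $r = 16$ and $\alpha = 32$ the update is scaled by a factor of 2, and at initialisation the adapted layer reproduces the frozen layer exactly.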

For audio, all the data are 16 kHz single-channel in wav format. The fbank feature uses a window size of 25 ms and a hop size of 10 ms.
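The window and hop sizes translate into sample counts as shown in this short sketch. The frame-count formula assumes no padding at the signal edges, which is a simplification; actual feature extractors may pad or truncate differently.

```python
SAMPLE_RATE = 16_000        # Hz, from the text
WINDOW_MS, HOP_MS = 25, 10  # from the text

window = SAMPLE_RATE * WINDOW_MS // 1000  # samples per analysis window
hop = SAMPLE_RATE * HOP_MS // 1000        # samples between window starts


def num_frames(num_samples: int) -> int:
    """Number of full windows that fit in the signal (no edge padding assumed)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


print(window, hop, num_frames(SAMPLE_RATE))  # one second of audio
```

Under this assumption, one second of audio yields 98 frames of 400 samples each, spaced 160 samples apart.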

For evaluation, the ASR task is measured by character error rate (CER) and the ST task by ROUGE-L.
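Both metrics can be sketched in a few lines of plain Python. CER is the character-level edit distance divided by the reference length, and ROUGE-L is based on the longest common subsequence (LCS). The balanced F1 variant used below and whitespace tokenisation are assumptions; the text does not specify which ROUGE-L variant or tokeniser was used.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(ref_tokens, hyp_tokens) -> float:
    """ROUGE-L as balanced F1 of LCS precision and recall (assumed variant)."""
    lcs = lcs_length(ref_tokens, hyp_tokens)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp_tokens), lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(cer("今天天气很好", "今天天汽很好"))
print(rouge_l_f1("the cat sat on the mat".split(), "the cat is on the mat".split()))
```

In the first example one of six characters is substituted, and in the second the LCS covers five of six tokens on both sides.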

3 Results and Analysis

Table 1: ROUGE-L (%) and CER (%) scores of the alignment module experiments for Yi-6B and Yi-6B-Chat. CER is evaluated on AISHELL-2 and ROUGE-L on WMT19.

| Metric | LLM | Training dataset 1 | Training dataset 2 | Training dataset 3 |
|---|---|---|---|---|
| ROUGE-L (↑) | Yi-6B | 29.378 | 31.392 | 27.916 |
| ROUGE-L (↑) | Yi-6B-Chat | 33.180 | 33.844 | 30.660 |
| ROUGE-L (↑) | Yi-6B-LoRA | 29.697 | 31.512 | 27.877 |
| CER (↓) | Yi-6B | 11.071 | 9.418 | 8.429 |
| CER (↓) | Yi-6B-Chat | 12.753 | 14.824 | 8.616 |
| CER (↓) | Yi-6B-LoRA | 11.021 | 9.321 | 8.251 |

In this section, we first go through the evaluation results for speech recognition and translation. We then perform a deeper analysis of how the alignment module behaves during inference.

3.1 Evaluation

The Yi-6B rows of Table 1 show our alignment method under three dataset configurations. All results are compared internally in order to investigate how the linear alignment module behaves across different corpus sizes and compositions, and they suggest that our method achieves speech-text modality alignment. For CER, we observe an incremental improvement from 11.071 to 8.429. On the other hand, the alignment module trained on dataset 2 has the best ROUGE-L score (31.392), whereas the dataset 3 setting scores 3.476 lower (27.916), worse than the result from dataset 1 (29.738), even though dataset 3 contains much more speech data. These observations suggest that adding training data for one task can indeed improve performance on that task, while unbalanced data significantly degrades the tasks with fewer utterances. Hundreds of hours of audio data are enough to elicit the alignment capability of a module with only one linear layer. Beyond evaluating the training results, we urge future modality-alignment work to pay closer attention to constructing balanced Speech-LLM alignment training data.

3.2 Alignment module’s transferability across LLMs

After the alignment module is trained, we swap the Yi-6B model with Yi-6B-Chat (https://huggingface.co/01-ai/Yi-6B-Chat), which is fine-tuned on a human-preference dataset, to investigate the module's behavior across choices of LLM. The comparison across LLMs in Table 1 suggests that the alignment module can still align the speech and text modalities even when the LLM has been fine-tuned for specific tasks. However, since an LLM fine-tuned for chat prefers to generate content semantically related to the prompt, there is a significant improvement on the speech translation task, with ROUGE-L rising by around 3.0 on average, while there is a non-negligible degradation of speech recognition capability, with the character error rate increasing by around 2.4 on average. On closer examination, we find that most of the wrong transcriptions are semantically the same as the reference text but differ in pronunciation. In addition, we use LoRA to fine-tune Yi-6B with the same data used for the alignment module and observe only a minimal improvement in model performance. We conjecture that, with more and better balanced data, invariance between the alignment module and the LLM may push future speech-text alignment modules to become independent of the speech encoder and language model. Compared with LoRA-based alignment techniques, such an approach can achieve one-for-all alignment without fine-tuning for different LLM variants. When speech data resources are scarce, we can focus on more easily accessible text data and SFT of the LLM to improve the performance of the speech-text LLM on specific tasks.
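The average gaps quoted above can be reproduced directly from the Yi-6B and Yi-6B-Chat rows of Table 1. The short script below is only a worked check of that arithmetic; the dictionary layout and function name are illustrative.

```python
# Scores copied from Table 1, ordered by training dataset 1, 2, 3.
rouge_l = {
    "Yi-6B": [29.378, 31.392, 27.916],
    "Yi-6B-Chat": [33.180, 33.844, 30.660],
}
cer = {
    "Yi-6B": [11.071, 9.418, 8.429],
    "Yi-6B-Chat": [12.753, 14.824, 8.616],
}


def mean_gap(table: dict) -> float:
    """Average of (Yi-6B-Chat minus Yi-6B) over the three training datasets."""
    gaps = [chat - base for base, chat in zip(table["Yi-6B"], table["Yi-6B-Chat"])]
    return sum(gaps) / len(gaps)


print(round(mean_gap(rouge_l), 1))  # 3.0
print(round(mean_gap(cer), 1))      # 2.4
```

The per-dataset CER gaps are far from uniform (about 1.7, 5.4 and 0.2), so the average is dominated by the dataset 2 configuration.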

3.3 Alignment Feature Analysis

Table 2: ROUGE-L (%) and CER (%) scores for top-k SVD decomposition inference under the training dataset 2 configuration.

| Top-k singular vectors | None | 1000 | 300 | 200 | 100 | 50 |
|---|---|---|---|---|---|---|
| ROUGE-L (↑) | 31.4 | 31.3 | 31.4 | 30.0 | 3.89 | 0.31 |
| CER (↓) | 9.42 | 9.61 | 9.45 | 9.60 | 11.7 | 70.8 |

Table 3: ROUGE-L (%) and CER (%) scores for the alignment module with various trainable dimensions under the training dataset 2 configuration.

| Trainable dimension | 4096 | 3072 | 2048 | 1024 |
|---|---|---|---|---|
| ROUGE-L (↑) | 31.392 | 31.144 | 24.958 | 22.011 |
| CER (↓) | 9.418 | 9.365 | 14.447 | 19.486 |

We first take the top-k singular vector decomposition of the linear alignment module during inference to measure how much information the alignment space carries. Table 2 reveals that ROUGE-L and CER change little when the top 200 or more singular vectors are kept, while the scores degrade suddenly below that point: going from top-200 to top-50, CER rises from 9.60 to 70.8 and ROUGE-L falls from 30.0 to 0.31. This observation implies that the aligned space can have a much lower rank than the full 4096-dimensional LLM space. As a supplementary experiment, we constrain the alignment module to $\{3072, 2048, 1024\}$ trainable dimensions and fill the remaining dimensions with 0 during training. Table 3 shows a negligible change in CER ($9.418 \rightarrow 9.365$) and ROUGE-L ($31.392 \rightarrow 31.144$) when the trainable dimension decreases from 4096 to 3072, while there is a significant drop from 3072 to 1024 ($9.365 \rightarrow 19.486$ for CER and $31.144 \rightarrow 22.011$ for ROUGE-L). Given that the number of learnable dimensions bounds the highest possible rank of the alignment subspace, this observation further implies that the alignment space might be less complicated than the text subspace described by the LLM, which may allow further simplification.
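The constrained-dimension experiment can be pictured with the following NumPy sketch, in which a narrower linear layer produces d output dimensions and the rest of the 4096-dimensional vector is filled with zeros. Placing the trainable dimensions first and the zeros last is an assumption made for illustration, as are the function name and the toy inputs; the text does not specify which positions are zero-filled.

```python
import numpy as np

SPEECH_DIM, LLM_DIM = 1280, 4096  # sizes from the text


def constrained_align(speech: np.ndarray, W_small: np.ndarray) -> np.ndarray:
    """Project to d trainable dimensions, then zero-fill up to the LLM hidden size."""
    d = W_small.shape[1]
    projected = speech @ W_small                                   # (frames, d)
    padding = np.zeros((speech.shape[0], LLM_DIM - d), dtype=projected.dtype)
    return np.concatenate([projected, padding], axis=1)            # (frames, 4096)


rng = np.random.default_rng(0)
speech = rng.standard_normal((20, SPEECH_DIM))          # toy encoder output
W_small = rng.standard_normal((SPEECH_DIM, 3072)) * 0.02  # hypothetical 3072-dim layer
out = constrained_align(speech, W_small)
print(out.shape, float(np.abs(out[:, 3072:]).max()))
```

The output keeps the shape the LLM expects, while everything outside the d trainable dimensions carries no information.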

4 Conclusion

We explore the capability of speech-text multitasking by training a linear alignment module between the Whisper and Yi-6B models. Results on ASR and ST reveal that speech-text alignment can be achieved with such a module and that the balance of the dataset significantly impacts the capability on each task. Furthermore, the transferability between Yi-6B and Yi-6B-Chat and the sparsity of the alignment module's space suggest that the approach is broadly applicable and can be extended to more tasks. Future work is required to investigate its further potential, such as integrating video as a third input or handling other acoustic tasks.

References

  • [1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20.   Red Hook, NY, USA: Curran Associates Inc., 2020.
  • [2] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. a. Bińkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language model for few-shot learning,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 23 716–23 736. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf
  • [3] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, 23–29 Jul 2023, pp. 19 730–19 742. [Online]. Available: https://proceedings.mlr.press/v202/li23q.html
  • [4] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 34 892–34 916. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf
  • [5] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 8748–8763. [Online]. Available: https://proceedings.mlr.press/v139/radford21a.html
  • [6] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” 2023.
  • [7] R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J. Huang, J. Liu, Y. Ren, Z. Zhao, and S. Watanabe, “Audiogpt: Understanding and generating speech, music, sound, and talking head,” 2023.
  • [8] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face,” in Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 38 154–38 180. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/77c33e6a367922d003ff102ffb92b658-Paper-Conference.pdf
  • [9] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds.   Singapore: Association for Computational Linguistics, Dec. 2023, pp. 15 757–15 773. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.1055
  • [10] Y. Shu, S. Dong, G. Chen, W. Huang, R. Zhang, D. Shi, Q. Xiang, and Y. Shi, “Llasm: Large language and speech model,” 2023.
  • [11] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, 23–29 Jul 2023, pp. 28 492–28 518. [Online]. Available: https://proceedings.mlr.press/v202/radford23a.html
  • [12] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu, and Y. Wu, “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
  • [13] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,” 2023.
  • [14] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” 2023.
  • [15] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017, pp. 1–5.
  • [16] B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng, D. Wu, and Z. Peng, “Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6182–6186.
  • [17] N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov, “Facebook fair’s wmt19 news translation task submission,” in Conference on Machine Translation, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:196621535
  • [18] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds.   Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 13 484–13 508. [Online]. Available: https://aclanthology.org/2023.acl-long.754
  • [19] J. Du, X. Na, X. Liu, and H. Bu, “Aishell-2: Transforming mandarin asr research into industrial scale,” 2018.
  • [20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR 2022, April 2022. [Online]. Available: https://www.microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models/