License: arXiv.org perpetual non-exclusive license
arXiv:2401.08992v1 [cs.CL] 17 Jan 2024

Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

Abstract

End-to-end ASR models are often preferred in the streaming multilingual scenario because they are easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones. Sometimes even the data itself may become unavailable as a result of enhanced privacy protection. Existing work tends to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in streaming multilingual ASR. The adapter accounts for only 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during the finetuning process with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings a 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of full model finetuning, thus greatly alleviating the asynchronous peak performance issue.

Index Terms—  Streaming Multilingual ASR, Adapter Finetuning

1 Introduction

End-to-end multilingual automatic speech recognition (MASR) is an active research topic where a single speech model is trained to recognize multiple languages [1, 2, 3]. A single MASR system is often more cost-efficient to deploy than a large number of monolingual models [4]. However, training an end-to-end MASR system is not trivial. Different languages have distinct vocabularies and data abundances. A single model may not have enough capacity to accommodate all languages/locales [5]. Recent efforts focus on increasing the model size. For instance, USM [5] reaches a model size of 2B parameters and MMS [6] has over 1B parameters. These large-scale models can often serve as foundation models for non-streaming scenarios [7, 8]. It is desirable to utilize their large capacity with minimal changes for different languages, especially the tail ones.

Nevertheless, while these efforts have incorporated more languages and shown improved quality, it is still not clear how the performance varies across all languages. Many existing papers select and report the checkpoint with the lowest average word error rate (WER) across all the tested languages. Such a WER is often driven by only a portion of the languages, at the expense of other locales that are under-fit or over-fit at that checkpoint. For example, in a recent MASR model [9], the best WER on Portuguese is achieved at 100K steps, while Polish is already over-fit and Dutch is still under-fit at this checkpoint. As a trade-off, the checkpoint at 100K is selected since its average WER is the lowest. It is hard to ensure the model can reach optimal performance for all languages at the same checkpoint. We call this problem the asynchronous peak performance issue. To this end, prior work such as MMS [6] increases the model size and inserts language-specific adapters and heads during finetuning for each language. However, models with billions of parameters often have latency concerns and are not suitable for applications that require streaming processing [10]. Also, each language-specific adapter is trained for 2K steps uniformly, ignoring the heterogeneity across languages. Language-specific heads add up to 2% extra parameters per language, which results in an extra burden for deployment. On the other hand, ASR transducer models with cascaded encoders have been shown to be a better fit for processing streaming speech [11] by achieving a good trade-off between latency and quality. A recent study [12] proposed a language-agnostic 1st pass and a language-aware 2nd pass under the cascaded architecture with promising results. However, this work requires a separate non-causal encoder and decoder for every language, resulting in a total model size of 0.6B.

To address the aforementioned issues, in this work, we introduce Language-Dependent Adapters (LDAs), which can achieve significant improvements for streaming MASR models with cascaded encoders. Based on a frozen pre-trained foundation model, we add and train the LDAs for tail languages. Since the backbone model is frozen, we can collect and gather adapter weights from different steps. During the LDA finetuning, an off-the-shelf well-trained foundation model [4, 13] is adopted as the backbone, which is a stack of Conformer layers [14]. LDAs are inserted as a residual pass at the end of each Conformer layer. The input to the next Conformer layer is the combination of the output from the previous layer and the LDA output. We support a batch of mixed languages during finetuning by adding language ID to each utterance [15]. Language IDs are converted to one-hot vectors to select the corresponding language-dependent weights in each LDA module. Therefore, each utterance would only update the adapter for the corresponding language without interfering with other languages. During training, the foundation model is frozen. For each language, the related LDA weights only take up 0.4% of the full model size, and we do not add extra projection heads for separate languages, maintaining a total size of 0.2B. After the training process is completed, we select the checkpoints with the best performances for each language. Since they all share the same foundation model, we simply extract LDA weights from those checkpoints and merge them to compose the final LDA module. Therefore, by assembling the checkpoints of peak performances into a single one, we can keep most merits of a single end-to-end model like deployability and low latency while improving the quality. Meanwhile, such an asynchronous strategy eases the pressure of carefully balancing the utterances from different languages during training. 
Our LDA follows the prior adapter designs [16, 17, 6, 18] but is adapted neatly to fit the MASR domain. Figure 1 illustrates an overview of LDA.

During the optimization of LDA, we also incorporate noisy student training (NST) [19, 20] to utilize the unlabeled data. In our task, NST starts with a non-streaming ASR teacher model and iteratively generates pseudo-labels for unlabeled utterances, as well as a series of student ASR models, with the help of a fixed language model (LM) trained separately. The final student during this iterative process would be our streaming LDA model.

Our MASR model with LDAs is validated on a challenging dictation benchmark, specifically on tail languages including Latin, Greek, Arabic, Cyrillic, etc. On average, our model can bring 12.2% relative improvement on WER among all the 39 languages. If we narrow down the scope to the 27 languages with higher priority in the online traffic, the improvement can rise to 17%. On individual tail languages like Hungarian and Serbian, the WER reduction can even reach 35%, which implies that without comprehensive weight updates or architecture re-design, there is still considerable room to improve in the streaming MASR model. Moreover, we demonstrate that LDA finetuning can achieve performance on par with the full model finetuning where all the parameters including the foundation model are refined for individual languages, which further proves the value of adapter finetuning for streaming MASR: (1) a small number of learnable parameters, (2) flexibility of merging different checkpoints, and (3) promising WER reductions.

2 Related Work

Our framework builds on many prior studies. In [21], Meta-Adapter uses meta-learning to implicitly transfer the learned knowledge from source languages to one unseen target language by updating the adapters through gradients of languages sampled from training data. SimAdapter [17] further adopts an attention mechanism to learn the similarity between source and target languages, forcing the general knowledge transferred from the pre-trained model to the tested language. This work focuses on the monolingual transfer and yet the improvement is not clear when the multilingual task scales up. [22] attempts to learn a common adapter to distill the language-agnostic information, and to facilitate language-specific updates from a loss perspective. The concern for these methods emerges from the overhead during deployment, and the difficulty in learning the common adapter soars when the number of languages increases. Unlike many previous methods that focus on monolingual finetuning, our LDA method targets 39 languages and supports mixed language batch finetuning, which further facilitates the adapter tuning and NST when the number of languages grows. Besides, most of these prior works explore the non-streaming scenario, while our primary focus is the streaming case. We further compare the adapter finetuning with the full model finetuning to validate the capacity of the adapter-based methods, which complements many prior studies. This model also extends from the previous paper [23] but with more sophisticated Conformer architecture and 4 times more tail languages. The training data amounts of our 39 tail languages are all less than 4% of those high-resource languages like English. Other efforts towards this direction include utilizing self-supervision and pretraining for low-resource languages [24], refined sampling to balance different richness [25], and mixture of experts [26, 27].

3 Method

Fig. 1: An overview of LDA in a Conformer model with cascaded encoders. LDAs are inserted between two consecutive Conformer layers for both 1st and 2nd passes. Each LDA module contains a stack of language-dependent parameters.
Fig. 2: The improvements brought by our method compared to the baseline which is also the existing launched model on the dictation dataset. The blue bars demonstrate the WERs on each language given by our model. The yellow bars highlight the WER reduction outperforming the baseline. The combination of yellow and blue bars denotes the baseline WERs. As shown in the figure, we can achieve significant gains on most languages. On Slovak, the gain can reach up to 37.5%. On average, the improvement on all locales is 12.2%.

Our backbone foundation model follows a cascaded framework [12] using a Conformer transducer [14], with a causal 1st pass and a non-causal 2nd pass. Unlike [12], both passes are shared across all the languages. The adapters are only inserted to the encoders. The inserted lightweight LDAs contain language-specific parameters during the adapter-only finetuning.

3.1 Adapter Module

The LDA architecture resembles the existing adapter modules [16]. The output from the previous layer $x_{i-1}\in\mathbb{R}^{B\times T\times d}$ is passed to the inserted adapter module, where $B$ is the batch size, $T$ is the utterance length and $d$ is the feature dimension. Note that under the streaming scenario, the right context for each time step is masked out during training. Each utterance is associated with a language ID, so we also have $l_{i-1}\in\mathbb{N}^{B\times K}$ where $K$ denotes the total number of languages (in our case, $K=39$). The down projection matrix $D\in\mathbb{R}^{Kd\times h}$ maps $x_{i-1}$ to a lower-dimensional space ($h\ll d$) where $h$ is the hidden dimension and it is typically much smaller than $d$. Similarly, the up projection matrix $U\in\mathbb{R}^{Kh\times d}$ projects the hidden representations back to the $d$-dim space. 
$l_{i-1}$ would firstly be used to select the language-dependent weights from $D$ and $U$ for the input $x_{i-1}$, $D_{x}\in\mathbb{R}^{B\times d\times h}$ and $U_{x}\in\mathbb{R}^{B\times h\times d}$. Then the output is given by

$$x_{i}^{\prime}=U_{x}(\textsc{ReLU}(D_{x}(\textsc{LN}(x_{i-1}))))+x_{i-1} \qquad (1)$$

where $\textsc{LN}(\cdot)$ represents the pre-LayerNorm [28], a standard normalization to standardize the inputs of the adapter module, and $\textsc{ReLU}(\cdot)$ is the activation function in the hidden space. $x_{i}^{\prime}$ is the input for the next layer. We can also further add bias terms $D_{b}\in\mathbb{R}^{K\times h}, U_{b}\in\mathbb{R}^{K\times d}$ for the down projection and up projection respectively. During LDA finetuning using the Adam optimizer, only $D, U, D_{b}, U_{b}$ are trainable and the backbone foundation model remains frozen. Under this framework, LDA supports batch training with mixed languages without concerning their mutual interference. Besides, one can mix the utterances from the full set or a subset. Peak performance can be achieved from different runs and the language-dependent weights from them can be merged together to compose a single end-to-end model.
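To make the shape bookkeeping of Eq. (1) concrete, the following is a minimal NumPy sketch of one LDA module applied to a mixed-language batch. It is an illustration under stated assumptions, not the authors' Lingvo implementation: the function names `layer_norm` and `lda_forward` are hypothetical, the projection matrices are stored as `[K, d, h]` and `[K, h, d]` tensors (a reshaped view of the stacked matrices described above), the LayerNorm has no learned scale or offset, and the sizes are toy values.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the feature axis (no learned scale/offset in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def lda_forward(x, lang_onehot, D, U, D_b, U_b):
    """Eq. (1) with per-utterance language-dependent weights.

    x:           [B, T, d] output of the previous layer
    lang_onehot: [B, K]    one-hot language IDs
    D:           [K, d, h] down projections, one slice per language
    U:           [K, h, d] up projections, one slice per language
    D_b, U_b:    [K, h] and [K, d] bias terms
    """
    # Select one weight slice per utterance: D_x is [B, d, h], U_x is [B, h, d].
    D_x = np.einsum("bk,kdh->bdh", lang_onehot, D)
    U_x = np.einsum("bk,khd->bhd", lang_onehot, U)
    d_bias = lang_onehot @ D_b  # [B, h]
    u_bias = lang_onehot @ U_b  # [B, d]

    hidden = np.einsum("btd,bdh->bth", layer_norm(x), D_x) + d_bias[:, None, :]
    hidden = np.maximum(hidden, 0.0)  # ReLU
    out = np.einsum("bth,bhd->btd", hidden, U_x) + u_bias[:, None, :]
    return out + x  # residual connection

rng = np.random.default_rng(0)
B, T, d, h, K = 3, 5, 8, 2, 4  # toy sizes, not the paper's
x = rng.normal(size=(B, T, d))
lang_onehot = np.eye(K)[np.array([0, 2, 2])]  # utterance languages 0, 2, 2
D = rng.normal(size=(K, d, h))
U = rng.normal(size=(K, h, d))
D_b = np.zeros((K, h))
U_b = np.zeros((K, d))
y = lda_forward(x, lang_onehot, D, U, D_b, U_b)
```

Because the one-hot vector zeroes out every slice except the utterance's own language, gradients in such a layout would reach only that language's weights, which is the property the text relies on for mixed-language batches.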

While many aforementioned adapter variations like common adapters and balancing loss in training could potentially further promote the quality, we found in practice LDA is a good trade-off among the deployability, extensibility, and quality. New languages can simply be appended to the projection matrices with parameter-efficient finetuning. If we have new training data incoming, only a small portion of the full model needs to be updated, without soliciting extra infrastructure changes. More importantly, as we will show in experiments, such a straightforward scheme can already achieve impressive improvements over the existing models in the streaming dictation task, and the final performance can match the expensive full model finetuning on most of the tested tail languages.
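As a rough illustration of how a new language could be appended to the projection matrices, the sketch below grows each adapter tensor by one slot along the language axis. The helper name `append_language`, the `[K, d, h]` / `[K, h, d]` storage layout, and the initialization (small random down projection, zero up projection and biases) are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def append_language(D, U, D_b, U_b, rng):
    """Return adapter tensors with one extra language slot appended at index K."""
    K, d, h = D.shape
    new_D = rng.normal(scale=0.02, size=(1, d, h))  # small random init (assumption)
    new_U = np.zeros((1, h, d))                     # zero init (assumption)
    D = np.concatenate([D, new_D], axis=0)
    U = np.concatenate([U, new_U], axis=0)
    D_b = np.concatenate([D_b, np.zeros((1, h))], axis=0)
    U_b = np.concatenate([U_b, np.zeros((1, d))], axis=0)
    return D, U, D_b, U_b

rng = np.random.default_rng(0)
K, d, h = 4, 8, 2  # toy sizes
D = rng.normal(size=(K, d, h))
U = rng.normal(size=(K, h, d))
D_b = np.zeros((K, h))
U_b = np.zeros((K, d))
D2, U2, D_b2, U_b2 = append_language(D, U, D_b, U_b, rng)
```

With a zero up projection and zero biases, the new slot initially contributes nothing to the residual path, so the new language starts from the frozen backbone's behavior while the existing slots are left untouched.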

3.2 Noisy Student Training

The general idea of NST is to use a teacher model trained with supervised data to transcribe unlabeled data in a progressive manner. As better privacy protection has become a societal consensus [29], NST plays an increasingly critical role as the amount of supervised data shrinks. In our task, we adopt a non-streaming model as the initial teacher, trained with supervised data and SpecAugment [30]. Then we fuse the teacher with an off-the-shelf LM. The unlabeled data are transcribed using the fused model and a portion of the transcriptions are selected based on the normalized filtering score [20]. The selected portion is then mixed with the supervised data for training the student model. We run 4 iterations in total, and the student trained in the last iteration is the LDA model.
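The iterative procedure can be summarized with the following schematic loop. It is a generic sketch, not the authors' pipeline: `train`, `transcribe`, `score`, and `threshold` are hypothetical stand-ins (the actual filtering uses the normalized score of [20], which is not reproduced here), and the sketch does not distinguish the non-streaming teacher from the final streaming LDA student.

```python
def noisy_student_training(labeled, unlabeled, train, transcribe, score,
                           threshold, num_iterations=4):
    """Generic NST loop built from hypothetical callables.

    labeled:    list of (audio, transcript) pairs
    unlabeled:  list of audio items
    train:      callable(dataset) -> model
    transcribe: callable(model, audio) -> transcript (teacher fused with an LM)
    score:      callable(model, audio, transcript) -> float filtering score
    """
    teacher = train(labeled)  # initial teacher, supervised data only
    for _ in range(num_iterations):
        pseudo = []
        for audio in unlabeled:
            hyp = transcribe(teacher, audio)
            if score(teacher, audio, hyp) >= threshold:  # keep confident labels only
                pseudo.append((audio, hyp))
        student = train(labeled + pseudo)  # mix supervised and pseudo-labeled data
        teacher = student                  # the student becomes the next teacher
    return teacher

# Toy stand-ins so the loop runs end to end.
dataset_sizes = []

def toy_train(dataset):
    dataset_sizes.append(len(dataset))
    return {"num_examples": len(dataset)}

def toy_transcribe(model, audio):
    return audio.upper()

def toy_score(model, audio, hyp):
    return len(audio)

final_model = noisy_student_training(
    labeled=[("aa", "AA")], unlabeled=["b", "ccc", "dddd"],
    train=toy_train, transcribe=toy_transcribe, score=toy_score,
    threshold=3, num_iterations=4)
```

In the toy run, the short item `"b"` is filtered out in every iteration, so each student is trained on one labeled and two pseudo-labeled examples.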

4 Experiments

Fig. 3: We further compare our model with the full model finetuning. The yellow bars are the same as Fig. 2. The green bars represent the gap between our model and updating all the parameters for individual locales. As shown in the figure, for most locales, our LDA’s performance is on par with the full model finetuning, while ours only updates a small portion of all the parameters. Even on other languages like Czech, Hebrew, the yellow bars outweigh the green bars. The blue bars demonstrate our improvements over the baselines trained with supervised data only, proving the contributions from NST.

4.1 Datasets

We use a dictation dataset across 39 locales, spanning scripts such as Latin (Albanian, Icelandic, Slovak), Arabic (Levant, Maghrebi), Cyrillic (Macedonian, Kazakh), Devanagari (Nepali), etc. These are all tail languages in the online traffic. For instance, Slovak utterances only add up to 14 hours, which is not even close to 0.01% of the English data. All the available supervised data are anonymized and manually transcribed. We have in total 15K hours of audio across 39 languages. The total duration for each language ranges from approximately 14 to 4K hours. Besides the human transcriptions, we also incorporate unlabeled data in noisy student training. The amount of unlabeled data is usually much larger than that of the labeled data. As an example, while Slovak only has 14 hours of human transcriptions, 700 hours of audio-only Slovak utterances are also available. Similarly, on Uzbek, besides the 20 hours of human transcriptions, there are also 670 hours of audio-only data. Overall, we have 150K unsupervised data and they are effectively utilized using the NST scheme.

The test set for each language consists of 12K utterances on average. Test sets are collected separately from the online dictation traffic, and reserved for testing exclusively. The test data are also anonymized and manually transcribed for the evaluation purpose.

4.2 Architecture

We use the same core architecture as [13, 12]. The inputs are 128-dimensional log Mel filterbank computed on 32ms windows with a 10ms hop. 4 contiguous frames are stacked to form a 512-dim input representation with a 30ms frame rate. We use SpecAugment to improve the robustness of our model against noise. In practice, 2 frequency masks with max length 27 and 2 time masks with max length 50 are used.
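The relation between the 128-dim frames at a 10ms hop and the 512-dim input at a 30ms rate can be sketched as follows. The text states only the stacking of 4 frames and the resulting frame rate; keeping every third stacked frame (`stride=3`) and stacking each frame with its following neighbors are assumptions made here to reproduce those numbers, and `stack_and_subsample` is a hypothetical helper name.

```python
import numpy as np

def stack_and_subsample(features, stack=4, stride=3):
    """Stack `stack` contiguous frames, then keep every `stride`-th stacked frame.

    features: [T, 128] log-Mel frames at a 10 ms hop
    returns:  [T', 128 * stack] frames at a (10 * stride) ms rate
    """
    num_frames, _ = features.shape
    starts = range(0, num_frames - stack + 1, stride)
    return np.stack([features[s:s + stack].reshape(-1) for s in starts])

# One second of dummy 128-dim frames (100 frames at a 10 ms hop).
log_mel = np.arange(100 * 128, dtype=np.float32).reshape(100, 128)
stacked = stack_and_subsample(log_mel)
print(stacked.shape)
```

Each output row concatenates 4 × 128 = 512 values, and consecutive rows start 3 input frames (30ms) apart.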

We use ten 512-dim Conformer layers in the causal encoder. Each Conformer layer contains causal convolution and left-context attention layers, which strictly excludes future inputs. For each self-attention layer, there are 8 heads and the Convolution kernel size is 15. The non-causal encoder consists of 7 cascaded Conformer layers. Each Conformer layer is followed by an adapter. Every adapter has a down projection layer and an up projection layer. As described in section 3.1, each adapter accommodates 39 languages.
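For a sense of scale, the per-language adapter size follows directly from the projection shapes in Section 3.1. The sketch below counts the projection and bias parameters for one language across the 10 causal and 7 non-causal layers. The bottleneck dimension $h$ is not stated in this section, so the values tried here are purely hypothetical, and any LayerNorm parameters are left out of the count.

```python
def lda_params_per_language(d, h, num_layers, use_bias=True):
    """Trainable adapter parameters for one language across all adapted layers."""
    per_layer = d * h + h * d  # down and up projections
    if use_bias:
        per_layer += h + d     # this language's rows of D_b and U_b
    return per_layer * num_layers

num_layers = 10 + 7  # causal + non-causal Conformer layers
for h in (16, 32, 64):  # hypothetical bottleneck sizes
    n = lda_params_per_language(d=512, h=h, num_layers=num_layers)
    print(f"h={h}: {n:,} adapter parameters per language")
```

The count grows linearly in $h$, which is the knob that trades adapter capacity against the per-language overhead.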

The transducer decoder consists of an embedding prediction network and a joint network. Models use the same vocabulary with 4,096 wordpieces shared by each locale, generated from the pool of transcripts across all languages. The prediction network operates on the previous 2 non-blank model predictions and maps each to a 640-dimensional embedding using a separate 640 × 4,096 dimensional embedding table. The joint network has 640 dimensions.

Our model is trained with TensorFlow under the Lingvo framework [31] on Tensor Processing Units (TPUs) [32]. The transducer loss [33] is chosen for the model training using the factorization proposed by Hybrid Autoregressive Transducer (HAT) [34] to allow the incorporation of a language model. FastEmit [35] is used with a regularization weight of 5e-3. The batch size is 4,096. Up to 512 TPU cores are used during the optimization using synchronized stochastic gradient descent. The Adam optimizer [36] is configured with parameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The exponential moving average and the learning rate schedule, with peak learning rate 1.8e-3, follow [37] to stabilize the weight updates.
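The schedule of [37] warms the learning rate up linearly and then decays it with the inverse square root of the step. A minimal sketch, rescaled so that the maximum equals the stated peak of 1.8e-3, is given below together with a single exponential-moving-average update. The warmup length and the EMA decay are not given in the text; the values used here are placeholders.

```python
def transformer_lr(step, peak_lr=1.8e-3, warmup_steps=10_000):
    """Linear warmup to peak_lr, then inverse-square-root decay.

    warmup_steps is a placeholder; the paper does not state it.
    """
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)

def ema_update(ema_value, new_value, decay=0.9999):
    """One exponential-moving-average step for a single weight (decay is a placeholder)."""
    return decay * ema_value + (1.0 - decay) * new_value

for step in (1_000, 10_000, 40_000):
    print(step, transformer_lr(step))
```

The rate reaches its peak exactly at `warmup_steps` and halves once the step count is four times the warmup length.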

4.3 Results

We first compare our LDA results with the strong existing end-to-end model [13] in streaming MASR. Compared to [13] (baseline), our model incorporates efficient language-aware adapter modules and covers four times as many validated languages (all tail languages, excluding mainstream ones such as English). In Fig. 2, LDA achieves improvements on most languages. To highlight the difference between our results and the baseline, we stack the difference on top of our WER numbers (yellow bars). During the LDA finetuning, different locales may obtain peak performance at different checkpoints. However, since the backbone foundation model is frozen, checkpoints only differ in the adapter module. This was a critical consideration in choosing the adapter module for this task. We can then collect, for each language, the checkpoint at its own best step, which synchronizes the peak performance across languages while keeping the merits of end-to-end ASR models. Meanwhile, the light-weight adapters can minimize the risk of over-fitting, which hurts the performance. Even if the adapter finetuning still over-fits on some locale, one can opt to zero out the corresponding weights to skip the adapter module. As a result, we can ensure that adding adapters does not degrade any locale. Fig. 2 demonstrates this point. On 27 out of 39 locales tested, we achieved a noticeable improvement of 17% on average. On Slovak, Serbian, and Hungarian, the relative improvements reach 37.5%, 37.3%, and 35.6% respectively. On 12 locales, the relative WER improvements are over 20%. On 20 locales, the relative WER improvements are at least 10%. The average gain across all languages is over 12%, which validates the effectiveness of the LDA finetuning.
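The checkpoint assembly described above can be sketched as a per-language selection followed by a copy of that language's adapter slice. This is an illustrative sketch under assumptions: the function name `merge_lda_checkpoints`, the dictionary-based checkpoint format, and the use of a baseline WER to decide when to zero out an adapter are hypothetical, and the same routine would need to be applied to each adapter tensor ($D$, $U$, $D_b$, $U_b$) of every layer.

```python
import numpy as np

def merge_lda_checkpoints(checkpoints, wer, baseline_wer):
    """Assemble one adapter tensor from each language's best checkpoint.

    checkpoints:  dict step -> array [K, ...] of adapter weights saved at that step
    wer:          dict step -> array [K] of per-language WERs at that step
    baseline_wer: array [K] of WERs for the frozen backbone without adapters
    """
    steps = sorted(checkpoints)
    wer_table = np.stack([wer[s] for s in steps])  # [num_steps, K]
    best_idx = wer_table.argmin(axis=0)            # best checkpoint per language
    merged = np.zeros_like(checkpoints[steps[0]])
    chosen = {}
    for k, idx in enumerate(best_idx):
        if wer_table[idx, k] < baseline_wer[k]:
            merged[k] = checkpoints[steps[idx]][k]
            chosen[k] = steps[idx]
        else:
            chosen[k] = None  # leave zeros: the adapter is skipped for language k
    return merged, chosen

# Toy example with K = 3 languages and two saved steps.
checkpoints = {1000: np.full((3, 2), 1.0), 2000: np.full((3, 2), 2.0)}
wer = {1000: np.array([10.0, 20.0, 30.0]), 2000: np.array([12.0, 18.0, 31.0])}
baseline_wer = np.array([11.0, 25.0, 29.0])
merged, chosen = merge_lda_checkpoints(checkpoints, wer, baseline_wer)
```

In the toy run, language 0 takes its slice from step 1000, language 1 from step 2000, and language 2 keeps zero weights because neither checkpoint beats its baseline.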

We further compare our model with the full model training (also with NST) where all the parameters, including the foundation model, are learnable during the training for each language in both passes. Such monolingual finetuning is often adopted in the existing literature [38]. Though such strategies are less favorable in real-world use since weights often change drastically during the finetuning, they sketch the best performance under the given model capacity, and thus serve as a lower bound for WER. Note that the full model finetuning results also encompass the results from [12], where only the 2nd pass is tunable. Fig. 3 compares our LDA finetuning results with the full model finetuning results. In this figure, the yellow bars have the same meaning as in Fig. 2, denoting the WER reductions brought by LDA finetuning. The green bars represent the gaps between the LDA finetuning and the full model finetuning.

On average, the relative gap between ours and the full finetuning performance is less than 1%. Up to 32 languages are less than 2% worse than the full monolingual adaptation. Even the languages with noticeable gaps between ours and the best WERs, like Czech and Hebrew, often have a more significant gain compared to the baseline model (yellow bars are much longer than green bars). On Czech, while the full monolingual adaptation is 9% better than the LDA, LDA is already 22% better than the baseline. Hungarian also has a 35.6% performance gain brought by our method, even though there still exists a 5% gap to the best performance. Overall, this figure complements Fig. 2, illustrating that LDA adaptation can match or approach the full model finetuning in this task, and proving the value of our proposed LDA modules.
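The two relative quantities compared here can be computed as in the short sketch below. The WER values are invented solely to illustrate the arithmetic and are not taken from the paper; measuring the gap relative to the LDA WER is also an assumption about how the green bars are defined.

```python
def relative_wer_reduction(reference_wer, new_wer):
    """Relative WER reduction of `new_wer` with respect to `reference_wer`."""
    return (reference_wer - new_wer) / reference_wer

# Hypothetical WERs chosen only to illustrate the arithmetic.
baseline, lda, full_ft = 20.0, 15.6, 14.2
gain_over_baseline = relative_wer_reduction(baseline, lda)  # about 0.22
gap_to_full_ft = relative_wer_reduction(lda, full_ft)       # about 0.09
print(f"{gain_over_baseline:.1%} gain, {gap_to_full_ft:.1%} remaining gap")
```

A locale with these numbers would show a yellow bar more than twice as long as its green bar.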

Ablation. To evaluate the value of NST, in Fig. 3, we further compare our results with a model finetuned with the supervised data only (i.e., without NST). On average, this model is 6.4% worse than the one with NST incorporated (blue bars). Therefore, both NST and LDA contribute to the WER reductions.

5 Conclusion

In this paper, we show promising improvements in the streaming multilingual ASR system on a tough dictation dataset. We employ the agile LDA adapters and noisy student training with the frozen foundation model, to minimize the changes required while optimizing for the performance. We test our model on a large scale with up to 39 tail languages and achieve impressive 12.2% relative gains. In the future, we will continue to explore other adapters and improve the inference speed as well.

References

  • [1] Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert, “Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters,” in Proc. Interspeech, 2020.
  • [2] Bo Li, Ruoming Pang, Tara N Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W Ronny Huang, Min Ma, and Junwen Bai, “Scaling end-to-end models for large-scale multilingual ASR,” in Proc. ASRU, 2021.
  • [3] Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” arXiv preprint arXiv:2111.09296, 2021.
  • [4] Bo Li, Ruoming Pang, Yu Zhang, Tara N Sainath, Trevor Strohman, Parisa Haghani, Yun Zhu, Brian Farris, Neeraj Gaur, and Manasa Prasad, “Massively multilingual ASR: A lifelong learning solution,” in Proc. ICASSP, 2022.
  • [5] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [6] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al., “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.
  • [7] Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, et al., “Efficient domain adaptation for speech foundation models,” in Proc. ICASSP, 2023.
  • [8] OpenAI, “GPT-4 technical report,” 2023.
  • [9] Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, and Tara N Sainath, “Joint unsupervised and supervised training for multilingual ASR,” in Proc. ICASSP, 2022.
  • [10] Tara N Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Weiran Wang, David Qiu, Chung-Cheng Chiu, Rohit Prabhavalkar, Alexander Gruenstein, Anmol Gulati, et al., “Improving the latency and quality of cascaded encoders,” in Proc. ICASSP, 2022, pp. 8112–8116.
  • [11] Bo Li, Anmol Gulati, Jiahui Yu, Tara N Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, et al., “A better and faster end-to-end model for streaming ASR,” in Proc. ICASSP, 2021.
  • [12] Sepand Mavandadi, Bo Li, Chao Zhang, Brian Farris, Tara N Sainath, and Trevor Strohman, “A truly multilingual first pass and monolingual second pass streaming on-device ASR system,” in Proc. SLT, 2023.
  • [13] Bo Li, Tara N Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, et al., “A language agnostic multilingual streaming on-device ASR system,” in Proc. Interspeech, 2022.
  • [14] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [15] Austin Waters, Neeraj Gaur, Parisa Haghani, Pedro Moreno, and Zhongdi Qu, “Leveraging language ID in multilingual end-to-end speech recognition,” in Proc. ASRU, 2019.
  • [16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly, “Parameter-efficient transfer learning for NLP,” in Proc. ICML, 2019.
  • [17] Wenxin Hou, Han Zhu, Yidong Wang, Jindong Wang, Tao Qin, Renjun Xu, and Takahiro Shinozaki, “Exploiting adapters for cross-lingual low-resource speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 317–329, 2021.
  • [18] Qiujia Li, Bo Li, Dongseong Hwang, Tara N Sainath, and Pedro M Mengibar, “Modular domain adaptation for Conformer-based streaming ASR,” in Proc. Interspeech, 2023.
  • [19] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le, “Self-training with noisy student improves ImageNet classification,” in Proc. CVPR, 2020.
  • [20] Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le, “Improved noisy student training for automatic speech recognition,” in Proc. Interspeech, 2020.
  • [21] Wenxin Hou, Yidong Wang, Shengzhou Gao, and Takahiro Shinozaki, “Meta-adapter: Efficient cross-lingual adaptation with meta-learning,” in Proc. ICASSP, 2021.
  • [22] Genta Indra Winata, Guangsen Wang, Caiming Xiong, and Steven Hoi, “Adapt-and-adjust: Overcoming the long-tail problem of multilingual speech recognition,” in Proc. Interspeech, 2021.
  • [23] Anjuli Kannan, Arindrima Datta, Tara N Sainath, Eugene Weinstein, Bhuvana Ramabhadran, Yonghui Wu, Ankur Bapna, Zhifeng Chen, and Seungji Lee, “Large-scale multilingual speech recognition with a streaming end-to-end model,” in Proc. Interspeech, 2019.
  • [24] Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau, “mSLAM: Massively multilingual joint pre-training for speech and text,” arXiv preprint arXiv:2202.01374, 2022.
  • [25] Yubei Xiao, Ke Gong, Pan Zhou, Guolin Zheng, Xiaodan Liang, and Liang Lin, “Adversarial meta sampling for multilingual low-resource speech recognition,” in Proc. AAAI, 2021.
  • [26] Eric Sun, Jinyu Li, Yuxuan Hu, Yimeng Zhu, Long Zhou, Jian Xue, Peidong Wang, Linquan Liu, Shujie Liu, Edward Lin, et al., “Building high-accuracy multilingual ASR with gated language experts and curriculum training,” arXiv preprint arXiv:2303.00786, 2023.
  • [27] Ke Hu, Bo Li, Tara N Sainath, Yu Zhang, and Francoise Beaufays, “Mixture-of-expert Conformer for streaming multilingual ASR,” arXiv preprint arXiv:2305.15663, 2023.
  • [28] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
  • [29] Paul Voigt and Axel Von dem Bussche, The EU General Data Protection Regulation (GDPR): A Practical Guide, 1st ed., Cham: Springer International Publishing, 2017.
  • [30] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019.
  • [31] Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al., “Lingvo: a modular and scalable framework for sequence-to-sequence modeling,” arXiv preprint arXiv:1902.08295, 2019.
  • [32] Norman P Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff Young, and David Patterson, “A domain-specific supercomputer for training deep neural networks,” Communications of the ACM, vol. 63, no. 7, pp. 67–78, 2020.
  • [33] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [34] Ehsan Variani, David Rybach, Cyril Allauzen, and Michael Riley, “Hybrid autoregressive transducer (HAT),” in Proc. ICASSP, 2020.
  • [35] Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, et al., “FastEmit: Low-latency streaming ASR with sequence-level emission regularization,” in Proc. ICASSP, 2021.
  • [36] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Proc. NeurIPS, 2017.
  • [38] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, 2021.