Vrunda N. Sukhadia, Shammur Absar Chowdhury

Children’s Speech Recognition through Discrete Token Enhancement

Abstract

Children’s speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children’s speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explore single-view and multi-view strategies for creating these discrete labels. Furthermore, we test the models’ generalization capabilities on unseen-domain and unseen-nativity datasets. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.

keywords:
Child Speech Recognition, Discrete speech tokens, Ensembling, Multi-view clustering
This paper was accepted at Interspeech 2024.

1 Introduction

Automatic Children’s speech recognition has recently attracted significant attention from research communities. One of the main reasons for such attention is that children increasingly interact with voice-activated assistants and technologies. This trend underscores the potential benefits of ASR technologies tailored for children, which can revolutionize learning tools, such as automated reading assessments [1] and interactive reading tutors [2] among others. These applications promise to enhance language acquisition for both native and non-native learners with immediate and multimodal feedback.

However, designing children’s ASR has its unique challenges. Unlike adult ASR, children’s ASR is limited in resources and is still considered a low-resource task. This is because there is a lack of large-scale publicly available children’s data, and collecting and annotating such datasets is expensive and also faces many difficulties due to privacy and ethical considerations [3, 4, 5]. Moreover, many studies have consistently highlighted the disparities between child and adult ASR performance, especially in English, due to difficulties in acoustic and language modeling [6, 7, 8, 9, 10, 11, 12]. The variabilities seen in children’s speech data are due to the differences in speech development rates (inter-speaker variability) and evolving pronunciation skills within an individual child over time (intra-speaker variability). Moreover, children’s speech includes significant mispronunciations and disfluencies, making it harder to annotate and model [13, 14].

Self-supervised learning (SSL) models have shown remarkable improvement in performance for various speech tasks [15, 16], while reducing the dependency on extensively annotated datasets [17]. Studies such as [18, 19, 20, 21, 22] have shown the efficacy of SSL models in improving child speech recognition, either by using them as robust feature extractors or by fine-tuning the pre-trained model on specific datasets. A few studies have also examined the information about children’s speech encoded in pre-trained SSL models [23, 24, 25].

Recent studies [26, 27, 28] have highlighted the usefulness of discrete speech units to represent speech signals, and their effects on ASR performance. Such compression not only reduces the storage and transmission size but also retains the essential acoustic and linguistic information while handling speaker variability better. This strategy also has the potential to handle privacy concerns, always faced when dealing with children’s data.

Therefore, in this study, we design an end-to-end English children’s ASR system using discrete units as input to the models. Our proposed framework exploits the frame-level embeddings from pre-trained SSL models and quantizes them to a handful of discrete tokens using k-means clustering models, considering the representation either from a single SSL model (single-view representation) or from multiple SSL models (multi-view representation). These discrete tokens are then passed to an end-to-end ASR model.

We compare our proposed discrete ASR with an ASR trained on continuous embedding extracted from the pretrained HuBERT and WavLM model. Additionally, we compare the designed ASR system with results obtained using the state-of-the-art Whisper model [29] in both zero-shot and fine-tuned settings as the upper-bound for the study. Furthermore, we show its efficacy when tested on unseen datasets, including (i) unseen domain, and (ii) non-native English datasets with both read and spontaneous speech style.

Our contributions in this paper are as follows:

  • Design and benchmark an end-to-end discrete ASR for children’s speech on native and non-native children’s datasets.

  • Explore a multi-view clustering strategy to design discrete tokens and compare it with the single-view method.

  • Show the potential of discrete-token input for children’s ASR, while testing the generalization capability for unseen domains, speaking styles, and nativity compared to the state-of-the-art Whisper model family.

To the best of our knowledge, this is the first study to explore the effectiveness of discrete tokens in single and multi-view settings for children ASR.

2 Methodology

Figure 1 gives an overview of our proposed discrete children’s ASR. Given an input utterance $X=[x_1, x_2, \cdots, x_T]$ of $T$ frames, the frame-level representation ($Z$) is first extracted from a pretrained SSL model. A discrete codebook $\mathbb{C}$ is then trained with the frame-level $Z$ from the sampled utterances. For training the discrete codebook, we followed two different strategies utilizing either a single representation or a multi-view representation from pretrained models. We then utilize the trained $\mathbb{C}$ to infer $\hat{Z}=\mathbb{C}(Z)$, and use the discrete labels as input to the encoder-decoder ASR model.

2.1 Discrete Codebook

We opt for a simple vector quantization [30, 15] technique for approximating frame-level embeddings through a fixed codebook size. We utilize a sequence of continuous feature vectors $Z=\{z_1, z_2, \ldots, z_T\}$ and then assign each $z_t$ to its nearest neighbor in the trained codebook, $\mathbb{C}$, with the code $Q_i \in \mathbb{C}$ assigned to the centroid $G_i$. The resultant discrete labels form the quantized sequence $\hat{Z}=\{\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_T\}$.
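The nearest-neighbor assignment described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and uses toy 2-D vectors in place of real SSL embeddings; the function name `quantize` and the example values are hypothetical and not taken from the paper's implementation.

```python
import math

def quantize(frames, centroids):
    """Map each frame vector z_t to the index of its nearest centroid.

    Assumes Euclidean distance; the index i plays the role of the code Q_i.
    """
    labels = []
    for z in frames:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(z, centroids[i]))
        labels.append(nearest)
    return labels

# Toy 2-D example (hypothetical values); real frames would be SSL embeddings.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.9), (0.8, 0.9)]
labels = quantize(frames, centroids)
print(labels)  # [0, 1, 2, 1]
```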

To train the codebook, we opt for two different strategies: (i) Single-View ($D^{(S)}$), and (ii) Multi-View Codebook ($D^{(MV)}$). For the single-view strategy, we trained a simple k-means clustering model using the representation from a pretrained SSL model. For the multi-view strategy, we considered the representations (or views $V^{(1)}$ and $V^{(2)}$) from two different SSL models and trained a k-means clustering model. Given the conditional independence of $V^{(1)}$ and $V^{(2)}$, the strategy maximizes (M) the log-likelihood of each view, given the expected values for the hidden variables of the other view from the previous iteration, and then calculates the expectation (E) for the hidden variables given the model parameters of that view. The parameters of both views are thus optimized with EM [31]. The optimization process is terminated when the improvement in log-likelihood has plateaued for a fixed number of iterations in each view. The final discrete label (during inference) is then assigned to the cluster that has the largest averaged posterior over both views.
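As a small illustration of the inference step only, the sketch below picks the cluster whose posterior, averaged over the two views, is largest. It assumes the per-view posteriors for a frame are already available; the function name and the numbers are hypothetical, and the EM training loop itself is not shown.

```python
def multiview_label(post_v1, post_v2):
    """Return the cluster index with the largest posterior averaged over two views."""
    if len(post_v1) != len(post_v2):
        raise ValueError("both views must score the same set of clusters")
    averaged = [(p1 + p2) / 2.0 for p1, p2 in zip(post_v1, post_v2)]
    return max(range(len(averaged)), key=averaged.__getitem__)

# Hypothetical posteriors over three clusters for a single frame.
p_view1 = [0.6, 0.3, 0.1]   # view 1 alone would pick cluster 0
p_view2 = [0.1, 0.5, 0.4]   # view 2 alone would pick cluster 1
label = multiview_label(p_view1, p_view2)
print(label)  # 1
```

In this toy case the two views disagree, and averaging resolves the tie in favor of the cluster that both views find plausible.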

Refer to caption
Figure 1: Discrete children ASR with single-view and multi-view discrete input.

The resultant discrete labels $\hat{Z}$ are temporally aligned with $Z$ and include repeated or commonly co-occurring units. We followed steps such as de-duplication and subword modeling to reduce such redundancies, as proposed in [27]. For de-duplication, we merge consecutive subsequences of identical tokens into a single token. We then transform the discrete sequence into a meta-token sequence using the SentencePiece unigram model [32].
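The de-duplication step can be sketched as follows. This is an illustrative example with made-up unit ids; the subword (SentencePiece) step is not shown.

```python
from itertools import groupby

def deduplicate(tokens):
    """Collapse each run of identical consecutive tokens into a single token."""
    return [tok for tok, _ in groupby(tokens)]

# Hypothetical frame-level cluster ids for a short utterance.
units = [17, 17, 17, 204, 204, 9, 17, 17]
print(deduplicate(units))  # [17, 204, 9, 17]
```

Only consecutive repeats are merged, so a unit that reappears later in the sequence (17 in this example) is kept.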

2.2 Pretrained SSL

Given an input utterance, we extracted the representation using the following pretrained models:

  • facebook/HuBERT-large-ll60k: HuBERT identifies acoustic units by employing a clustering method to generate target labels corresponding to input features. Subsequently, masking is employed on the input features, and training is carried out to minimize the masked prediction loss using cluster labels as targets. This model comprises 316M parameters.

  • microsoft/WavLM-large: WavLM introduces gated relative position bias into the transformer architecture. In addition to employing masked prediction loss akin to HuBERT, it also integrates a denoising task during self-supervised learning. This model comprises 316M parameters.

2.3 ASR Architecture

For the children’s ASR, an E-Branchformer [33] encoder and a Transformer decoder [34] are trained jointly with Connectionist Temporal Classification (CTC)/attention multi-task learning. E-Branchformer is an improved version of Branchformer [35] with two parallel branches and Macaron-style feed-forward networks: one branch captures global context using multi-head self-attention (MHSA), while the second branch captures local contextual information using a multi-layer perceptron with convolutional gating (cgMLP). The two branches are then merged by a concatenation operation, a 1-D depth-wise convolution, and a linear projection. The Transformer decoder is used as the decoder part of the sequence-to-sequence model; it comprises an extra masked self-attention layer on top of an MHSA and a feed-forward layer. The hyperparameters used for the experiments are shown in Table 1.

Table 1: Discrete Children ASR Model Configuration

| Hyperparameter | Value |
|---|---|
| Kernel size | 31 |
| Feature dimension | 512 |
| # encoder layers | 12 |
| Encoder units | 1024 |
| # decoder layers | 6 |
| Decoder units | 2048 |
| Attention heads | 4 |
| Number of target BPE (byte pair encoding) | 5000 |
| Number of source BPE | 6000 |
| Number of clusters | 2000 |
| CTC weight | 0.3 |
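
For reference, joint CTC/attention training is commonly written as an interpolation of the two losses. Assuming the CTC weight listed in Table 1 is that interpolation coefficient $\lambda$ (the convention used in ESPnet recipes), the training objective would read:

```latex
\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad \lambda = 0.3
```

Here $\mathcal{L}_{\mathrm{CTC}}$ is computed on the encoder output and $\mathcal{L}_{\mathrm{att}}$ is the decoder cross-entropy loss; the symbols are introduced only for this illustration.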

3 Experimental Settings

3.1 Dataset

The My Science Tutor (MyST) Corpus [36] is a collection of American English datasets featuring child speech, totaling over 393 hours from grades 3 to 5. The dataset features dialogs between virtual tutors and students, discussing various scientific concepts. For the empirical study, we opt for 221 hours of the transcribed dataset, filtering out very short (0.1 seconds and below) and very long (60 seconds and above) utterances. This preprocessing helped to reduce the memory needed to train the models. We then use the official data splits as the train (167.48 hours), validation (25.60 hours), and test (27.95 hours) sets. For discrete codebook training, 10% of the training dataset, which amounts to 16.7 hours, is used.

The CMU Kids Speech Corpus (http://www.ldc.upenn.edu/Catalog/LDC97S63.html) is a collection of children’s speech datasets containing 76 speakers, where the majority of the speakers are from grades 1 to 3. The age range of the children spans from six to eleven years old, with a distribution of 24 male and 52 female speakers. The whole dataset includes a total of 5180 read utterances. We opted to use only $\approx$ 2.06 hours (22% of the total data) of read sentences as the unseen domain and age test set. (Utterances considered for the test have in-depth error analysis; the ids and information are collected from https://isip.piconepress.com/projects/speech/databases/kids_speech.)

Non-Native children’s speech corpus [37] is a collection of English read and spontaneous speech data from 20 bilingual (Telugu-English) children aged 8 to 12 with English proficiency. The dataset is gender-balanced (11 female and 9 male speakers) and is essential to test our proposed model’s generalization capabilities for non-native speakers.

3.2 ASR Experiments

Baselines

We opt for two strong ASR baselines using the pretrained SSL models: HuBERT and WavLM. These models serve as feature extractors, providing rich and continuous contextual representations. The final input representation is obtained by computing the weighted sum of the embeddings from all layers. We then use the same encoder-decoder ASR architecture described in Section 2.3. These baseline models serve as reference points to measure the relative performance of the discrete ASR model.
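The weighted sum over layers can be sketched as below. The paper only states that a weighted sum is used; normalizing the per-layer weights with a softmax, as well as the function name and toy values, are assumptions made for this illustration.

```python
import math

def weighted_layer_sum(layer_outputs, layer_logits):
    """Combine per-layer frame embeddings into one sequence.

    layer_outputs: list over layers; each layer is a list of T frames,
                   each frame a list of D floats.
    layer_logits:  one scalar per layer (learned in practice); they are
                   softmax-normalized here, which is an assumption.
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one weight per layer")
    peak = max(layer_logits)
    exps = [math.exp(w - peak) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs))
         for d in range(dim)]
        for t in range(num_frames)
    ]

# Two hypothetical layers, two frames, two dimensions.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
combined = weighted_layer_sum(layers, [0.0, 0.0])  # equal logits give a plain average
print(combined)
```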

Toplines

We compare the performance of the discrete ASRs with the readily available Whisper [29] models in zero-shot and fine-tuned settings to understand the upper-bound performance. Whisper is a Transformer-based encoder-decoder model trained on 680,000 hours of labeled speech data annotated through weak supervision. These models underwent training using multilingual datasets. For the zero-shot settings, we present upper-bound results using two different model sizes, small (244M parameters) and medium (769M parameters) (https://huggingface.co/openai/{whisper-small.en,whisper-medium.en}). We utilize checkpoints for models trained exclusively on English data for the ASR task using 563,000 hours of data. We fine-tuned the Whisper model with 55 hours of the MyST training set to mimic a few-shot setting. Similarly, we also evaluate the Whisper model fine-tuned on the entire MyST training data, utilizing both the small and medium English-only model checkpoints (https://huggingface.co/aadel4/{kid-whisper-small-en-myst,kid-whisper-medium-en-myst}).

3.3 Model Training

3.3.1 Discrete Codebook Training

The codebook responsible for generating discrete tokens from the SSL features is trained using k-means clustering in both single-view and multi-view scenarios. Consistent settings are applied to ensure a fair comparison between the two methods. For all the settings, the number of clusters is set to 2000, motivated by the success reported in [26], providing a fine granularity in representing the feature space. The k-means++ initialization method is used to enhance the clustering process. Additionally, the number of random initializations ($N_{init}$) is set to 10, considering multiple starting points to achieve a better overall clustering solution. The maximum number of iterations ($max_{iter}$) is limited to 100. These settings balance computational efficiency with clustering accuracy, effectively capturing the essential characteristics of the SSL features. By maintaining these consistent parameters across both single-view and multi-view scenarios, we aim to provide a robust comparison of the clustering performance and the resulting impact on the discrete token generation.

3.3.2 ASR Model training

The architecture specified in Section 2.3 is adopted for training single-view and multi-view discrete ASR models. The ESPnet [38] recipe (https://github.com/espnet/espnet/tree/master/egs2/librispeech_100/asr2) is employed for training, utilizing two 32GB V100 GPUs. The models are trained using a learning rate of 0.002, with a warmup learning rate scheduler and the Adam optimizer across 100 epochs. Additionally, to augment the training data and enhance model robustness, the SpecAugment technique is applied to the input, facilitating better generalization.

Table 2: Reported WER ($\downarrow$) presenting the baselines (HuBERT-E2E and WavLM-E2E) and the topline results using the Whisper pre-trained model in zero-shot (0) and fine-tuned with 55 hours and all (All) MyST training data. $\Delta = |\text{WavLM-}D^{(S)} - *|$, where $*$ denotes each of the other ASR results. Whisper-S, M: Whisper small (244M parameters) and medium (769M) models. Discrete token results are reported using HuBERT and WavLM models here.

| Group | Model | WER ($\Delta$) |
|---|---|---|
| Discrete Single-View ASRs | HuBERT-$D^{(S)}$ | 15.65 |
| Discrete Single-View ASRs | WavLM-$D^{(S)}$ | 14.22 |
| Baseline ASRs | HuBERT-E2E | 14.98 (0.67) |
| Baseline ASRs | WavLM-E2E | 13.27 (0.95) |
| Topline ASRs | Whisper-S (0) | 13.93 (0.29) |
| Topline ASRs | Whisper-M (0) | 12.9 (1.32) |
| Topline ASRs | Whisper-S (55 hrs) | 13.23 (0.99) |
| Topline ASRs | Whisper-M (55 hrs) | 14.4 (0.18) |
| Topline ASRs | Whisper-S (All) | 9.11 (5.11) |
| Topline ASRs | Whisper-M (All) | 8.91 (5.31) |

Table 3: Reported WER ($\downarrow$) presenting the results with discrete labels using HuBERT and WavLM for single- and multi-view representation along with the topline results using the Whisper pre-trained model in zero-shot (0) and fine-tuned with 55 hours and All MyST training data. U: Unseen domain/data. Whisper-M: Whisper medium model (769M parameters). All discrete models have 40.36M parameters. CMUk: CMU kids test subset.

| Group | Model | Seen: MyST | U: Domain, CMUk | U: Non-native, Read | U: Non-native, Spont. |
|---|---|---|---|---|---|
| Single-View Discrete Tokens | HuBERT-$D^{(S)}$ | 15.65 | 47.78 | 38.40 | 64.63 |
| Single-View Discrete Tokens | WavLM-$D^{(S)}$ | 14.22 | 45.60 | 32.01 | 60.84 |
| Multi-View Discrete Tokens | $D^{(MV)}$ | 15.37 | 46.60 | 38.20 | 63.35 |
| Topline | Whisper-M (0) | 12.9 | 32.1 | 30.38 | 50.59 |
| Topline | Whisper-M (All) | 8.91 | 47.64 | 37.71 | 49.57 |

4 Results

We report the Word Error Rate (WER) for all the ASRs. The WER results are computed on normalized text, utilizing the BasicTextNormalizer from Whisper (https://github.com/openai/whisper/blob/main/whisper/normalizers/basic.py).
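
As a reminder of how the metric is defined, the sketch below computes WER as the word-level edit distance divided by the number of reference words. It is an illustration only: lowercasing stands in for Whisper's BasicTextNormalizer (which does more), the function name is hypothetical, and the strings are the reference and multi-view output from Table 4.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()   # lowercasing is a stand-in for real normalization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

ref = "A butterfly starts as an egg"
hyp = "a butterfly starts E as an egg"
print(round(word_error_rate(ref, hyp), 4))  # 0.1667 (one insertion over six words)
```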

4.1 Traditional vs Discrete Input

Table 2 reports WER for the discrete token ASRs and compares them with the baselines and with Whisper small and medium models in zero-shot, few-shot (55 hours), and fully fine-tuned settings. From the reported WER, we observed that discrete tokens perform comparably to the HuBERT and WavLM end-to-end models, with a small performance drop of $\Delta(WER)=0.67$ and $\Delta(WER)=0.95$ respectively. When compared with Whisper model variants (both zero- and few-shot), we noticed a maximum drop of $\Delta(WER)=1.32$, while with full MyST training data the drop goes to $\Delta(WER)=5.31$. The reported $\Delta(*)$ values are w.r.t. WavLM-$D^{(S)}$, except the 0.67 for HuBERT-E2E, which equals the gap to HuBERT-$D^{(S)}$ ($15.65-14.98$). Considering the model sizes (Whisper Medium: 769M, Whisper Small: 244M, and Discrete Token ASR: 40.36M) and the extensive data utilized in Whisper’s pre-training and subsequent fine-tuning, the Discrete Token ASR demonstrates nearly equivalent results while achieving an $\approx 83\%$ reduction in model size compared to Whisper Small and a $\approx 94\%$ reduction compared to Whisper Medium.
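
The size reductions quoted above can be checked directly from the parameter counts, taking Whisper Small as 244M parameters as stated in Section 3.2 and Table 2:

```latex
1 - \frac{40.36}{244} \approx 0.835 \quad (\text{vs. Whisper Small}),
\qquad
1 - \frac{40.36}{769} \approx 0.948 \quad (\text{vs. Whisper Medium})
```

These values are in line with the roughly 83% and 94% reductions reported in the text.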

Moreover, discrete ASR efficiently reduces data sizes and input length as discussed above. For example, for a $T$-second utterance, the raw input signal (with a 16 kHz sampling rate in 16-bit signed integer form) will need $16 \times 16000 \times T$ bits to encode; for SSL-based features at a rate of 50 frames per second, stored as float vectors with an output embedding dimension of 1024 from one layer, we need $32 \times 1024 \times 50 \times T$ bits. For discrete labels, we only need $11 \times 50 \times T$ bits for a maximum of 2048 clusters (11-bit), without even considering further improvement from de-duplication of the sequence and subword modeling.
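
The storage comparison above can be reproduced with a few lines of arithmetic. The sketch uses the figures from the text (16 kHz, 16-bit audio; 50 frames per second; 1024-dimensional 32-bit floats; at most 2048 clusters); the function names and the 10-second duration are chosen only for illustration.

```python
def bits_raw(seconds, sample_rate=16_000, bits_per_sample=16):
    """Bits needed for raw 16-bit PCM audio."""
    return bits_per_sample * sample_rate * seconds

def bits_ssl(seconds, frame_rate=50, dim=1024, bits_per_float=32):
    """Bits needed for one layer of continuous SSL features."""
    return bits_per_float * dim * frame_rate * seconds

def bits_discrete(seconds, frame_rate=50, num_clusters=2048):
    """Bits needed for one cluster id per frame, before de-duplication."""
    bits_per_label = (num_clusters - 1).bit_length()  # 11 bits for up to 2048 clusters
    return bits_per_label * frame_rate * seconds

T = 10  # seconds, an arbitrary example duration
print(bits_raw(T), bits_ssl(T), bits_discrete(T))  # 2560000 16384000 5500
```

For this 10-second example, the discrete labels take roughly 465 times fewer bits than the raw waveform and close to 3000 times fewer than the continuous features.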

4.2 Single-view vs Multi-view

For the study, we exploit two simple ways to convert continuous speech features into discrete units. Using single-view and multi-view strategies, we reported the results on MyST in Table 3. We observed that in a single-view setup, the WavLM discrete tokens outperform the HuBERT discrete tokens by 1.43 WER. We hypothesize that WavLM model embeddings are more robust due to its added utterance-mixing strategy, addressing the variability in child speech more efficiently.

For the multi-view setup, the performance of $D^{(MV)}$ is superior to the HuBERT-$D^{(S)}$ model. However, WavLM-$D^{(S)}$ still outperforms both the variations. This potentially indicates that the selection of robust SSL models is essential to harness the power of multi-view discrete tokens. We keep this as a future exploration.

4.3 Generalization Capabilities

To test the generalization capabilities of these discrete token ASRs, we evaluated them on two unseen test sets and report in Table 3 the WER of the single- and multi-view discrete ASRs along with the Whisper medium models in zero-shot and full settings (fine-tuned with the same full training data as the discrete models). We observed similar performance patterns across the datasets with different age groups (CMU kids data), nativity (non-native data), and speaking styles (read and spontaneous corpora). Similar to our previous observation, WavLM-$D^{(S)}$ outperforms all other discrete ASR systems and also gives comparable results to zero-shot Whisper models.

Table 4: Example of Discrete ASR outputs

| Source | Transcript |
|---|---|
| Ref | A butterfly starts as an egg |
| Verbatim | [noise] a butterfly starts /EH/ [human_noise] an egg [human_noise] [noise] |
| WavLM-$D^{(S)}$ | a butterfly starts I as an X |
| $D^{(MV)}$ | a butterfly starts E as an egg |

4.4 Error Analysis

For the study, we briefly studied the effect of added noises on the model performance. Our initial exploration suggests that, although different errors are present in all the discrete ASRs, the multi-view discrete ASR is closer to the verbatim form of the transcription. For example, as shown in Table 4, the multi-view ASR can recognize the word “egg” correctly, even though in spoken form the word is followed by significant human noises. Moreover, the inserted character “E” is closer to the phonemic /EH/ sound that was actually in the speech. Such fine-grained prediction could help to detect mispronunciation and disfluencies present in the data more effectively.

5 Conclusion

This study presents the first benchmark for children’s speech recognition with discrete tokens as input. From our exploration of discrete children’s ASR, we observed comparable ASR performance with a significant reduction in model size and computational costs. Moreover, the discrete ASR provides additional data privacy required when dealing with sensitive speech data like children’s speech. Our findings reflect the potential for multi-view discrete ASR, exploiting ensemble information encoded in separate SSL models. Future research will involve studying how to enhance these discrete tokens with views extracted from different SSL models and with different ASR architectures.

References

  • [1] K. Evanini and X. Wang, “Automated speech scoring for non-native middle school students with multiple task types,” in Proceedings of the INTERSPEECH, 2013.
  • [2] J. Mostow, “Why and how our automated reading tutor listens,” in Proceedings of the International Symposium on Automatic Detection of Errors in Pronunciation Training (ISADEPT), 2012.
  • [3] F. Claus, H. Gamboa Rosales, R. Petrick, H.-U. Hain, and R. Hoffmann, “A survey about databases of children’s speech,” in INTERSPEECH, 2013.
  • [4] J. Wang, Y. Zhu, R. Fan, W. Chu, and A. Alwan, “Low resource german asr with untranscribed data spoken by non-native children- interspeech 2021 shared task spapl system,” in Proc. Interspeech, 2021.
  • [5] S. Feng, B. M. Halpern, O. Kudina, and O. Scharenborg, “Towards inclusive automatic speech recognition,” Computer Speech & Language, vol. 84, 2024.
  • [6] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” J. Acoustical Soc. Amer., 1999.
  • [7] B. L. Smith, “Relationships between duration and temporal variability in children’s speech,” The Journal of the Acoustical Society of America, 1992.
  • [8] L. L. Koenig, J. C. Lucero, and E. Perlman, “Speech production variability in fricatives of children and adults: Results of functional data analysis,” The Journal of the Acoustical Society of America, 2008.
  • [9] L. L. Koenig and J. C. Lucero, “Stop consonant voicing and intraoral pressure contours in women and children,” The Journal of the Acoustical Society of America, 2008.
  • [10] S. Lee, A. Potamianos, and S. Narayanan, “Acoustics of children’s speech: Developmental changes of temporal and spectral parameters,” The Journal of the Acoustical Society of America, 1999.
  • [11] ——, “Analysis of children’s speech: Duration, pitch and formants,” in Fifth European Conference on Speech Communication and Technology, 1997.
  • [12] H. K. Vorperian and R. D. Kent, “Vowel acoustic space development in children: A synthesis of acoustic and anatomic data,” 2007.
  • [13] J. S. Yaruss, R. M. Newman, and T. Flora, “Language and disfluency in nonstuttering children’s conversational speech,” J. Fluency Disord., 1999.
  • [14] T. Tran, M. Tinkler, G. Yeung, A. Alwan, and M. Ostendorf, “Analysis of disfluency in children’s speech,” in Proc. Interspeech, 2020.
  • [15] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
  • [16] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, 2022.
  • [17] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning.   PMLR, 2023.
  • [18] R. Fan and A. Alwan, “Draft: A novel framework to reduce domain shifting in self-supervised learning and its application to children’s asr,” in Proc. Interspeech 2022, 2022.
  • [19] R. Fan, Y. Zhu, J. Wang, and A. Alwan, “Towards better domain adaptation for self-supervised models: A case study of child asr,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, 2022.
  • [20] R. Jain, A. Barcovschi, M. Y. Yiwere, D. Bigioi, P. Corcoran, and H. Cucu, “A wav2vec2-based experimental study on self-supervised learning methods to improve child speech recognition,” IEEE Access, 2023.
  • [21] R. Lahiri, T. Feng, R. Hebbar, C. Lord, S. H. Kim, and S. Narayanan, “Robust self supervised speech embeddings for child-adult classification in interactions involving children with autism,” in Proc. INTERSPEECH 2023, 2023.
  • [22] A. A. Attia, J. Liu, W. Ai, D. Demszky, and C. Espy-Wilson, “Kid-whisper: Towards bridging the performance gap in automatic speech recognition for children vs. adults,” arXiv preprint arXiv:2309.07927, 2023.
  • [23] V. M. Shetty, S. M. Lulich, and A. Alwan, “Developmental articulatory and acoustic features for six to ten year old children,” in Proc. INTERSPEECH 2023, 2023.
  • [24] M. Lavechin, Y. Sy, H. Titeux, M. A. C. Blandón, O. Räsänen, H. Bredin, E. Dupoux, and A. Cristia, “Babyslm: language-acquisition-friendly benchmark of self-supervised spoken language models,” in Proc. INTERSPEECH 2023, 2023.
  • [25] G. Yeung and A. Alwan, “On the difficulties of automatic speech recognition for kindergarten-aged children,” in Interspeech 2018, 2018.
  • [26] X. Chang, B. Yan, Y. Fujita, T. Maekaku, and S. Watanabe, “Exploration of efficient end-to-end asr using discretized input from self-supervised learning,” arXiv preprint arXiv:2305.18108, 2023.
  • [27] X. Chang, B. Yan, K. Choi, J. Jung, Y. Lu, S. Maiti, R. Sharma, J. Shi, J. Tian, S. Watanabe et al., “Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study,” arXiv preprint arXiv:2309.15800, 2023.
  • [28] Y. E. Kheir, H. Mubarak, A. Ali, and S. A. Chowdhury, “Beyond orthography: Automatic recovery of short vowels and dialectal sounds in arabic,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024.
  • [29] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022.
  • [30] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech coding,” Proceedings of the IEEE, 1985.
  • [31] S. Bickel and T. Scheffer, “Multi-view clustering,” in Fourth IEEE International Conference on Data Mining (ICDM’04), 2004.
  • [32] T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
  • [33] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, “E-branchformer: Branchformer with enhanced merging for speech recognition,” 2022.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
  • [35] Y. Peng, S. Dalmia, I. Lane, and S. Watanabe, “Branchformer: Parallel mlp-attention architectures to capture local and global context for speech recognition and understanding,” 2022.
  • [36] S. S. Pradhan, R. A. Cole, and W. H. Ward, “My science tutor (myst) – a large corpus of children’s conversational speech,” 2023.
  • [37] K. Radha and M. Bansal, “Audio augmentation for non-native children’s speech recognition through discriminative learning,” Entropy, 2022.
  • [38] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018.