
Neural Processing Letters (2024) 56:168

https://doi.org/10.1007/s11063-024-11614-z

Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

Jingyu Zhao1 · Ruwei Li1 · Maocun Tian1 · Weidong An1

Accepted: 6 April 2024


© The Author(s) 2024

Abstract
To address the challenges of the poor representation capability and low data utilization rate
of end-to-end speech recognition models in deep learning, this study proposes an end-to-end
speech recognition model based on multi-scale feature fusion and multi-view self-supervised
learning (MM-ASR). It adopts a multi-task learning paradigm for training. The proposed
method emphasizes the importance of inter-layer information within shared encoders, aim-
ing to enhance the model’s characterization capability via the multi-scale feature fusion
module. Moreover, we apply multi-view self-supervised learning to effectively exploit data
information. Our approach is rigorously evaluated on the Aishell-1 dataset and further val-
idated its effectiveness on the English corpus WSJ. The experimental results demonstrate a
noteworthy 4.6% reduction in character error rate, indicating significantly improved speech
recognition performance. These findings showcase the effectiveness and potential of our
proposed MM-ASR model for end-to-end speech recognition tasks.

Keywords End-to-end speech recognition · Multi-scale feature fusion · Multi-view self-supervised learning · Multi-task learning paradigm

1 Introduction

Automatic Speech Recognition (ASR) technology plays a pivotal role in facilitating human-
computer interaction by converting speech signals into text [1]. Indeed, ASR technology
built on deep learning has made significant strides in recent years [2]. However, as people’s
demands for accuracy and robustness in ASR models continue to grow, there are challenges in
meeting these requirements. While the development of hybrid deep neural network models
(DNNs) [3], encompassing acoustic, linguistic, and lexical models, has led to improved

B Ruwei Li
liruwei@bjut.edu.cn
Jingyu Zhao
jingyuzhao@emails.bjut.edu.cn
Maocun Tian
tianmaocun@163.com
Weidong An
anweidong@bjut.edu.cn
1 Faculty of Information Technology, Beijing University of Technology, Beijing, China




accuracy in automatic voice recognition, these models involve multiple modules and a tedious
training procedure. Each module requires independent tuning, which can result in cumulative
errors in the overall model. In response to these challenges, the field of voice recognition
has undergone a noteworthy shift from hybrid models towards end-to-end (E2E) models
[4, 5]. The E2E speech recognition model employs a single network to directly transform
input speech sequences into output token sequences. By merging the acoustic model and
linguistic model from traditional speech recognition into a unified network, the E2E model
effectively simplifies the structure of the speech recognition process. This transition to end-
to-end models brings the advantage of streamlining the ASR model, reducing complexity,
and potentially improving overall performance and robustness. As research continues in this
direction, we can anticipate further advancements in ASR technology, ultimately catering to
the increasing demands of diverse applications and enhancing the quality of life for users.
Currently, there are several research directions in the field of end-to-end speech recog-
nition: connectionist temporal classification (CTC) method [6–8], recurrent neural network
transducers (RNN-T) [9], and attention-based models (AED) [10]. These end-to-end (E2E)
models treat automatic speech recognition (ASR) as a sequence-to-sequence problem, where
a neural network is directly employed to learn the mapping from speech to text. The CTC
method has been extensively researched due to its straightforward modeling process, which
involves only an encoder and outputs each token independently. Its decoding speed is fast,
but its recognition accuracy is often subpar because it assumes conditional independence
between output tokens. The RNN-T model comprises two networks: an encoder
that maps input acoustic frames to a higher level for characterization and a prediction and
union network that forms the decoder [11]. This decoder network utilizes autoregression,
relying on past prediction data. However, RNN-T training can be unstable, requiring more
memory and potentially limiting training speed. Consequently, the resulting model may be
less accurate in recognition and more challenging to deploy in production systems. Many
advanced ASR systems [12] are based on the AED model, which incorporates an encoder
for encoding acoustic data and a decoder for generating the most likely word sequence or
sentence. While this model considers both previously generated tokens and the acoustic con-
text when producing tokens, it can lead to recognition delays. Moreover, the estimated
alignment in the attention-based process is vulnerable to noise corruption, as is common in
real-world speech recognition tasks, resulting in subpar recognition performance for the model.
As ASR technology continues to evolve, researchers are actively exploring ways to enhance
the capabilities of end-to-end speech recognition models, aiming to strike a balance between
accuracy, efficiency, and robustness for practical applications.
The joint CTC-Attention model has emerged as the dominant approach for
end-to-end speech recognition systems [13, 14]. This model utilizes a multi-task learning
framework and is trained with both CTC and Attention model objectives. The architecture
consists of a shared encoder, a CTC layer, and an attention decoder. The shared encoder
employs transformer [15] or conformer [16] blocks to effectively learn local and global
properties of the input speech sequences, enhancing the model’s ability to capture relevant
information. The CTC linear and log-softmax layers use the CTC loss function during train-
ing to optimize the softmax output. The CTC layer operates in streaming mode for the
first channel, allowing for real-time streaming results. The attention decoder, consisting of
transformer blocks, generates improved contextual representations and is utilized for the
second channel during decoding. The attention-based decoder re-scores the N-best candidate
outcomes in a teacher-forcing manner, enabling more precise results during decoding.
The recognized phrases are then reranked based on the scores, further improving recognition
accuracy. Researchers have discovered that the combination of CTC loss and AED leads to


faster training convergence and superior recognition results. As a result, this approach has
become the standard reference scheme for training end-to-end speech recognition models.
However, existing end-to-end speech recognition models face limitations in mining super-
vised information from vast amounts of unsupervised data. They primarily focus on the
output characteristics of the last layer of the encoder and overlook inter-layer information.
This limitation leaves room for improvement in model characterization, data utilization, and
model resilience. Continued research in these areas presents opportunities for advancing end-
to-end speech recognition systems, ultimately leading to more powerful and efficient models
that can better utilize unsupervised data and improve recognition performance in various
applications.
Based on the latest research developments, we propose an innovative end-to-end speech
recognition model that combines multi-scale feature fusion with multi-view self-supervised
learning. The model is trained using a hybrid strategy, incorporating both supervised and
self-supervised training approaches. The primary focus of the model is on leveraging the inter-
layer information of the shared encoder to enhance its characterization capability. By utilizing
the diversity of this information, the model becomes more adept at representing speech data
accurately. Additionally, the model incorporates multi-view self-supervised learning, which
maximizes the utilization of data information and improves the model’s resilience. This is
achieved by creating various shared encoder sub-models, each excluding some information,
and then using multi-view self-supervised learning to effectively exploit the data. The shared
encoder consists of multiple conformer blocks, allowing it to learn both local and global
features of the input speech sequence. The multi-scale feature fusion module (MFF) plays
a crucial role in the model, providing different weights for the output of various conformer
blocks and combining these weights to generate the final output representation. The outputs
of each conformer block are then stitched together to form the overall representation. The
model’s decoding process involves using both the CTC and Attention decoders on the output
representation. To validate the performance of the proposed model, we use WeNet [17, 18],
a speech recognition tool, as the benchmark, and the Aishell-1 [19] dataset for training and
testing. Subsequently, it was further tested on the English corpus WSJ. The experimental
results demonstrate the significant reduction in character error rate and improved speech
recognition performance when compared to the baseline, employing four different decoding
techniques. This confirms the effectiveness and potential of the proposed end-to-end speech
recognition model, showcasing its capability to enhance voice recognition accuracy and
performance.

2 Related Work

Based on different training objectives, SSL methods can be categorized into generative learn-
ing, discriminative learning, and multi-task learning. The research line of generative learning
can be traced back to the auto-encoding model, which reconstructs the entire speech from
continuous [20–22] or discrete [23] latent variables. Recent works propose to predict future
frames from the history with an autoregressive model [24–27], or recover the masked frames
from the corrupted speech with a non-autoregressive model [28–32]. Apart from generative
learning, discriminative learning has also gathered interests recently. The well-known exam-
ples include CPC [33], wav2vec [34], vq-wav2vec [35], wav2vec 2.0 [36], DiscreteBERT
[37], and HuBERT [38]. However, self-supervised paradigms require careful design, and


Fig. 1 MM-ASR model architecture

such representations can be difficult to interpret. There is no guarantee that the model will
learn a "good" speech representation in terms of identifying the most valuable information.
Convolutional neural networks (CNN) have been proven to be a useful model for handling
various visual tasks [39–42]. Despite their great success, CNNs still have their limitations.
They mainly focus on local spatial modeling and lack global context fusion. Models based
on CNNs cannot handle long-range dependencies well. Recently, in the field of speech
processing, ECAPA-TDNN [43] and its follow-up efforts [44, 45] achieved a significant break-
through based on TDNN blocks and the squeeze-and-excitation (SE) [46] layer unified with
Res2Block [47]. They provided an equal error rate of less than 1% on the VoxCeleb 1-O
benchmark test. Among them, MFA-Conformer [48], which is based on multi-scale fea-
ture fusion, has achieved remarkable results in the speaker recognition task. However, the
application of multi-scale feature fusion in speech recognition tasks is still rare.
Inspired by these recent advancements, we propose an innovative end-to-end speech
recognition model that combines multi-scale feature fusion with multi-view self-supervised
learning. The model uses a mixed training strategy that encompasses both supervised and
self-supervised learning methods.

3 The Overall Architecture of MM-ASR

Figure 1 depicts the overall layout of the multi-view self-supervised learning and multi-scale
feature fusion end-to-end speech recognition model developed in this research. The model is
built on a common joint CTC-Attention model with conformer blocks for the shared encoder
and self-supervised loss construction by contrastive learning. It also includes a self-attentive
mechanism for the multi-scale feature fusion module, a CTC layer, and an attention decoder
made up of transformer blocks for the decoder.


Fig. 2 Conformer model structure diagram

3.1 Conformer Structure

The architecture of the network proposed in this study integrates both Convolutional Neural
Networks (CNN) and the Transformer model to extract vocal representations. While CNNs
are known for their effectiveness in extracting local properties, they often fall short in captur-
ing global properties. The self-attention module, on the other hand, is proficient in capturing
long-range global context dependencies, thereby compensating for the CNN’s inability to
capture global features. Hence, the Transformer network is incorporated to tackle this short-
coming. The network configuration of the encoder used in this study is shown in Fig. 2,
composed of N layers of identical Conformer blocks [16].
The network is organized as a stack of four modules, each employing a residual connection
structure [49]. These modules include the feedforward module, the multi-head self-attention
(MHSA) module, the convolutional module, and a second feedforward module. The MHSA
and the convolution module represent the core components of the Conformer block. The
MHSA utilizes the relative position encoding scheme as proposed in the Transformer-XL
model [50], which encodes the input considering the relative position deviation. It takes into
account both the global content offset and the global position offset.
Following the MHSA is the convolutional module, which comprises pointwise convolu-
tion, depthwise convolution, and GLU and Swish activation layers. To assist in learning local
features and facilitate the training of deep learning models, a BatchNorm layer is placed after
the convolutional layer.

Fig. 3 Structure diagram of a shared encoder based on multi-view self-supervised learning

Mathematically, for the input $x_i$ of Conformer block $i$, the output $y_i$ of Conformer block $i$ can be expressed as:

$$\tilde{x}_i = \mathrm{LN}\left(x_i + \tfrac{1}{2}\mathrm{FFN}(x_i)\right) \quad (1)$$
$$x'_i = \mathrm{LN}\left(\tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)\right) \quad (2)$$
$$x''_i = \mathrm{LN}\left(x'_i + \mathrm{Conv}(x'_i)\right) \quad (3)$$
$$y_i = \mathrm{LN}\left(x''_i + \tfrac{1}{2}\mathrm{FFN}(x''_i)\right) \quad (4)$$

where FFN refers to the feed-forward module, MHSA to the multi-headed self-attention module, Conv to the convolution module, and LN to the layer normalization module.
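To make Eqs. (1)-(4) concrete, the following is a minimal PyTorch sketch of one such block. It follows the half-step feed-forward structure described above but simplifies details the paper does not restate here: relative positional encoding is omitted and the layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Minimal sketch of Eqs. (1)-(4): two half-step feed-forward modules
    around multi-head self-attention and a convolution module.
    Relative positional encoding is omitted; sizes are illustrative."""

    def __init__(self, d_model=256, n_heads=4, ff_dim=2048, kernel_size=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, ff_dim), nn.SiLU(),
                                  nn.Linear(ff_dim, d_model))
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ffn2 = nn.Sequential(nn.Linear(d_model, ff_dim), nn.SiLU(),
                                  nn.Linear(ff_dim, d_model))
        self.ln1, self.ln2, self.ln3, self.ln4 = (nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x):                                   # x: (batch, time, d_model)
        x = self.ln1(x + 0.5 * self.ffn1(x))                # Eq. (1)
        x = self.ln2(x + self.mhsa(x, x, x, need_weights=False)[0])          # Eq. (2)
        x = self.ln3(x + self.conv(x.transpose(1, 2)).transpose(1, 2))       # Eq. (3)
        return self.ln4(x + 0.5 * self.ffn2(x))             # Eq. (4)
```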

3.2 Shared Encoder Based on Multi-view self-supervised Learning

Supervised learning is a deep learning approach that identifies a functional relationship between input and output by categorizing or regressing labeled data. However, it cannot fully
exploit the data as it only learns from labeled data. In contrast, self-supervised learning is a
potent technique for extracting applicable and generalizable latent representations from large
volumes of unlabeled data. This approach is commonly employed in sequence-to-sequence
(seq2seq) model pre-training and in facilitating downstream tasks [51–53]. Through auxiliary
or pretext tasks, the network is trained to acquire representations that are beneficial for
downstream tasks, mining its supervised knowledge from large-scale unsupervised data.
Based on the above analysis, this study designs a shared encoder leveraging multi-view
self-supervised learning. Figure 3 illustrates the network structure of this encoder. The green
section in Fig. 3 denotes the encoder employing N layers of identical Conformer blocks to
more efficiently capture speech features. The units that are randomly dropped during the
training phase are depicted in the blue portion of the multi-view self-supervised learning
slab. The self-supervised learning slab employs the dropout regularization technique [54] to
construct two distinct encoder views, thereby reducing the model’s generalization error. The
dropout algorithm specifically randomly discards some units in each layer of the neural net-
work to prevent co-adaptation and overfitting. This study uses a self-supervised approach to
regularize the output prediction of the sub-model, leveraging the structural randomness intro-
duced by the dropout process. The outputs of the encoder views are compared to extract more


Fig. 4 MFF structure diagram

reliable characterization information. To better exploit the data and enhance the robustness
of the model, the supervised loss is coupled with the self-supervised contrastive loss.
Given the shared encoder input data $x_i$, $x_i$ is fed through the network's forward pass twice during each training cycle. As a result, two distributions of the shared encoder output are obtained, denoted $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$. $D_{KL}(P_1(y_i \mid x_i) \,\|\, P_2(y_i \mid x_i))$ gives the Kullback-Leibler (KL) divergence between $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$. Since the dropout operation randomly discards units in the shared encoder, the two forward passes are carried out through two different views of the same encoder, as indicated previously. The self-supervised method used in this study then regularizes the model predictions during training by minimizing the bidirectional KL divergence between $P_1(y_i \mid x_i)$ and $P_2(y_i \mid x_i)$ of the same batch:

$$\mathcal{L}_{KL} = \frac{1}{2}\left(D_{KL}\big(P_1(y_i \mid x_i) \,\|\, P_2(y_i \mid x_i)\big) + D_{KL}\big(P_2(y_i \mid x_i) \,\|\, P_1(y_i \mid x_i)\big)\right) \quad (5)$$
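As an illustration of Eq. (5), the sketch below performs two stochastic forward passes through the same encoder with dropout active, so that each pass corresponds to a different sub-model view, and returns the symmetric KL term. The encoder interface (log-probability outputs) is an assumption, and this is not the authors' code.

```python
import torch
import torch.nn.functional as F

def multi_view_kl_loss(encoder, x):
    """Two stochastic forward passes through the same encoder (dropout on),
    followed by the symmetric KL divergence of Eq. (5)."""
    logits1 = encoder(x)          # first view:  (batch, time, vocab)
    logits2 = encoder(x)          # second view: different dropout mask
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    # F.kl_div takes log-probabilities for both arguments when log_target=True;
    # summing both directions gives the symmetric term of Eq. (5).
    kl_a = F.kl_div(p1, p2, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(p2, p1, log_target=True, reduction="batchmean")
    return 0.5 * (kl_a + kl_b)
```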

3.3 Multi-scale Feature Fusion Module

In existing speech recognition models, the diversity of information between different layers
is often overlooked, limiting their ability to represent the data. When the final speech repre-
sentation is extracted by the encoder, they only pass the features output from the last layer to
the decoder. This study proposes an attention-based multi-scale feature fusion module (MFF)
to address this issue by maximizing the utilization of inter-layer information to enhance the
model representation information capabilities.
Based on the analysis, the scale information is extracted by each conformer block of
each layer in the shared encoder, and there is a reliance between the scale information
of the different layers. In this work, we explicitly model the dependencies between each
conformer block using the proposed multi-scale feature fusion module. After learning these
dependencies, we sum the output of each conformer block and use the scale information
extracted from each layer to form N-dimensional features. This process results in acoustic
features with stronger characterization information. The structure of this module is depicted
in Fig. 4.
The implementation process of the multi-scale feature fusion module involves the fol-
lowing steps: the output from each conformer block is first combined into $X \in \mathbb{R}^{C \times H \times W}$.
After $X$ is reshaped into the matrix $A \in \mathbb{R}^{C \times N}$ and subjected to transposition, matrix
multiplication, and softmax operations, the attention map $V \in \mathbb{R}^{C \times C}$ is produced:

$$v_{ji} = \frac{\exp(a_i \cdot a_j)}{\sum_{i=1}^{C} \exp(a_i \cdot a_j)} \quad (6)$$


where $v_{ji}$ indicates the impact of the $i$-th conformer block on the $j$-th conformer block. The
output of dimension $(C \times N)$ is then reshaped into $(C \times H \times W)$ by performing matrix
multiplication of the attention map $V$ with matrix $A$. After learning the dependency relationship,
the result is multiplied by the scale factor $\beta$, and an element-wise summation with $X$ generates
the output of each conformer block $Y \in \mathbb{R}^{C \times H \times W}$:

$$y_j = \beta \sum_{i=1}^{C} v_{ji} a_i + a_j \quad (7)$$

where $\beta$ is initialized to 0 and gradually learns to assign larger weights. Equation (7) describes
the process of the multi-scale feature fusion module: the weighted sum of all conformer block
output features and the original output features of each conformer block represents the resultant
features of that block after learning the dependencies.
This module models the dependencies between different conformer blocks, which helps to
obtain more robust speech representations. The final acoustic representation, provided to the
decoder, is generated by aggregating the outputs of each conformer block after learning their
dependencies. Through this process of weighted summation and integration of information
from multiple conformer blocks, the end-to-end speech recognition model is empowered to
effectively represent and comprehend complex speech patterns. This enhances the model’s
overall capability to achieve accurate and robust speech recognition.

$$y_c = \sum_{j=1}^{C} y_j \quad (8)$$

where yc is the final output of the multi-scale feature fusion module.
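The following sketch illustrates Eqs. (6)-(8): the per-block outputs are stacked, flattened into the $C \times N$ matrix $A$, attended over to obtain $V$, scaled by the learnable $\beta$, and summed. Tensor shapes and names are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusionSketch(nn.Module):
    """Attention-based fusion over conformer-block outputs, Eqs. (6)-(8)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # scale factor, initialised to 0

    def forward(self, block_outputs):
        # block_outputs: list of C tensors, each of shape (batch, time, d_model)
        A = torch.stack(block_outputs, dim=1)      # (batch, C, time, d_model)
        B, C, T, D = A.shape
        A_flat = A.reshape(B, C, T * D)            # matrix A in R^{C x N}
        energy = torch.bmm(A_flat, A_flat.transpose(1, 2))   # a_j . a_i, (batch, C, C)
        V = torch.softmax(energy, dim=-1)          # attention map of Eq. (6)
        Y = self.beta * torch.bmm(V, A_flat).reshape(B, C, T, D) + A   # Eq. (7)
        return Y.sum(dim=1)                        # Eq. (8): (batch, time, d_model)
```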

3.4 Decoder

The Connectionist Temporal Classification (CTC) method, developed by Graves et al. [6], is
a technique primarily used to address the problem of output alignment between labels and
neural network predictions.
To determine the likelihood of the CTC target sequence, the CTC model takes into account
all feasible alignment routes between the target sequence y and the input sequence x. This
likelihood is specified as:

$$P(y \mid x) = \sum_{q \in \beta^{-1}(y)} P(q \mid x) \quad (9)$$

where $q$ is one of the alignment paths, and $\beta^{-1}(y)$ is the set of all paths that map from the
input sequence to the output label. Equation (10) illustrates the definition of the CTC loss
function as the sum of the negative log probability of obtaining the appropriate label during
training.
$$\mathcal{L}_{CTC} = -\ln P(y \mid x) \quad (10)$$
Therefore, the CTC method significantly simplifies the training and modeling processes
for speech recognition models. In this study, we use the CTC model as one of the decoders.
The CTC model's architecture comprises linear and log-softmax layers. During the training
phase, we apply the CTC loss function to the softmax output, which is computed from the
output of the shared encoder after it passes through the MFF module.
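A minimal sketch of this CTC branch using torch.nn.CTCLoss is given below; the vocabulary size and blank index are placeholders, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class CTCDecoderSketch(nn.Module):
    """Linear + log-softmax projection of the fused encoder output,
    scored with the standard CTC loss of Eq. (10)."""

    def __init__(self, d_model=256, vocab_size=4000, blank=0):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)

    def forward(self, enc_out, enc_lens, targets, target_lens):
        # enc_out: (batch, time, d_model) from the MFF module
        log_probs = self.proj(enc_out).log_softmax(dim=-1)
        # nn.CTCLoss expects (time, batch, vocab)
        return self.ctc(log_probs.transpose(0, 1), targets, enc_lens, target_lens)
```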


The attention decoder in this paper is made up of several identical Transformer blocks, in
which a Multi-Head Cross-Attention (MHCA) module is added alongside the feed-forward and
self-attention modules in order to perform multi-head attention over the output of the shared
encoder after it passes through the MFF module. The attention decoder uses relative
position encoding to be consistent with the shared encoder. Mathematically, the output $y_i$ for
input $x_i$ of transformer block $i$ in the attention decoder can be written as follows:

$$\tilde{x}_i = \mathrm{LN}\left(x_i + \mathrm{MHSA}(x_i)\right) \quad (11)$$
$$x'_i = \mathrm{LN}\left(\tilde{x}_i + \mathrm{MHCA}(\tilde{x}_i, \tilde{y})\right) \quad (12)$$
$$y_i = \mathrm{LN}\left(x'_i + \mathrm{FFN}(x'_i)\right) \quad (13)$$

where $\tilde{y}$ denotes the shared encoder output after the MFF module, MHSA the multi-headed
self-attention module, MHCA the multi-headed cross-attention module, LN the layer
normalization module, and FFN the feed-forward module.
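A minimal sketch of one such decoder block, corresponding to Eqs. (11)-(13), is shown below; relative positional encoding and attention masking are omitted for brevity, and the module sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Transformer decoder block of Eqs. (11)-(13): self-attention,
    cross-attention over the fused encoder output, then a feed-forward layer."""

    def __init__(self, d_model=256, n_heads=4, ff_dim=2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mhca = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):
        # x: (batch, dec_len, d_model); enc_out: MFF output, (batch, enc_len, d_model)
        x = self.ln1(x + self.mhsa(x, x, x, need_weights=False)[0])              # Eq. (11)
        x = self.ln2(x + self.mhca(x, enc_out, enc_out, need_weights=False)[0])  # Eq. (12)
        return self.ln3(x + self.ffn(x))                                         # Eq. (13)
```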

3.5 Multi-task Learning Paradigm

The model proposed in this study employs two supervised losses, namely the Connectionist
Temporal Classification (CTC) loss and the Attention-based Encoder-Decoder (AED) loss,
in addition to a self-supervised comparison loss. The training process follows a hybrid end-
to-end approach that combines both supervised and self-supervised training methods. By
integrating both CTC and AED losses into one of the supervised losses, the model bene-
fits from improved convergence while fully capturing token dependencies within the data.
Equations (14) and (15) define the joint supervised and self-supervised losses, where $x$ is the
acoustic feature and $y$ is the corresponding label. The CTC decoder and attention decoder
losses are denoted by $\mathcal{L}_{CTC}(x, y)$ and $\mathcal{L}_{AED}(x, y)$, and $\lambda \in (0, 1)$ is the
hyperparameter that balances the weights of the two supervised losses, while $\mu$ weighs the
relative significance of the supervised and self-supervised losses.

$$\mathcal{L}_S = \lambda \mathcal{L}_{CTC}(x, y) + (1 - \lambda)\mathcal{L}_{AED}(x, y) \quad (14)$$

$$\mathcal{L} = \lambda \mathcal{L}_S + \mu \mathcal{L}_{KL} \quad (15)$$
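The sketch below assembles the objective exactly as written in Eqs. (14)-(15). The default weights follow values reported later in the paper (CTC weight 0.3 during training, μ = 0.05), but they are passed in as parameters rather than fixed, and the individual loss terms are placeholders for the CTC, attention, and KL losses defined above.

```python
def joint_loss(l_ctc, l_aed, l_kl, lam=0.3, mu=0.05):
    """Eq. (14): supervised loss; Eq. (15): total loss with the
    self-supervised KL term weighted by mu (as written in the paper)."""
    l_sup = lam * l_ctc + (1.0 - lam) * l_aed   # Eq. (14)
    return lam * l_sup + mu * l_kl              # Eq. (15)
```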

3.6 Analysis

Compared with supervised learning, self-supervised learning methods attempt to learn pow-
erful contextual representations from audio data only, and then fine-tune the model on paired
data. Currently, there are some pre-trained models that achieve excellent performance, but
these require a large amount of external data and model parameters for training. Moreover,
these models mainly address general representations for speech tasks. Specifically, models
such as CPC and the wav2vec series use contrastive InfoNCE loss to distinguish between
related positive samples and negative samples. Inspired by masked language model loss in
NLP, DiscreteBERT and HuBERT predict discrete targets in masked regions. However, our
method focuses on an end-to-end ASR model that requires only a small amount of labeled data
for training and achieves excellent performance through the proposed multi-view contrastive
self-supervised approach.
The multi-scale feature fusion network structure is relatively flexible and there is no clear
boundary. The receptive field of the high-level network is relatively large, and the semantic
information representation ability is strong, but the resolution of the feature map is low, and


Table 1 Aishell-1 speech corpus composition structure

Aishell-1         Number of participants (persons)    Number of audio clips
Training set      340                                 120,098
Validation set    40                                  14,326
Test set          20                                  7,176
Total             400                                 141,600

the geometric information representation ability is weak. The receptive field of the low-level
network is relatively small, and the geometric detail information representation ability is
strong. Although the resolution is high, the semantic information representation ability is
weak. The multi-scale feature fusion network makes the model easier to achieve significant
results on complex tasks by fusing deep and shallow layer features. The latest research has
demonstrated the potential of voice models on full-stack voice tasks by using the weighted
sum of embeddings from different layers. They found that different layers contain useful
information for different tasks. For example, the top hidden states are useful for ASR, while
the bottom layers are more effective for speaker verification. Therefore, this study proposes
an attention-based multi-scale feature fusion module (MFF) to enhance the model’s ability
to represent information by maximizing inter-layer information utilization.

4 Performance Testing and Analysis

We first demonstrate our results on the Aishell-1 test dataset to gain a deeper understanding of
our method. Subsequently, we further validate the effectiveness of the method on the English
corpus WSJ (80-h). To evaluate the effectiveness of the multi-scale feature fusion method
and the multi-view self-supervised learning module, we conducted ablation experiments to
compare the differences. The performance of the model is evaluated based on the character
error rate (CER).

4.1 Dataset

The Aishell company provides the Aishell-1 dataset, an open-source speech dataset that
resamples high-fidelity microphone audio data to 16 kHz, 16-bit WAV format. The dataset
consists of speech data from 400 speakers, representing diverse dialect regions in China,
and covers a wide range of topics such as technology, sports, entertainment, current news,
finance, and economics. The Aishell-1 dataset is divided into three sets: a training set with
340 speakers, containing 150 h of speech data, a validation set with 40 speakers, comprising
10 h of speech data, and a test set with 20 speakers, containing 5 h of speech data. In total, the
dataset contains 165 h of speech data. The composition of the Aishell-1 dataset is detailed in
Table 1. The test set consists of 7176 speech samples. For this project, the Aishell-1 dataset
was utilized for both training and testing the proposed speech recognition model.


4.2 Experimental Setup

The test configuration for this experiment includes an AMD R9-3090X processor, 32 GB of
RAM, and an NVIDIA RTX-3090 GPU graphics card. The software environment is a 64-bit
Ubuntu 20.04 operating system running the Pytorch deep learning framework.
The input features consist of an 80-dimensional log-Mel filter bank (Fbank) with a 25-ms
window and a 10-ms shift. We perform speed perturbation on the entire data at 0.9, 1.0, and
1.1 speeds to generate a 3× speed variation. SpecAugment is applied with 2 frequency masks
(maximum frequency mask width F = 10) and 2 time masks (maximum time mask width
T = 50).
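A sketch of this front end using torchaudio is shown below: an 80-dimensional Fbank with a 25-ms window and 10-ms shift, followed by two frequency masks (F = 10) and two time masks (T = 50) during training. It illustrates the stated configuration and is not the WeNet data pipeline itself.

```python
import torch
import torchaudio

def extract_features(waveform, sample_rate=16000, train=True):
    """80-dim log-Mel Fbank (25 ms window, 10 ms shift) with 2 frequency
    masks (F = 10) and 2 time masks (T = 50) applied during training."""
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=sample_rate)                  # (num_frames, 80)
    if train:
        spec = fbank.t().unsqueeze(0)                  # (1, freq, time) for the masks
        freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)
        time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)
        for _ in range(2):                             # 2 frequency and 2 time masks
            spec = freq_mask(spec)
            spec = time_mask(spec)
        fbank = spec.squeeze(0).t()
    return fbank
```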
To reduce the computational burden, a two-dimensional convolutional down-sampling
technique is employed at the front end of the shared encoder. The kernel size is 3×3 and the
stride is 2, giving an overall subsampling factor of 4. The shared encoder comprises
12 conformer blocks with four multi-headed attentions, each using 256 attention dimensions
and 2048 feedforward dimensions, consistent with the baseline model. The attention decoder
includes six transformer blocks with four multi-headed attentions. During joint training and
decoding, the weights of the CTC branches are set to 0.3 and 0.5, respectively. Gradient
accumulation is used during training to stabilize the process, with gradients updated every 4
batches [55]. To prevent overfitting, dropout operations and label smoothing regularization
are applied to each conformer and transformer block. The Adam optimizer is used for training,
with a learning rate schedule of 25,000 warm-up steps and an initial learning rate of 0.002.
Additionally, we conducted experiments with the hyperparameter μ set to 0,
0.01, 0.05, 0.1, 1, and 10, and with the number of MFF fusion layers set to 2, 3, 4, and 12.
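For the optimizer settings above, a sketch of Adam with a transformer-style warm-up over 25,000 steps and an initial learning rate of 0.002 is given below; the exact decay schedule used in the experiments is not restated here, so the inverse-square-root decay after warm-up is an assumption.

```python
import torch

def build_optimizer(model, lr=0.002, warmup_steps=25000):
    """Adam with a warm-up schedule: the learning rate rises for `warmup_steps`
    steps and then decays with the inverse square root of the step count
    (an assumption about the exact schedule)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def lr_lambda(step):
        step = max(step, 1)
        return warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```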

4.3 Evaluation Metrics

In automatic speech recognition, the results are usually presented as a list of words and
phrases. During this process, three types of errors can occur: insertion, deletion, and substi-
tution errors. Insertion errors involve adding an extra word to the recognition result; deletion
errors occur when a correct word is missing from the recognition result; and substitution
errors replace a correct word in the recognition result with an incorrect word. In English,
the recognition accuracy is typically measured in words, and the error rate is referred to as
the Word Error Rate (WER); other alphabetic languages such as Russian likewise use WER
as the evaluation metric. However, in languages like Chinese, word
ambiguity is a challenge, making it difficult to directly measure errors in words. Therefore,
the Character Error Rate (CER) is commonly used as the evaluation index for Chinese speech
recognition, and similar languages like Japanese also employ CER. As the Chinese speech
dataset Aishell-1 is employed in this experiment, CER is used as the evaluation index, and
its formula is as follows:

$$\mathrm{CER} = \frac{N_{Del} + N_{Sub} + N_{Ins}}{N_{Ref}} \quad (16)$$

where $N_{Sub}$ is the number of words with a substitution error; $N_{Ins}$ is the number of words
with an insertion error; $N_{Del}$ is the number of words for which the recognition result has a
deletion error compared to the reference annotation; and $N_{Ref}$ is the total number of words
in the test set. Because of insertion errors, the CER can exceed 100%, with a minimum of 0.
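A minimal sketch of Eq. (16) computed with a standard dynamic-programming edit distance over characters is given below; it is a generic implementation rather than the scoring script used in the experiments.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Eq. (16): CER = (deletions + substitutions + insertions) / len(reference),
    computed with a Levenshtein edit distance over characters."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j]: minimum edits to turn the first i reference chars
    # into the first j hypothesis chars.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

# Example: one substituted character out of four reference characters gives CER = 0.25.
```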


4.4 Performance Testing and Analysis

The experiments for the multi-scale feature fusion module aim to investigate the impact of
fusing the output data from different numbers of conformer blocks on the model’s recognition
performance. The experimental results are summarized in Table 2 as follows: B6+B12 in the
shared encoder correspond to fusing the output data from the sixth and twelfth conformer
blocks. B4+B8+B12 in the shared encoder indicate the fusion of the output data from the
fourth, eighth, and twelfth conformer blocks. B3+B6+B9+B12 in the shared encoder repre-
sent the fusion of the output data from the third, sixth, ninth, and twelfth conformer blocks.
All blocks, as proposed in this work, symbolize the fusion of the output data from every
conformer block in the shared encoder. The ablation experiment only focuses on the MFF
module without the addition of SSL. The results clearly demonstrate that the recognition
performance is positively influenced by the number of fused blocks: as the number of fused
blocks increases, the recognition performance also improves. Specifically, the
performance of models with two, three, or four blocks fused is inferior to that of the model
with all blocks fused, confirming the importance of incorporating the output data from all
conformer blocks for improved recognition performance.
In this study, experiments are carried out for the multi-view self-supervised learning
module to examine the impact of the hyperparameter μ on the model recognition performance.
The experimental results are displayed in Table 3. When μ = 0.05, the self-supervised loss
and supervised loss are balanced to obtain the best performance, which implies that it is
crucial to balance the self-supervised loss and supervised loss in joint training.
In this study, ablation experiments are conducted to demonstrate the effectiveness of the
MM-ASR model’s multi-scale feature fusion module and multi-view self-supervised learn-
ing method. The experimental results are displayed in Table 4. The baseline model is the
original WeNet model, with the decoder trained in supervised learning mode using features
from the network’s final layer. The MM-ASR model, proposed in this paper, incorporates
both the multi-scale feature fusion module and multi-view self-supervised learning method.
Two additional variants are also evaluated: -SSL, which is the MM-ASR model with the
multi-view self-supervised learning method eliminated, and -MFF, which is the MM-ASR
model with the multi-scale feature fusion module removed. The experimental results demon-
strate the efficacy of both multi-scale feature fusion and multi-view self-supervised learning.
The MM-ASR model, which combines supervised and self-supervised losses for training and
focuses on interlayer information, exhibits improved model resilience and achieves a lower
character error rate (CER) compared to the original WeNet model. The proposed approach
leads to a significant enhancement in voice recognition ability, reducing the character error
rate by approximately 4.6% when compared to the baseline. This demonstrates the effective-
ness of the multi-scale feature fusion and multi-view self-supervised learning techniques in
improving the performance of the end-to-end speech recognition model.
Table 5 presents a comparison of the Character Error Rate (CER) results between the
MM-ASR model proposed in this study and several widely available models on the Aishell-
1 test dataset. The models used for comparison include CTC/Attention [56], CAT [57],
ESPnet [58], BAT [59], Paraformer [60], UMA [61] and WeNet [17, 18]. All assessment
results in the paper are rounded to two decimal places for consistency. The findings in
Table 5 demonstrate that the MM-ASR model outperforms the other models, indicating its
superior performance in terms of speech recognition accuracy. This clearly demonstrates the
effectiveness of multi-scale feature fusion and self-supervised learning within a single neural
network. The experimental outcomes provide strong evidence supporting the effectiveness
and usefulness of the proposed MM-ASR model for end-to-end speech recognition tasks,

Table 2 Experimental results of different number of blocks (CER%)

Number of blocks    Attention %    Attention_rescoring %    CTC_greedy_search %    CTC_prefix_beam_search %
No MFF              5.10           4.74                     5.22                   5.22
B6+B12              5.08           4.72                     5.20                   5.20
B4+B8+B12           5.06           4.66                     5.16                   5.16
B3+B6+B9+B12        5.03           4.64                     5.11                   5.11
All blocks          4.92           4.60                     4.99                   4.99

Table 3 Weight sensitivity study on μ

Method      Attention %    Attention_rescoring %    ctc_greedy_search %    ctc_prefix_beam_search %
μ = 0       5.10           4.74                     5.22                   5.22
μ = 0.01    5.08           4.69                     5.18                   5.18
μ = 0.05    5.05           4.66                     5.10                   5.10
μ = 0.1     5.09           4.72                     5.20                   5.20
μ = 1       5.34           4.88                     5.43                   5.43
μ = 10      5.64           5.15                     5.77                   5.77

Table 4 Ablation study of the MM-ASR (CER%)

Method      Attention %    Attention_rescoring %    ctc_greedy_search %    ctc_prefix_beam_search %
Baseline    5.10           4.74                     5.22                   5.22
MM-ASR      4.85           4.52                     4.92                   4.92
-SSL        4.92           4.60                     4.99                   4.99
-MFF        5.05           4.66                     5.10                   5.10

Table 5 Experimental results on the Aishell-1 test dataset (CER%)

Method           CER %
CTC/Attention    6.70
CAT              6.34
ESPnet           4.90
WeNet            4.74
BAT              4.97
Paraformer       4.95
UMA              4.70
MM-ASR           4.52

Table 6 Experimental results on the WSJ dataset (CER%)

Method            CER %
CTC/Attention     6.80
CAT               5.70
ESPnet            12.40
LF-MMI            6.00
CTC-CRF ST-NAS    5.68
Wav2letter++      7.50
WeNet             5.11
MM-ASR            5.03

confirming its superiority compared to publicly available models like CTC/Attention, CAT,
ESPnet, BAT, Paraformer, UMA and WeNet.
Table 6 shows a comparison of character error rate (CER) results between the MM-ASR
model proposed in this study and several widely available models on the English corpus WSJ
(80-h). The models used for comparison include CTC/attention, CAT, ESPnet, LF-MMI [62],
CTC-CRF ST-NAS [63], Wav2letter++ [64], and WeNet. The results in Table 6 demonstrate
that on the English corpus WSJ, the MM-ASR model still outperforms other models.

5 Conclusion

In this paper, a combination of supervised and self-supervised training techniques is leveraged to construct and train an end-to-end speech recognition model based on multi-scale feature
fusion and multi-view self-supervised learning. The proposed method emphasizes the use
of inter-layer information in a shared encoder to improve the model’s ability to represent
and process speech data. A self-supervised contrastive loss is introduced in the shared encoder
section to increase the model's robustness, and the model is trained by combining supervised
and self-supervised loss techniques. Additionally, ablation experiments on the multi-view
self-supervised learning component and the multi-scale feature fusion module are carried out
to show their respective contributions to recognition performance. Experiments are also
conducted to determine the impact of fusing different numbers of conformer blocks and of
the hyperparameter μ that balances the self-supervised and supervised losses on recognition
performance. The Aishell-1 dataset is used in this study to assess the
suggested technique. We further validate the effectiveness of this method on the English


corpus WSJ. The experimental findings demonstrate that the proposed strategy enhances the
recognition performance of the speech recognition model.
Author Contributions The contributions of the authors are as follows: Corresponding author RL designed the
algorithm together with JZ, and provided the experimental equipment as well as improved the writing and
logic of the paper. JZ verified the algorithm experimentally and wrote the first draft of the paper. MT and
WA organized the experimental data, visualized the experimental results, and assisted JZ to complete
the experiments. This paper was co-authored by all authors, who have read and approved the final manuscript.

Availability of data and materials The data that support the findings of this study are available from the
corresponding author, upon reasonable request.

Declarations

Conflict of interest We the authors of this manuscript entitled “Multi-view self-supervised learning and multi-
scale feature fusion for automatic speech recognition” declare that we have no known competing financial
interests or personal relationships that could have appeared to influence the work reported in this paper.

Informed consent The submission of this article has been approved by all authors, and the data used in this
article has been agreed by the relevant authorities and does not raise issues such as privacy and information
security.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included in the
article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is
not included in the article’s Creative Commons licence and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Seltzer ML, Ju Y-C, Tashev I, Wang Y-Y, Yu D (2011) In-car media search. IEEE Signal Process Mag
28(4):50–60. https://doi.org/10.1109/MSP.2011.941065
2. Graves A, Mohamed A-R, Hinton G (2013) Speech recognition with deep recurrent neural networks. In:
2013 IEEE international conference on acoustics, speech and signal processing, Vancouver, BC, Canada,
pp 6645-6649, https://doi.org/10.1109/ICASSP.2013.6638947
3. Hinton G et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared
views of four research groups. IEEE Signal Process Mag 29(6):82–97. https://doi.org/10.1109/MSP.
2012.2205597
4. Wang D, Wang X, Lv S (2019) An overview of end-to-end automatic speech recognition. Symmetry
11(8):1018
5. Li J (2022) Recent advances in end-to-end automatic speech recognition. APSIPA Trans Signal Inf Process
11(1)
6. Graves A, Fernández S, Gomez F, et al (2006) Connectionist temporal classification: labelling unseg-
mented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference
on Machine learning. pp 369-376
7. Deng K, et al (2022) Improving CTC-Based Speech Recognition Via Knowledge Transferring from
Pre-Trained Language Models. In: ICASSP 2022–2022 IEEE international conference on acoustics,
speech and signal processing (ICASSP), Singapore, Singapore, pp 8517-8521, https://doi.org/10.1109/
ICASSP43922.2022.9747887
8. Nakagome Y, Komatsu T, Fujita Y, et al (2022) InterAug: augmenting noisy intermediate predictions for
CTC-based ASR. arXiv preprint arXiv:2204.00174
9. Graves A (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711


10. Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using
multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing
(ICASSP), New Orleans, LA, USA, pp 4835-4839, https://doi.org/10.1109/ICASSP.2017.7953075
11. Rao K, Sak H, Prabhavalkar R (2017) Exploring architectures, data and units for streaming end-
to-end speech recognition with RNN-transducer. In: 2017 IEEE automatic speech recognition and
understanding workshop (ASRU). Okinawa, Japan, pp 193–199. https://doi.org/10.1109/ASRU.2017.
8268935
12. Karita S, Soplin NEY, Watanabe S et al (2019) Improving transformer-based end-to-end speech recog-
nition with connectionist temporal classification and language model integration[C]//Proceedings of
the Annual Conference of the International Speech Communication Association. INTERSPEECH.
2019:1408–1412
13. Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using
multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing
(ICASSP), New Orleans, LA, USA, pp 4835-4839, https://doi.org/10.1109/ICASSP.2017.7953075
14. Zhang B, Wu D, Yao Z, et al (2020) Unified streaming and non-streaming two-pass end-to-end model for
speech recognition. arXiv preprint arXiv:2012.05481
15. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
16. Gulati A, Qin J, Chiu CC, et al (2020) Conformer: convolution-augmented transformer for speech
recognition. arXiv preprint arXiv:2005.08100
17. Yao Z, Wu D, Wang X, et al (2021) Wenet: production oriented streaming and non-streaming end-to-end
speech recognition toolkit. arXiv preprint arXiv:2102.01547
18. Zhang B, Wu D, Peng Z, et al (2022) Wenet 2.0: more productive end-to-end speech recognition toolkit.
arXiv preprint arXiv:2203.15455
19. Bu H, Du J, Na X, Wu B, Zheng H (2017) AISHELL-1: An open-source Mandarin speech corpus and a
speech recognition baseline. In: 20th conference of the oriental chapter of the international coordinating
committee on speech databases and speech I/O systems and assessment (O-COCOSDA), Seoul, Korea
(South), pp 1–5. https://doi.org/10.1109/ICSDA.2017.8384449
20. Chen Y-C, Huang S-F, Lee H-y, Wang Y-H, Shen C-H (2019) Audio word2vec: sequence-to-sequence
autoencoding for unsupervised learning of audio segmentation and representation. IEEE/ACM Trans
Audio Speech Lang Process (TASLP) 27(9):1481–1493
21. Hsu W-N, Zhang Y, Glass J (2017) Learning latent representations for speech generation and
transformation. In: Interspeech, pp 1273–1277
22. Hsu W N, Zhang Y, Glass J (2017) Unsupervised learning of disentangled and interpretable representations
from sequential data. Adv Neural Inf Process Syst 30
23. Chorowski J, Weiss RJ, Bengio S et al (2019) Unsupervised speech representation learning using wavenet
autoencoders. IEEE/ACM Trans Audio Speech Lang Process 27(12):2041–2053
24. Chung Y A, Tang H, Glass J (2020) Vector-quantized autoregressive predictive coding. arXiv preprint
arXiv:2005.08392
25. Chung Y A, Hsu W N, Tang H, et al (2019) An unsupervised autoregressive model for speech
representation learning. arXiv preprint arXiv:1904.03240
26. Chung Y A, Glass J (2020) Generative pre-training for speech with autoregressive predictive coding. In:
ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP).
IEEE pp 3497-3501
27. Chung Y A, Glass J (2020) Improved speech representations with multi-target autoregressive predictive
coding. arXiv preprint arXiv:2004.05274
28. Liu A H, Chung Y A, Glass J (2020) Non-autoregressive predictive coding for learning speech
representations from local dependencies. arXiv preprint arXiv:2011.00406
29. Liu AT, Li SW, Lee H (2021) Tera: Self-supervised learning of transformer encoder representation for
speech. IEEE/ACM Trans Audio Speech Lang Process 29:2351–2366
30. Liu A T, Yang S, Chi P H, et al (2020) Mockingjay: unsupervised speech representation learning with deep
bidirectional transformer encoders. In: ICASSP 2020-2020 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, pp 6419–6423
31. Ling S, Liu Y, Salazar J, et al (2020) Deep contextualized acoustic representations for semi-supervised
speech recognition. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, pp 6429–6433
32. Ling S, Liu Y (2020) Decoar 2.0: deep contextualized acoustic representations with vector quantization.
arXiv preprint arXiv:2012.06659
33. Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint
arXiv:1807.03748


34. Schneider S, Baevski A, Collobert R, et al (2019) wav2vec: Unsupervised pre-training for speech
recognition. arXiv preprint arXiv:1904.05862
35. Baevski A, Schneider S, Auli M (2019) vq-wav2vec: self-supervised learning of discrete speech
representations. arXiv preprint arXiv:1910.05453
36. Baevski A, Zhou Y, Mohamed A et al (2020) wav2vec 2.0: a framework for self-supervised learning of
speech representations. Adv Neural Inf Process Syst 33:12449–12460
37. Baevski A, Mohamed A (2020) Effectiveness of self-supervised pre-training for ASR. In: ICASSP 2020-
2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp
7694–7698
38. Hsu WN, Bolte B, Tsai YHH et al (2021) Hubert: Self-supervised speech representation learning by
masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Process 29:3451–3460
39. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural
networks. Adv Neural Inf Process Syst 25
40. Toshev A, Szegedy C (2014) Deeppose: human pose estimation via deep neural networks. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 1653–1660
41. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
42. Ren S, He K, Girshick R et al (2015) Faster r-cnn: towards real-time object detection with region proposal
networks. Adv Neural Inf Process Syst 28:91–99
43. Desplanques B, Thienpondt J, Demuynck K (2020) Ecapatdnn: Emphasized channel attention, propaga-
tion and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143
44. Thienpondt J, Desplanques B, Demuynck K (2021) Integrating frequency translational invariance in
tdnns and frequency positional information in 2d resnets to enhance speaker verification. arXiv preprint
arXiv:2104.02370
45. Liu T, Das R K, Lee K A, et al (2022) MFA: TDNN with multi-scale frequency-channel attention for
text-independent speaker verification with short utterances. In: ICASSP 2022-2022 IEEE international
conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7517–7521
46. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp7132–7141
47. Gao SH, Cheng MM, Zhao K et al (2019) Res2net: a new multi-scale backbone architecture. IEEE Trans
Pattern Anal Mach Intell 43(2):652–662
48. Zhang Y, Lv Z, Wu H, et al (2022) Mfa-conformer: multi-scale feature aggregation conformer for
automatic speaker verification. arXiv preprint arXiv:2203.15249
49. He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp 770–778
50. Dai Z, Yang Z, Yang Y, et al (2019) Transformer-xl: attentive language models beyond a fixed-length
context. arXiv preprint arXiv:1901.02860
51. Devlin J, Chang M W, Lee K, et al (2018) Bert: pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805
52. Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint
arXiv:1807.03748
53. Chen Z, Zhang Y, Rosenberg A, Ramabhadran B, Wang G, Moreno P (2021) Injecting text in self-
supervised speech pretraining. In: IEEE automatic speech recognition and understanding workshop
(ASRU). Cartagena, Colombia pp 251–258. https://doi.org/10.1109/ASRU51503.2021.9688018
54. Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks
from overfitting. J Mach Learn Res 15(1):1929–1958
55. Hermans JR, Spanakis G, Möckel R (2017) Accumulated gradient normalization. In: Asian conference
on machine learning. PMLR, pp 439–454
56. Karita S, et al (2019) A comparative study on transformer vs rnn in speech applications. In: IEEE automatic
speech recognition and understanding workshop (ASRU), Singapore, pp 449–456, https://doi.org/10.
1109/ASRU46091.2019.9003750
57. An K, Xiang H, Ou Z (2020) CAT: a CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end
approaches towards data efficiency and low latency. arXiv preprint arXiv:2005.13326
58. Watanabe S, Hori T, Karita S, et al (2018) Espnet: end-to-end speech processing toolkit. arXiv preprint
arXiv:1804.00015
59. An K, Shi X, Zhang S (2023) BAT: boundary aware transducer for memory-efficient and low-latency
ASR. arXiv preprint arXiv:2305.11571
60. Gao Z, Li Z, Wang J, et al (2023) FunASR: a fundamental end-to-end speech recognition toolkit. arXiv
preprint arXiv:2305.11013


61. Fang Y, Li X (2023) Unimodal aggregation for CTC-based speech recognition. arXiv preprint
arXiv:2309.08150
62. Hadian H, Sameti H, Povey D, Khudanpur S (2018) Flatstart single-stage discriminatively trained HMM-
based models for ASR. IEEE/ACM Trans Audio Speech Lang Process 26(11):1949–1961
63. Zheng H, An K, Ou Z (2021) Efficient neural architecture search for end-to-end speech recognition via
straight-through gradients. In: 2021 IEEE spoken language technology workshop (SLT). IEEE, pp 60–67
64. Zeghidour N, Xu Q, Liptchinsky V, Usunier N, Synnaeve G, Collobert R (2018) Fully convolutional
speech recognition. arXiv preprint arXiv:1812.06864

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

