Less peaky and more accurate CTC forced alignment by label priors

Abstract

Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens in addition to their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC’s token offset timestamps by $12$ to $40\%$ in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.

Index Terms— CTC, forced alignment, label priors

1 Introduction

Speech-to-text forced alignment (FA) is a task to automatically produce exact time intervals for the written tokens (e.g., phonemes, words) in speech recordings. It’s a fundamental step in speech dataset preparation [1, 2], keyword search [3], closed captioning [4], and analytical tasks such as phonetic and linguistic studies [5, 6, 7].

Traditionally, FA was based on Gaussian Mixture Models (GMM). Popular toolkits include Montreal Forced Aligner (MFA) [8], FAVE [9], Prosodylab-Aligner [10] and Gentle [11]. Among them, MFA is the most widely used and is considered state-of-the-art [12]. However, compared with deep learning based approaches, GMM-based approaches have limited learning capacity and noise robustness, and their multi-stage modeling pipeline is quite complicated.

As deep learning based ASR has become popular, FA solutions based on neural networks trained with Connectionist Temporal Classification (CTC) [13, 14, 15, 16], attention mechanisms [12, 17], Hidden Markov Model (HMM) topology [18] or self/semi-supervised learning [19] have emerged. Among those, CTC-based FA is the most popular option, as it’s a by-product of CTC-based ASR and it’s very easy to train.

The major issue with CTC-based FA is its peaky behavior. Originally, CTC was proposed for ASR tasks [20]. The blank symbol was introduced to serve as a silence token and a placeholder for aligning feature/token sequences of different lengths. Empirically, people observed that blanks dominate the predicted sequence [21, 22, 23], i.e., CTC models tend to output spikes of non-blank symbols surrounded by many blanks (Fig. 1(b)). This is not a problem for ASR, which is concerned only with the accuracy of the hypothesis after blanks are removed. However, the FA task needs to assign a token label to each acoustic frame, so such peaky behavior causes inaccurate alignments by assigning too many blanks to non-silence acoustic frames.

Fig. 1: (a) Spectrogram and human-labeled phoneme-level timestamps from Buckeye; (b) Posteriors of the best alignment path of a standard CTC model, with a peaky behavior where each symbol fires for only one frame; (c) Posteriors of the best alignment path of the CTC model trained by our method, where the peaky behavior has been alleviated.

To remedy this issue, [21] proposed maximum entropy regularization at the sequence level to encourage exploration of paths containing fewer blanks. [22] introduced a strategy that sets the blank and non-blank proportions in the posteriors and focuses on key frames during training. [23] extended the CTC loss by including label priors, which is known as the hybrid model loss. [24] sampled one path from all feasible alignments and converted CTC to cross entropy (CE) training. [3] used a simple heuristic to smooth the CTC posteriors, where each token is assumed to last for a constant number of frames until the next non-blank token is predicted.

Compared to previous work, this paper takes [23] further by applying label priors to real-world FA tasks instead of toy examples (for a comparison with concurrent work [25], see the appendix). We show that with this simple yet effective method, CTC’s peaky behavior can be alleviated and the FA accuracy is improved (Fig. 1(c)). We apply the CTC loss with label priors to various model architectures, modeling units and model downsampling rates. It turns out that a small TDNN-FFN model with $5M$ parameters works the best, as opposed to ASR where Conformer [26] is considered better. We derive the gradients of the CTC loss with label priors to understand the role of the label priors in the optimization process. Finally, we provide the first experimental benchmark of deep learning based FA methods against GMM-based FA methods on data with human transcribed phoneme-level timestamps (footnote 1: previous work [14] evaluates only at sentence level; [12] reported better performance than MFA on Buckeye, yet their method requires very strong supervision, i.e., human annotated phoneme boundaries, which is impractical). We show that our method significantly improves alignment accuracy over the standard CTC model and a heuristic-based approach [3] for better predicting CTC’s token on-/offset. Moreover, we rival the state-of-the-art MFA [8] on Buckeye data and perform close to MFA on TIMIT. Nonetheless, our method has a simpler pipeline and faster runtime. Our recipe and pretrained model are available in TorchAudio [27] (footnote 2: https://github.com/pytorch/audio).

2 Connectionist temporal classification revisited

2.1 Standard CTC loss

Consider a speech feature sequence $\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{T})$ of length $T$, where $\mathbf{x}_{t}\in\mathbb{R}^{D}$ is a $D$-dimensional feature vector; a corresponding token sequence $\mathbf{\bar{W}}=(\mathbf{w}_{1},\mathbf{w}_{2},...,\mathbf{w}_{U})$ of length $U$; and a token vocabulary $\mathcal{V}$. The ASR task is to predict the most probable token sequence $\mathbf{\hat{W}}\in\mathcal{V}^{*}$ given the features $\mathbf{X}$:

$$\mathbf{\hat{W}}=\arg\max_{\mathbf{W}\in\mathcal{V}^{*}}p(\mathbf{W}|\mathbf{X}) \tag{1}$$

CTC [20] was proposed to compute this probability $p(\mathbf{W}|\mathbf{X})$ directly during training. It introduces a blank symbol $\oslash$ to make an extended vocabulary $\mathcal{V}^{\prime}:=\mathcal{V}\cup\{\oslash\}$. The alignment paths are defined as sequences $\pi\in\mathcal{V}^{\prime T}$ of the same length as $\mathbf{X}$. Then, the token sequence $\mathbf{W}$ can be obtained by merging repeated tokens and removing any blank tokens $\oslash$ in $\pi$. This operation is defined by a many-to-one function: $\mathbf{W}=\mathcal{B}(\pi)$. CTC computes $p(\mathbf{W}|\mathbf{X})$ by summing over all valid alignment paths with dynamic programming:

$$P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{W})}P(\pi|\mathbf{X}) \tag{2}$$

The probability for each alignment $\pi$ is computed under a conditional independence assumption for each time step:

$$P(\pi|\mathbf{X})=\prod_{t=1}^{T}y_{\pi_{t}}^{t} \tag{3}$$

where $y_{\pi_{t}}^{t}$ is the posterior probability of token ${\pi_{t}}\in\mathcal{V}^{\prime}$ at time $t$ predicted by the model given $\mathbf{X}$. Thus, the model can be trained by the principle of maximum likelihood, i.e., maximizing $P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})$ over the training data, or equivalently, minimizing $-\log P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})$.

2.2 Optimal alignment path selection problem

Given a model trained by CTC and an input feature sequence $\mathbf{X}$ of length $T$, we can generate a frame-wise posterior distribution (“posteriorgram”) of shape $T\times|\mathcal{V}^{\prime}|$ along the time axis and over the extended vocabulary. In other words, the model probabilistically associates each frame in $\mathbf{X}$ with either a token in $\mathcal{V}$ or the blank token $\oslash$. At the sequence level, we hope to find the optimal alignment path $\pi^{*}\in\mathcal{B}^{-1}(\mathbf{W})$ with the highest probability $P(\pi|\mathbf{X})$ given the transcription $\mathbf{W}$. This is the optimal alignment path selection problem, which can be solved in polynomial time with the Viterbi algorithm. From the best alignment path, we can obtain the timestamp for each token in the audio. Note that, for CTC FA, the alignments are learned without any frame-level supervision, unlike [12] or [19].
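
To make the path selection problem concrete, the following is a minimal pure-Python sketch of Viterbi search over the blank-extended token sequence, followed by a conversion of the best path into time intervals. It is an illustration under stated assumptions rather than the implementation used in our recipe: the function names `ctc_viterbi_align` and `path_to_segments`, the blank index `0`, and the default 20 ms frame step are all hypothetical choices made for this example.

```python
import math

def ctc_viterbi_align(log_probs, tokens, blank=0):
    """Return the most probable frame-level path that collapses to `tokens`.

    log_probs: T x V nested list of per-frame log-posteriors.
    tokens:    target token ids without blanks (length U >= 1).
    """
    T = len(log_probs)
    # Blank-extended sequence: blank, w1, blank, w2, ..., wU, blank.
    ext = [blank]
    for w in tokens:
        ext += [w, blank]
    S = len(ext)
    NEG = -math.inf
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start on the leading blank or on the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Predecessors: stay, advance by one, or skip the blank between
            # two different non-blank tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] > NEG:
                score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best
    # A valid path ends on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG:
        raise ValueError("too few frames to align the token sequence")
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]

def path_to_segments(path, blank=0, frame_ms=20):
    """Group runs of one non-blank token into (token, start_ms, end_ms)."""
    segments, prev = [], blank
    for t, k in enumerate(path):
        if k != blank and k != prev:
            segments.append([k, t * frame_ms, (t + 1) * frame_ms])
        elif k != blank:
            segments[-1][2] = (t + 1) * frame_ms
        prev = k
    return [tuple(seg) for seg in segments]
```

With a peaky posteriorgram most frames of the best path are blanks, so the intervals produced by `path_to_segments` shrink to roughly one frame per token, which is the failure mode discussed next.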

2.3 CTC loss with label priors

It is well known that standard CTC models exhibit peaky behavior, and the literature [21, 22, 23] offers in-depth analyses of why it happens. In summary, since the blank token $\oslash$ is the most versatile and frequent token in the space of feasible alignment paths $\mathcal{B}^{-1}(\mathbf{W})\subseteq\mathcal{V}^{\prime T}$, it is easier for the CTC model to pick up paths containing more blanks early in training. Once such paths are learned, the model reinforces itself: it tends to go through these paths when computing $P_{\mathrm{ctc}}$, receives gradients dominantly for them, and ignores alternative alignment paths. This is how the peaky CTC posteriors arise.

To address the peaky behavior, i.e., to reduce the number of blanks in the optimal alignment paths, it is natural to use unigram label priors $P(k)$, $k\in\mathcal{V}^{\prime}$, to penalize paths containing too many blank symbols. We can use the following loss function in place of the standard CTC loss (Equation 2):

$$P_{\mathrm{ctc\_with\_priors}}(\mathbf{W}|\mathbf{X})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{W})}P_{\mathrm{with\_priors}}(\pi|\mathbf{X}), \tag{4}$$
$$\mathrm{where}\;\;P_{\mathrm{with\_priors}}(\pi|\mathbf{X})=\prod_{t=1}^{T}\frac{y_{\pi_{t}}^{t}}{P(\pi_{t})^{\alpha}} \tag{5}$$

The hyper-parameter $\alpha\in\mathbb{R}$ is a scaling factor. When $\alpha=0$, no priors are applied and this is exactly the standard CTC loss. Intuitively, in Equation 5, if a token, e.g., the blank token, occurs more frequently, it will have a larger prior $P(\pi_{t})$ and its posterior probability will be penalized more, so that all alignment paths, including the optimal one, tend to avoid such a token. Note that the label priors are applied not only during the Viterbi search for optimal paths, but also during training. In this way, the model intrinsically learns to produce paths containing fewer blanks.
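
As a small worked illustration of Equation 5, the sketch below scores two alignment paths for a one-token transcription under the same posteriors, with and without priors. The helper name `path_log_score`, the uniform posteriors, and the prior of 0.2 for the non-blank token are hypothetical values chosen only for this example; the blank prior of 0.8 mirrors the order of magnitude seen for a peaky model.

```python
import math

def path_log_score(path, posteriors, priors, alpha):
    """Log of prod_t y^t_{pi_t} / P(pi_t)^alpha, i.e. Equation 5 in the log domain."""
    return sum(
        math.log(posteriors[t][k]) - alpha * math.log(priors[k])
        for t, k in enumerate(path)
    )

# Toy setup (assumed): index 0 is blank, index 1 is the only token "a".
posteriors = [[0.5, 0.5]] * 4      # model is undecided at every frame
priors = [0.8, 0.2]                # blank is far more frequent than "a"
peaky_path = [0, 1, 0, 0]          # "a" fires for one frame only
full_path = [1, 1, 1, 1]           # "a" covers all frames

for alpha in (0.0, 0.3):
    peaky = path_log_score(peaky_path, posteriors, priors, alpha)
    full = path_log_score(full_path, posteriors, priors, alpha)
    print(f"alpha={alpha}: peaky={peaky:.3f} full={full:.3f}")
```

With $\alpha=0$ both paths receive the same score, while with $\alpha=0.3$ the path with fewer blanks scores higher, because dividing by a large blank prior boosts a frame less than dividing by a small token prior.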

In fact, the idea of leveraging label priors originated from hybrid NN-HMM models for ASR [23, 28], where the output posteriors of neural networks (NN) are divided by label priors before integrating into the generative framework of HMM. To train the NN models, the alignment outputs from GMM-HMM models are used as the supervision for a frame-level cross entropy loss. In contrast, here we train the model with CTC loss directly without frame-level supervision.

During training, the label priors $P(k)$ are first initialized to a uniform distribution. They are then updated at the end of every epoch until convergence. We follow [29] to compute $P(k)$ by marginalizing the posteriorgram for each token in $\mathcal{V}^{\prime}$ over all frames, accumulating the statistics throughout the training epoch. Alternatively, we can obtain $P(k)$ by counting token frequencies in the Viterbi alignment paths of the training examples. It turns out that both methods converge and produce similar results, while the first one is simpler to implement.
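
The first estimation method amounts to averaging the posteriorgrams over every frame seen during an epoch. A minimal sketch follows, assuming posteriorgrams are available as nested lists of probabilities; the function name `estimate_label_priors` is hypothetical, and a real training loop would accumulate the same sums batch by batch on tensors.

```python
def estimate_label_priors(posteriorgrams, vocab_size):
    """Average per-frame posteriors over all frames of all utterances.

    posteriorgrams: iterable of T_i x V nested lists (one per utterance),
                    each row being a probability distribution over V tokens.
    """
    totals = [0.0] * vocab_size
    n_frames = 0
    for post in posteriorgrams:
        for frame in post:
            for k, y in enumerate(frame):
                totals[k] += y
            n_frames += 1
    return [x / n_frames for x in totals]

# Before the first epoch the priors are simply uniform.
vocab_size = 2
priors = [1.0 / vocab_size] * vocab_size
# After an epoch, replace them with the accumulated statistics.
epoch_posts = [[[0.9, 0.1], [0.7, 0.3]], [[0.5, 0.5]]]
priors = estimate_label_priors(epoch_posts, vocab_size)
```

Because each frame is a distribution, the resulting priors also sum to one, and a model that emits mostly blanks ends up with a large blank prior.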

2.4 Gradients of CTC loss with label priors

To understand the impact of label priors in the optimization process, we derive the gradients of Equation 4 following the notation in [30]. The objective function $O$ is defined as the negative logarithm of $P_{\mathrm{ctc\_with\_priors}}$. The un-normalized network output is denoted as $u^{t}_{k}$ such that $y^{t}_{k}=\text{Softmax}(u^{t}_{k})$. The gradients of the CTC loss with label priors are given in Equation 6 below, which has a very simple form and is reminiscent of the gradients without label priors in [30]:

$$\frac{\partial O_{\mathrm{with\_priors}}}{\partial u_{k}^{t}}=y_{k}^{t}-\frac{1}{p^{\star}(\mathbf{\bar{W}}|\mathbf{X})}\sum_{s\in lab(\mathbf{\bar{W}},k)}\alpha_{t}^{\star}(s)\beta_{t}^{\star}(s) \tag{6}$$

To save space, we use the star ($\star$) symbol to denote “$\mathrm{\_with\_priors}$”. The gradient of the objective $O$ with respect to the un-normalized network output $u^{t}_{k}$ for symbol $k$ at time $t$ consists of two terms. The first term $y_{k}^{t}$ is the posterior probability without label priors applied, which is exactly the same as the corresponding term in the gradients in [30]. The only difference is in the second term, which is computed with or without label priors. In Equation 6, it measures what proportion of $p^{\star}(\mathbf{\bar{W}}|\mathbf{X})$ goes through the symbol $k$ at time $t$ after the label priors are applied. The optimization process simply tries to match the posterior $y_{k}^{t}$ with this proportion. If the two terms are equal, the gradient becomes zero and a local optimum is reached. This characterizes the “error signals” received by the network during training.
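
The "posterior minus proportion" reading of Equation 6 can be checked numerically on a toy problem. The sketch below does not use the forward-backward recursion; instead it enumerates every path explicitly, which is only feasible for very small $T$ and vocabulary, and computes the share of the prior-weighted likelihood passing through each symbol at each frame. All function names and the toy numbers used with them are assumptions made for illustration.

```python
import itertools
import math

def collapse(path, blank=0):
    """The mapping B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return out

def softmax(u):
    m = max(u)
    e = [math.exp(x - m) for x in u]
    z = sum(e)
    return [x / z for x in e]

def objective_and_occupancy(logits, target, priors, alpha, blank=0):
    """Brute-force O = -log P_ctc_with_priors and the per-frame path shares."""
    T, V = len(logits), len(logits[0])
    y = [softmax(row) for row in logits]
    total = 0.0
    occ = [[0.0] * V for _ in range(T)]
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) != list(target):
            continue
        score = 1.0
        for t, k in enumerate(path):
            score *= y[t][k] / priors[k] ** alpha   # Equation 5
        total += score                               # Equation 4
        for t, k in enumerate(path):
            occ[t][k] += score
    occ = [[v / total for v in row] for row in occ]
    return -math.log(total), y, occ

def analytic_grad(logits, target, priors, alpha):
    """Gradient w.r.t. the logits: posterior minus path share."""
    _, y, occ = objective_and_occupancy(logits, target, priors, alpha)
    return [[y[t][k] - occ[t][k] for k in range(len(y[t]))]
            for t in range(len(y))]
```

Comparing `analytic_grad` against central finite differences of the objective returned by `objective_and_occupancy` gives matching values on small inputs, and each frame's gradient sums to zero because both terms are distributions over the vocabulary.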

3 Experiments And Analysis

3.1 Datasets

We train our model on Librispeech [31], which contains 960 hours of read English speech from audio books. For evaluation, we use the Buckeye [32] and TIMIT [33] corpora. TIMIT contains 5.4 hours of read speech with time-aligned phoneme-level transcriptions. Buckeye contains 20 hours of spontaneous English conversations (interviews) from 40 speakers. It comes with phoneme-level timestamps that were forced aligned and then manually corrected. Buckeye is distributed as many recordings of roughly 10 minutes each. Following the common practice in [8, 12], we segment the long recordings into smaller chunks. We split at non-speech (pauses, noise, etc.) of more than one second, instead of 150 ms as in [8], to avoid overly short segments.

We randomly selected 4 speakers as development data to tune hyperparameters. Note, however, that for the FA task it is always possible to fine-tune the model on the target/test data (audio and transcription) to reduce acoustic condition mismatch before actually producing alignments. We provide FA results with and without fine-tuning on the target data.

3.2 Metrics

Phoneme or word boundary error (PBE/WBE) measures how close the predicted and manually labeled timestamps are. Following [8], PBE is defined as the average of $N$ utterance-level PBEs (where $p$ stands for phonemes and $u_{i}$ is the $i$-th utterance below):

$$PBE\doteq\frac{1}{N}\sum_{i}\frac{1}{|u_{i}|}\sum_{p\in u_{i}}\frac{1}{2}\left(|p_{beg}^{ref}-p_{beg}^{pred}|+|p_{end}^{ref}-p_{end}^{pred}|\right)$$

WBE is defined similarly at the word level. Ideally, PBE and WBE should be close to 0.
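
The boundary error definition translates directly into code. The sketch below is one possible reading of it, assuming each unit (phoneme or word) is given as a tuple of reference and predicted begin/end times in milliseconds; the function name `boundary_error` and the tuple layout are hypothetical.

```python
def boundary_error(utterances):
    """Average of utterance-level boundary errors, in the input time unit.

    utterances: list of utterances, each a non-empty list of
                (ref_beg, ref_end, pred_beg, pred_end) tuples.
    """
    per_utterance = []
    for units in utterances:
        errors = [
            0.5 * (abs(ref_beg - pred_beg) + abs(ref_end - pred_end))
            for ref_beg, ref_end, pred_beg, pred_end in units
        ]
        per_utterance.append(sum(errors) / len(errors))
    return sum(per_utterance) / len(per_utterance)

# Two toy utterances with times in milliseconds.
example = [
    [(0, 100, 10, 90), (100, 200, 100, 220)],  # unit errors 10 and 10
    [(0, 50, 0, 90)],                          # unit error 20
]
pbe = boundary_error(example)
```

Note that the average is taken per utterance first, so short utterances weigh as much as long ones; passing word-level tuples instead of phoneme-level ones yields WBE.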

Phoneme or word average duration (PDUR/WDUR) is the average predicted duration of phonemes or words. PDUR and WDUR should match the average duration of manual timestamps, the closer the better. PDUR, WDUR and the final label priors are used to measure the spikiness of CTC models.

3.3 Model configuration and implementation details

For the encoder of our CTC aligner, we compared different model architectures (TDNN-FFN, TDNN-BLSTM [18] and Conformer [26] with $5M$, $27M$ and $85M$ parameters, respectively), different modeling units (phonemes, characters, sentencepieces [34]) and different sub-sampling rates ($1$, $2$ and $4$). Our models take Mel-spectrogram features with a 10 ms frame shift as input, and are trained by minimizing the CTC loss with or without label priors (Section 2.3). For phoneme models, the phoneme set (of $93$ phonemes), pronunciation dictionary and G2P model are taken from MFA [35]. We train our models for $20$ epochs and select the model with the best loss value on development data. We observed no further improvement for FA when training for more epochs, which is different from ASR.

Notably, at the time of writing, the CTC loss implementation in PyTorch (footnote 3: https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) only supports inputs that are sum-to-one probabilities. However, with label priors, the inputs to CTC no longer sum to one. On the other hand, the CTC dynamic programming does not require a valid probability distribution. As alternatives to PyTorch’s CTC, we found that the CTC loss implemented in the k2 library (footnote 4: https://k2-fsa.github.io/k2/python_api/api.html#ctc-loss) or in another open source implementation (footnote 5: https://github.com/vadimkantorov/ctc) can match the gradients computed by Equation 6 when label priors are applied.
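
The remark that prior-adjusted inputs no longer sum to one is easy to verify in the log domain, where applying the priors is a subtraction of $\alpha\log P(k)$ from each log-posterior. The snippet below is a stand-alone sketch with a hypothetical helper `apply_label_priors` and made-up numbers; it does not call any CTC library.

```python
import math

def apply_label_priors(log_probs, priors, alpha):
    """Subtract alpha * log P(k) from every frame's log-posteriors."""
    log_priors = [math.log(p) for p in priors]
    return [
        [lp - alpha * lq for lp, lq in zip(frame, log_priors)]
        for frame in log_probs
    ]

# One frame over {blank, "a"}; priors and alpha are illustrative values.
frame = [math.log(0.8), math.log(0.2)]
adjusted = apply_label_priors([frame], priors=[0.8, 0.2], alpha=0.3)[0]
mass = sum(math.exp(v) for v in adjusted)
print(f"total mass after applying priors: {mass:.3f}")
```

The adjusted scores are still perfectly usable as arc weights in the CTC dynamic programming, but any implementation that assumes normalized inputs when computing gradients would need to be checked against Equation 6.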

3.4 Results

3.4.1 Effectiveness of the proposed CTC model

We trained a TDNN-FFN model of $5M$ parameters with phoneme outputs, which is a stack of 3 TDNN layers (with kernel sizes $5,3,3$ and strides, i.e., sub-sampling factors of the acoustic encoder, of $2,1,1$) and 5 feedforward layers. In Table 1, we compare our aligner with the aligner trained with the standard CTC loss, the aligners based on Wav2Vec2 [36] (footnote 6: https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html) or MMS [1] (footnote 7: https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) models, the heuristics of [3], as well as MFA, which is based on a GMM-HMM triphone model.

Table 1: Comparing the alignment accuracy of several FA solutions on Buckeye and TIMIT data. All metrics are in milliseconds. The closer to the ground truth (the last row), the better.

		Buckeye				TIMIT
		PBE	WBE	PDUR	WDUR	PBE	WBE	PDUR	WDUR
1	Wav2Vec2 [36]	-	89	-	177	-	48	-	229
2	MMS [1]	-	53	-	181	-	37	-	242
3	Heuristics [3]	42	55	60	212	31	41	30	239
4	MFA [8]	30	41	84	251	17	23	85	313
5	+fine-tuning	27	36	83	250	16	22	86	314
6	Standard CTC	44	58	21	169	32	42	21	229
7	+fine-tuning	39	52	22	163	31	40	23	234
8	Our CTC	38	43	64	221	28	29	72	288
9	+fine-tuning	30	34	74	232	27	28	79	301
10	Ground truth	0	0	82	241	0	0	76	305

From Table 1, the CTC model trained by our proposed method (row 8, $\alpha=0.3$; footnote 8: when $\alpha=0$, the results are exactly the same as row 6, and when $\alpha>0.3$, the model does not converge) clearly outperforms the standard CTC model (row 6), the heuristic-based method (row 3) as well as the Wav2Vec2/MMS aligners (rows 1 and 2) in all metrics for the FA task. In particular, the predicted PDUR value of $21$ ms in row 6 roughly matches the input frame size of the model ($20$ ms with stride size $2$), which means the non-blank tokens fire for just one frame. In contrast, our model’s PDUR (row 8) is closer to the ground truth (row 10). In addition, the label prior probabilities for the blank token on training data are $0.80$ and $0.32$ for rows 6 and 8, respectively (not shown in the table). This verifies that the standard CTC model is indeed peaky, and that our proposed method can effectively reduce the peakiness (Fig. 1).

Note that the MFA model we used (english_mfa) is trained on $3700$+ hours of English speech including Librispeech. To reduce acoustic condition mismatch and further improve FA, we fine-tune MFA and our aligners (for 6 epochs) on Buckeye data. From Table 1, fine-tuning is effective (rows 5, 7, 9 vs. rows 4, 6, 8), as all metrics become closer to the ground truth. On Buckeye, our best PBE/WBE results (row 9) are close to the best of MFA (row 5), and our WBE even outperforms MFA slightly. We conjecture that our aligner is good at modeling tokens next to silence (e.g., word boundaries), due to the larger modeling capacity of neural networks compared to GMMs. On TIMIT, our best aligner still falls behind MFA. We notice that our aligner and MFA have different frame shifts, $20$ ms and $10$ ms respectively, which is probably the reason why our aligner cannot produce even finer-grained timestamps. However, when we set our frame shift to $10$ ms, the CTC model becomes harder to train due to longer input sequences [23] and does not produce better alignments. This is left as future work.

We further investigate the PBE/WBE improvement by breaking down PBE/WBE (of the models in Table 1, rows 6 and 8) into onset/offset timestamp errors. From Table 2, we find that the standard CTC model makes more errors on onset prediction than on offset prediction, which aligns with our expectation that a standard CTC model tends to delay token predictions. On the other hand, our CTC model improves both onset and offset predictions, with more improvement on the onset side, so that the onset/offset errors are more balanced.

Table 2: Breakdown of onset/offset timestamp errors on Buckeye for Table 1, rows 6 and 8. All metrics are in milliseconds.

	phoneme		word
	onset	offset	onset	offset
Standard CTC	51	39	63	54
Our CTC	39	36	44	42

In the above experiments, it takes MFA $9.5$ minutes to finish alignment generation on the $20$-hour Buckeye data, with $8$ CPU jobs and multithreading turned on. In contrast, it takes only $3$ minutes for our model to finish on $1$ NVIDIA Titan RTX GPU, or less than $1$ minute on $4$ GPUs, thanks to the PyTorch-based implementation. This is not an apples-to-apples comparison given that MFA does not support GPUs, yet it gives a rough estimate of the best runtime of both systems.

3.4.2 Impact of network configurations

We start with the baseline configuration (TDNN-FFN, stride=2, phoneme) as in Table 3 row 1 (corresponding to row 6 and 8 in Table 1), and vary the model architecture (row 2-3), modeling unit (row 4-5), and model stride size (row 6-7) independently on top of the baseline configuration. We report results on Buckeye with and without applying our method in each cell of Table 3.

Table 3: Comparing different network configurations on Buckeye. The baseline configuration is (TDNN-FFN, stride=2, phoneme). We vary the model architecture, modeling unit and model stride size independently on top of the baseline configuration. In each cell, results for the standard/proposed CTC model are reported. All metrics are in milliseconds. The closer to ground truth (the last row), the better.

		PBE	WBE	PDUR	WDUR
1	Baseline	44 / 38	58 / 43	21 / 64	169 / 221
2	TDNN-BLSTM	42 / 53	53 / 74	23 / 75	175 / 261
3	Conformer	43 / 55	51 / 75	25 / 77	180 / 278
4	char	-	62 / 52	-	196 / 235
5	sentencepiece	-	101 / 52	-	80 / 230
6	stride=1	51 / 46	65 / 49	11 / 78	156 / 242
7	stride=4	45 / 40	56 / 47	43 / 71	196 / 230
8	Ground truth	0	0	82	241

When models with long-range memory are used (Table 3, rows 2 and 3), our method actually degrades PBE/WBE. It turns out that such models have learned to predict non-blank tokens repeatedly even when there is no speech in the audio, in order to avoid blank penalties from the priors. Only the TDNN-FFN model with $5M$ parameters works well, which is quite different from ASR where Conformer is considered better. Note that our TDNN-FFN model has a very limited perception range over the input features (13 frames), so it can only learn short-range acoustic information rather than long-range language dependencies. In other words, $5M$ parameters are probably sufficient to make a good acoustic model. Thus, we conjecture that in the $85M$-parameter Conformer ASR model, only a small portion of the parameters are used for acoustic modeling, whereas the rest of the parameters are used for language modeling of long-range dependencies.
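
The 13-frame perception range follows from the kernel sizes and strides of the three TDNN layers. The short calculation below reproduces it under the assumption that the layers are plain strided 1-D convolutions without dilation (the feedforward layers act frame by frame and do not widen the context); the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernels, strides):
    """Receptive field (in input frames) and total stride of stacked 1-D convs."""
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump   # each layer widens the context by (k-1) input steps
        jump *= s              # distance between adjacent outputs, in input frames
    return rf, jump

frames, total_stride = receptive_field(kernels=(5, 3, 3), strides=(2, 1, 1))
frame_shift_ms = 10
print(frames, total_stride * frame_shift_ms)  # context in frames, output step in ms
```

With a 10 ms input frame shift, this gives a context of 13 input frames per output and one output every 20 ms, in line with the stride-2 baseline configuration.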

As shown in rows 4 to 7, the proposed CTC (right) works better than standard CTC (left) in each cell, but none of the other configurations in rows 4 to 7 works better than row 1. From the PDUR column, the standard CTC model (left) tends to predict PDUR to be almost the same size as the model input frame size, while the PDUR predicted by our CTC model (right) is closer to the ground truth.

3.4.3 Other variations of the proposed method

We experimented with the following variations of the proposed CTC model: (1) We use label priors only when decoding with the standard CTC model, and get no improvements, which suggests that incorporating priors into the training loss is important. (2) Instead of applying label priors to all tokens, we apply penalties only to the blank token of the standard CTC model during training, and get only minimal improvements over standard CTC. (3) We disallow intra-word blanks during alignment for Table 1 row 8 or 9, which resembles the 1-state HMM topology [18] disallowing intra-word blanks, and get minimal improvements over the standard CTC. This result agrees with Figure 3 from [23], where models with HMM topology tend to clump up non-blank tokens. (4) We fine-tune the standard CTC model with the proposed loss on Buckeye, and get results similar to our best results, suggesting that fine-tuning a standard CTC model with the proposed method is sufficient to resolve the peaky behavior.

4 Conclusion and Future work

In this paper, we proposed using a CTC loss with label priors to train CTC models that produce less peaky distributions, enabling the CTC model to more accurately predict not only the onset but also the offset of the tokens. Such a property is especially suitable for the forced alignment (FA) task. From our benchmarks using human transcribed phoneme-level timestamps, we see that both the predicted onset and offset timestamps are improved by the proposed model, leading to significant alignment accuracy improvement (measured by phoneme/word boundary errors and phoneme/word durations) compared to the standard CTC model and a heuristics-based baseline method. Our proposed model also rivals the performance of the state-of-the-art MFA aligner on Buckeye and performs close to it on TIMIT.

Building on this work, there are several directions to further improve CTC-based or deep learning based forced aligners. First, the neural aligners’ robustness to either noisy audio or noisy transcriptions should be investigated. Neural networks are considered more robust than GMMs. Second, besides the lightweight TDNN-FFN model, it would be useful if the peaky behavior of the TDNN-BLSTM or Conformer based CTC models could also be reduced. In this way, we could re-purpose pretrained ASR systems, which are usually based on Conformer-type architectures nowadays, for FA applications. Third, methods to effectively train FA models with smaller sub-sampling rates should be investigated. Finally, it is worthwhile to explore whether self-supervised learning features or objectives can further benefit our proposed CTC aligner.

Appendix A

There is an independent concurrent work [25] which is derived from the same theoretical foundation [23] as ours. We briefly summarize the contributions and main differences.

Both this paper and [25] adopt label priors to reduce the peakiness of CTC models on real-world FA tasks and to improve FA timestamp accuracy. Both works show that applying label priors on different real-world datasets is effective. Notably, both papers’ frame-level classifiers are models without long-range memory: an FFN is used in [25] on top of the encoder’s outputs and speech features, while TDNN-FFN is used in our paper. This validates our choice of model architecture in Section 3.4.2.

First, [25] derives timestamps for the attention-based encoder-decoder architecture (LAS), while our work is primarily for CTC models. Second, [25] proposes other loss function terms and regularization techniques to improve FA, while our work focuses primarily on label priors and investigates the gradients. Third, [25] employs the TensorFlow, Lingvo and RETURNN toolkits (see their paper for details) for implementation, while our solution is based on PyTorch and is publicly available (footnote 9: https://github.com/huangruizhe/audio/blob/aligner_label_priors/examples/asr/librispeech_alignment/loss.py). We also spotted an issue (footnote 10: https://github.com/pytorch/pytorch/issues/122243) in PyTorch’s native CTC implementation and have addressed it in our solution. Finally, [25] uses Librispeech (whose ground-truth timestamps are automatically derived by an HMM-GMM model) and internal data for experiments. This work uses TIMIT and Buckeye, which come with manually annotated timestamps and are therefore fairer for comparing CTC models with HMM-GMM models.

References

[1] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, et al., “Scaling speech technology to 1,000+ languages,” ArXiv, vol. abs/2305.13516, 2023.
[2] Matt Le, Apoorv Vyas, Bowen Shi, Brian Karrer, et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” ArXiv, vol. abs/2306.15687, 2023.
[3] Ruizhe Huang, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, et al., “Building keyword search system from end-to-end asr systems,” ICASSP, 2023.
[4] Chih-Wei Huang, “Automatic closed caption alignment based on speech recognition transcripts,” 2003.
[5] Laurel Mackenzie and Danielle Turton, “Assessing the accuracy of existing forced alignment software on varieties of british english,” Linguistics Vanguard, vol. 6, 2020.
[6] Hongchen Wu, Jiwon Yun, Xiang Li, Huiyi Huang, et al., “Using a forced aligner for prosody research,” Humanities and Social Sciences Communications, vol. 10, pp. 1–13, 2023.
[7] Jiahong Yuan, Wei Lai, Chris Cieri, and Mark Liberman, “Using forced alignment for phonetics research,” Chinese Language Resources and Processing: Text, Speech and Language Technology. Springer, 2018.
[8] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, et al., “Montreal forced aligner: Trainable text-speech alignment using kaldi,” in Interspeech, 2017.
[9] Ingrid Rosenfelder, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, et al., “Fave (forced alignment and vowel extraction) suite version 1.1.3,” 2014.
[10] Kyle Gorman, Jonathan Howell, and Michael Wagner, “Prosodylab-aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, pp. 192–193, 2011.
[11] Robert M Ochshorn and Max Hawkins, “Gentle: A robust yet lenient forced aligner built on kaldi,” 2015, https://github.com/lowerquality/gentle.
[12] Jingbei Li, Yi Meng, Zhiyong Wu, Helen M. Meng, et al., “Neufa: Neural network based end-to-end forced alignment with bidirectional attention mechanism,” ICASSP, 2022.
[13] Albert Zeyer, Eugen Beck, Ralf Schlüter, and Hermann Ney, “Ctc in the context of generalized full-sum hmm training,” in Interspeech, 2017.
[14] Ludwig Kurzinger, Dominik Winkelbauer, Lujun Li, Tobias Watzel, et al., “Ctc-segmentation of large corpora for german end-to-end speech recognition,” in International Conference on Speech and Computer, 2020.
[15] Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, et al., “Bayes risk ctc: Controllable ctc alignment in sequence-to-sequence tasks,” ICLR, 2023.
[16] Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” Interspeech, 2023.
[17] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, et al., “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
[18] Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, et al., “On lattice-free boosted mmi training of hmm and ctc-based full-context asr models,” ASRU, 2021.
[19] Jian Zhu, Cong Zhang, and David Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” ICASSP, 2022.
[20] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” ICML, 2006.
[21] Hu Liu, Sheng Jin, and Changshui Zhang, “Connectionist temporal classification with maximum entropy regularization,” in NeurIPS, 2018.
[22] Hongzhu Li and Weiqiang Wang, “Reinterpreting ctc training as iterative fitting,” Pattern Recognit., vol. 105, 2020.
[23] Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Why does ctc result in peaky behavior?,” ArXiv, vol. abs/2105.14849, 2021.
[24] Ehsan Variani, Tom Bagby, Kamel Lahouel, Erik McDermott, et al., “Sampled connectionist temporal classification,” ICASSP, 2018.
[25] Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, et al., “Improving frame-level classifier for word timings with non-peaky ctc in end-to-end automatic speech recognition,” Interspeech, 2023.
[26] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
[27] Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, et al., “Torchaudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for pytorch,” ASRU, 2023.
[28] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, pp. 82, 2012.
[29] Vimal Manohar, Daniel Povey, and Sanjeev Khudanpur, “Semi-supervised maximum mutual information training of deep neural network acoustic models,” in Interspeech, 2015.
[30] Alex Graves, “Supervised sequence labelling with recurrent neural networks,” in Studies in Computational Intelligence, 2012.
[31] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” ICASSP, 2015.
[32] Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott F. Kiesling, et al., “The buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability,” Speech Commun., vol. 45, pp. 89–95, 2005.
[33] John S Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
[34] Taku Kudo and John Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in EMNLP, 2018.
[35] Michael McAuliffe and Morgan Sonderegger, “English mfa dictionary v2.0.0a,” Tech. Rep., https://mfa-models.readthedocs.io/pronunciation%dictionary/English/EnglishMFAdictionaryv2_0_0a.html, May 2022.
[36] Alexei Baevski, Henry Zhou, Abdel rahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” ArXiv, vol. abs/2006.11477, 2020.