
Less peaky and more accurate CTC forced alignment by label priors

Abstract

Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior of CTC and improving its suitability for forced alignment generation, by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens in addition to their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC’s token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.

Index Terms—  CTC, forced alignment, label priors

1 Introduction

Speech-to-text forced alignment (FA) is a task to automatically produce exact time intervals for the written tokens (e.g., phonemes, words) in speech recordings. It’s a fundamental step in speech dataset preparation [1, 2], keyword search [3], closed captioning [4], and analytical tasks such as phonetic and linguistic studies [5, 6, 7].

Traditionally, FA was based on Gaussian Mixture Models (GMM). Popular toolkits include Montreal forced aligner (MFA) [8], FAVE [9], Prosodylab-Aligner [10] and Gentle [11]. Among them, MFA is the most widely used and considered as state-of-the-art [12]. However, compared with deep learning based approaches, GMM-based approaches have limited learning capacity and noise robustness, and their multi-stage modeling pipeline is quite complicated.

As deep learning based ASR has become popular, FA solutions based on neural networks trained with Connectionist Temporal Classification (CTC) [13, 14, 15, 16], attention mechanisms [12, 17], Hidden Markov Model (HMM) topology [18] or self/semi-supervised learning [19] have emerged. Among those, CTC-based FA is the most popular option, as it is a by-product of CTC-based ASR and is very easy to train.

The major issue with CTC-based FA is its peaky behavior. Originally, CTC was proposed for ASR tasks [20]. The blank symbol was introduced to serve as a silence token and a placeholder for aligning feature and token sequences of different lengths. Empirically, people observed that blanks dominate the predicted sequence [21, 22, 23], i.e., CTC models tend to output spikes of non-blank symbols surrounded by many blanks (Fig. 1(b)). This is not a problem for ASR, which concerns only the accuracy of the hypothesis with blanks removed. However, the FA task needs to assign token labels to each acoustic frame, and such peaky behavior causes inaccurate alignments by assigning too many blanks to non-silence acoustic frames.

Fig. 1: (a) Spectrogram and human-labeled phoneme-level timestamps from Buckeye; (b) Posteriors of the best alignment path of a standard CTC model, with a peaky behavior where each symbol fires for only one frame; (c) Posteriors of the best alignment path of the CTC model trained by our method, where the peaky behavior has been alleviated.

To remedy this issue, [21] proposed maximum entropy regularization on sequence level to encourage exploration of paths containing fewer blanks. [22] introduced a strategy that sets the blank and non-blank proportions in the posteriors and focused on key frames during training. [23] extended the CTC loss by including label priors, which is known as hybrid model loss. [24] sampled one path from all feasible alignments and converted CTC to cross entropy (CE) training. [3] used a simple heuristic to smooth the CTC posteriors where each token is assumed to last for a constant number of frames until the next non-blank token is predicted.

Compared to previous work, this paper takes [23] further by applying label priors to real-world FA tasks instead of toy examples (for a comparison with concurrent work [25], see the appendix). We show that with this simple yet effective method, CTC’s peaky behavior can be alleviated and the FA accuracy is improved (Fig. 1(c)). We apply the CTC loss with label priors to various model architectures, modeling units and model downsampling rates. It turns out that a small TDNN-FFN model with $5M$ parameters works best, as opposed to ASR where Conformer [26] is considered better. We derive the gradients of the CTC loss with label priors to understand the role of the label priors in the optimization process. Finally, we provide the first experimental benchmark of deep learning based FA methods against GMM-based FA methods on data with human transcribed phoneme-level timestamps (previous work [14] evaluates only at the sentence level; [12] reported better performance than MFA on Buckeye, yet their method requires very strong supervision, namely human annotated phoneme boundaries, which is impractical). We show that our method significantly improves alignment accuracy over the standard CTC model and a heuristic-based approach [3] for better predicting CTC’s token on-/offset. Moreover, we rival the state-of-the-art MFA [8] on Buckeye data and perform close to MFA on TIMIT. Nonetheless, our method has a simpler pipeline and faster runtime. Our recipe and pretrained model are available in TorchAudio (https://github.com/pytorch/audio) [27].

2 Connectionist temporal classification revisited

2.1 Standard CTC loss

Consider a speech feature sequence $\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{T})$ of length $T$, where $\mathbf{x}_{t}\in\mathbb{R}^{D}$ is a $D$-dimensional feature vector; a corresponding token sequence $\mathbf{\bar{W}}=(\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{U})$ of length $U$; and a token vocabulary $\mathcal{V}$. The ASR task is to predict the most probable token sequence $\mathbf{\hat{W}}\in\mathcal{V}^{*}$ given the features $\mathbf{X}$:

$$\mathbf{\hat{W}}=\arg\max_{\mathbf{W}\in\mathcal{V}^{*}}p(\mathbf{W}|\mathbf{X}) \tag{1}$$

CTC [20] is proposed to compute this probability $p(\mathbf{W}|\mathbf{X})$ directly during training. It introduces a blank symbol $\oslash$ to make an extended vocabulary $\mathcal{V}^{\prime}:=\mathcal{V}\cup\{\oslash\}$. The alignment paths are defined as sequences $\pi\in\mathcal{V}^{\prime T}$ of the same length as $\mathbf{X}$. Then, the token sequence $\mathbf{W}$ can be obtained by merging repeated tokens and removing any blank tokens $\oslash$ in $\pi$. This operation is defined by a many-to-one function: $\mathbf{W}=\mathcal{B}(\pi)$. CTC proposes to compute $p(\mathbf{W}|\mathbf{X})$ by summing over all valid alignment paths by dynamic programming:

$$P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{W})}P(\pi|\mathbf{X}) \tag{2}$$

The probability for each alignment $\pi$ is computed under a conditional independence assumption for each time step:

$$P(\pi|\mathbf{X})=\prod_{t=1}^{T}y_{\pi_{t}}^{t} \tag{3}$$

where $y_{\pi_{t}}^{t}$ is the posterior probability of token $\pi_{t}\in\mathcal{V}^{\prime}$ at time $t$ predicted by the model given $\mathbf{X}$. Thus, the model can be trained by the principle of maximum likelihood, i.e., maximizing $P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})$ over the training data, or equivalently, minimizing $-\log P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})$.
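To make Equations 2 and 3 concrete, the following is a minimal pure-Python sketch, not the implementation used in this work. It computes $P_{\mathrm{ctc}}(\mathbf{W}|\mathbf{X})$ twice: once by enumerating every path and applying $\mathcal{B}$, and once with the usual forward recursion over the blank-extended label sequence. The blank index `BLANK = 0` and all function names are assumptions made for illustration.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank symbol in the extended vocabulary


def collapse(path):
    """The many-to-one map B: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)


def ctc_prob_brute_force(posteriors, target):
    """Eq. (2)-(3): sum the per-frame products over all paths in B^{-1}(target)."""
    num_frames, vocab_size = len(posteriors), len(posteriors[0])
    total = 0.0
    for path in itertools.product(range(vocab_size), repeat=num_frames):
        if collapse(path) == tuple(target):
            total += math.prod(posteriors[t][s] for t, s in enumerate(path))
    return total


def ctc_prob_forward(posteriors, target):
    """The same quantity by dynamic programming over the extended label sequence."""
    ext = [BLANK]
    for tok in target:
        ext += [tok, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = posteriors[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = posteriors[0][ext[1]]
    for t in range(1, len(posteriors)):
        new = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            acc = alpha[s]
            if s >= 1:
                acc += alpha[s - 1]
            # A blank may be skipped only between two different non-blank tokens.
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                acc += alpha[s - 2]
            new[s] = acc * posteriors[t][sym]
        alpha = new
    # Valid paths end in the last token or in the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
```

The enumeration is exponential in $T$ and only serves as a reference for small toy inputs; the forward recursion is the polynomial-time computation that Equation 2 refers to.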

2.2 Optimal alignment path selection problem

Given a model trained by CTC and an input feature sequence $\mathbf{X}$ of length $T$, we can generate a frame-wise posterior distribution (“posteriorgram”) of shape $T\times|\mathcal{V}^{\prime}|$ along the time axis and over the extended vocabulary. In other words, the model probabilistically associates each frame in $\mathbf{X}$ with either a token in $\mathcal{V}$ or the blank token $\oslash$. At the sequence level, we hope to find the optimal alignment path $\pi^{*}\in\mathcal{B}^{-1}(\mathbf{W})$ with the highest probability $P(\pi|\mathbf{X})$ given the transcription $\mathbf{W}$. This is the optimal alignment path selection problem, which can be solved in polynomial time with the Viterbi algorithm. From the best alignment path, we can obtain the timestamp for each token in the audio. Note that, for CTC FA, the alignments are learned without any frame-level supervision, unlike [12] or [19].
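The path selection problem can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions rather than the TorchAudio implementation: the blank index `BLANK = 0`, the function names, and the `frame_shift_ms` argument are all hypothetical. `viterbi_align` searches the blank-extended label sequence for the best path, and `token_spans` converts that path into token onsets and offsets.

```python
import math

BLANK = 0  # assumed index of the blank symbol


def viterbi_align(log_scores, target):
    """Return the best path (one symbol per frame) that collapses to target."""
    ext = [BLANK]
    for tok in target:
        ext += [tok, BLANK]
    num_states = len(ext)
    best = [-math.inf] * num_states
    best[0] = log_scores[0][ext[0]]
    if num_states > 1:
        best[1] = log_scores[0][ext[1]]
    backptr = []
    for t in range(1, len(log_scores)):
        new, ptr = [-math.inf] * num_states, [0] * num_states
        for s, sym in enumerate(ext):
            cands = [(best[s], s)]
            if s >= 1:
                cands.append((best[s - 1], s - 1))
            # A blank may be skipped only between two different non-blank tokens.
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                cands.append((best[s - 2], s - 2))
            score, prev = max(cands)
            new[s], ptr[s] = score + log_scores[t][sym], prev
        best = new
        backptr.append(ptr)
    # A valid path ends in the trailing blank or in the last token.
    s = num_states - 1
    if num_states > 1 and best[num_states - 2] > best[s]:
        s = num_states - 2
    states = [s]
    for ptr in reversed(backptr):
        s = ptr[s]
        states.append(s)
    states.reverse()
    return [ext[s] for s in states]


def token_spans(path, frame_shift_ms):
    """Turn a frame-level path into (token, onset_ms, offset_ms), skipping blanks."""
    spans, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            if path[start] != BLANK:
                spans.append((path[start], start * frame_shift_ms, t * frame_shift_ms))
            start = t
    return spans
```

With a peaky model, most frames of the best path are blanks, so the spans returned by `token_spans` are short and their offsets fall well before the true token ends.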

2.3 CTC loss with label priors

It is well known that standard CTC models have a peaky behavior. In the literature [21, 22, 23], there are in-depth analyses of why such behavior happens. In summary, since the blank token $\oslash$ is the most versatile and frequent token in the space of feasible alignment paths $\mathcal{B}^{-1}(\mathbf{W})\subseteq\mathcal{V}^{\prime T}$, it is easier for the CTC model to pick up optimal paths containing more blanks at the beginning. Once the paths containing blanks are learned, the model can reinforce itself by tending to go through such paths when computing $P_{\mathrm{ctc}}$, getting gradients dominantly for such paths and ignoring alternative alignment paths. In this way, the peaky CTC posteriors emerge.

To address the peaky behavior issue, i.e., reducing the number of blanks in the optimal alignment paths, it is natural to use unigram label priors $P(k)$, $k\in\mathcal{V}^{\prime}$, to penalize paths containing too many blank symbols. We can use the following loss function in place of the standard CTC loss (Equation 2):

$$P_{\mathrm{ctc\_with\_priors}}(\mathbf{W}|\mathbf{X})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{W})}P_{\mathrm{with\_priors}}(\pi|\mathbf{X}), \tag{4}$$

$$\mathrm{where}\;\;P_{\mathrm{with\_priors}}(\pi|\mathbf{X})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}/P(\pi_{t})^{\alpha} \tag{5}$$

The hyper-parameter $\alpha\in\mathbb{R}$ is a scaling factor. When $\alpha=0$, no priors are applied and this is exactly the standard CTC loss. Intuitively, in Equation 5, if a token, e.g., the blank token, occurs more frequently, it will have a larger prior $P(\pi_{t})$ and its posterior probability will be penalized more, so that all alignment paths, including the optimal one, will tend to avoid such a token. Note that the label priors are not only applied during the Viterbi search for optimal paths, but are also applied during training. In this way, the model intrinsically learns to produce paths containing fewer blanks.
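The effect of Equation 5 on path selection can be illustrated with a toy example. The sketch below is pure Python with made-up numbers: the two-symbol vocabulary, the posteriors and the priors `[0.8, 0.2]` are assumptions, and the function names are hypothetical. It rescales each frame by $P(k)^{\alpha}$ in the log domain and finds the best feasible path by exhaustive search.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank symbol


def collapse(path):
    """The many-to-one map B: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)


def apply_priors(posteriors, priors, alpha):
    """Per-frame scores y_k^t / P(k)^alpha from Eq. (5), in the log domain."""
    return [
        [math.log(y) - alpha * math.log(p) for y, p in zip(frame, priors)]
        for frame in posteriors
    ]


def best_path(log_scores, target):
    """Exhaustive search for the highest-scoring path that collapses to target."""
    vocab = range(len(log_scores[0]))
    feasible = (
        p
        for p in itertools.product(vocab, repeat=len(log_scores))
        if collapse(p) == tuple(target)
    )
    return max(feasible, key=lambda p: sum(log_scores[t][s] for t, s in enumerate(p)))


# Toy posteriorgram over (blank, token 1): blank narrowly wins on three of four frames.
posteriors = [[0.55, 0.45], [0.3, 0.7], [0.55, 0.45], [0.55, 0.45]]
priors = [0.8, 0.2]  # hypothetical priors with a dominant blank

peaky = best_path(apply_priors(posteriors, priors, 0.0), [1])
smoothed = best_path(apply_priors(posteriors, priors, 0.3), [1])
```

In this toy case the unscaled scores select a path in which the token fires on a single frame, while the prior-scaled scores select a path without blanks. The scaled per-frame scores no longer sum to one, which matters when choosing a CTC loss implementation.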

In fact, the idea of leveraging label priors originated from hybrid NN-HMM models for ASR [23, 28], where the output posteriors of neural networks (NN) are divided by label priors before integrating into the generative framework of HMM. To train the NN models, the alignment outputs from GMM-HMM models are used as the supervision for a frame-level cross entropy loss. In contrast, here we train the model with CTC loss directly without frame-level supervision.

During training, the label priors $P(k)$ are first initialized to a uniform distribution. Later, they are updated at the end of every epoch until convergence. We follow [29] to compute $P(k)$ by marginalizing over the posteriorgram for each token in $\mathcal{V}^{\prime}$ over all frames, accumulating the statistics during the training epoch. Alternatively, we can obtain $P(k)$ by counting token frequencies in the Viterbi alignment paths on training examples. It turns out that both methods converge and produce similar results, while the first one is simpler to implement.
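A minimal sketch of the first estimation method is given below. It assumes that "marginalizing over the posteriorgram" amounts to averaging the frame-wise posteriors over every frame seen in an epoch; the function names and the nested-list data layout are hypothetical, and a real training loop would accumulate these sums batch by batch.

```python
def uniform_priors(vocab_size):
    """Initial priors before the first epoch."""
    return [1.0 / vocab_size] * vocab_size


def estimate_label_priors(posteriorgrams):
    """Average the frame-wise posteriors over all frames of all utterances.

    posteriorgrams: list of utterances, each a list of per-frame
    probability vectors over the extended vocabulary.
    """
    vocab_size = len(posteriorgrams[0][0])
    totals, num_frames = [0.0] * vocab_size, 0
    for utterance in posteriorgrams:
        for frame in utterance:
            for k, y in enumerate(frame):
                totals[k] += y
            num_frames += 1
    return [x / num_frames for x in totals]
```

Because every frame is a probability vector, the estimate is itself a probability distribution over $\mathcal{V}^{\prime}$.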

2.4 Gradients of CTC loss with label priors

To understand the impact of label priors in the optimization process, we derive the gradients of Equation 4 following the notations in [30]. The objective function $O$ is defined as the negative logarithm of $P_{\mathrm{ctc\_with\_priors}}$. The un-normalized network output is denoted as $u^{t}_{k}$ such that $y^{t}_{k}=\text{Softmax}(u^{t}_{k})$. The gradients of the CTC loss with label priors are given in Equation 6 below, which has a very simple form and is reminiscent of the gradients without label priors as in [30]:

$$\frac{\partial O_{\mathrm{with\_priors}}}{\partial u_{k}^{t}}=y_{k}^{t}-\frac{1}{p^{\star}(\mathbf{\bar{W}}|\mathbf{X})}\sum_{s\in lab(\mathbf{\bar{W}},k)}\alpha_{t}^{\star}(s)\beta_{t}^{\star}(s) \tag{6}$$

To save space, we use the star ($\star$) symbol to denote “$\mathrm{\_with\_priors}$”. Basically, the gradient of the objective $O$ with respect to the un-normalized network output $u^{t}_{k}$ for symbol $k$ at time $t$ consists of two terms. The first term $y_{k}^{t}$ is the posterior probability without applying label priors, which is exactly the same as the corresponding term in the gradients in [30]. The only difference is in the second term, which is computed with or without label priors. In Equation 6, it computes what proportion of $p^{\star}(\mathbf{\bar{W}}|\mathbf{X})$ goes through the symbol $k$ at time $t$ after the label priors are applied. The optimization process simply tries to match the posterior $y_{k}^{t}$ with this proportion. If the two terms are equal, the gradient becomes zero and a local optimum is reached. This explains the “error signals” received by the network during training.
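Equation 6 can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the derivation or code used in this work. It replaces the forward-backward variables with an explicit enumeration of feasible paths, so the second term is computed directly as the share of prior-scaled path mass passing through symbol $k$ at time $t$, and it compares the result with central finite differences of $O$. The priors are treated as constants, and all names and numbers are hypothetical.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank symbol


def collapse(path):
    """The many-to-one map B: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)


def softmax(logits):
    top = max(logits)
    exps = [math.exp(u - top) for u in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def path_weights(logits, target, priors, alpha):
    """Weight prod_t y^t_{pi_t} / P(pi_t)^alpha of each path that collapses to target."""
    post = [softmax(frame) for frame in logits]
    weights = {}
    for path in itertools.product(range(len(priors)), repeat=len(logits)):
        if collapse(path) == tuple(target):
            weights[path] = math.prod(
                post[t][k] / priors[k] ** alpha for t, k in enumerate(path)
            )
    return post, weights


def objective(logits, target, priors, alpha):
    """O = -log P_ctc_with_priors(W | X), Eq. (4)."""
    _, weights = path_weights(logits, target, priors, alpha)
    return -math.log(sum(weights.values()))


def analytic_gradient(logits, target, priors, alpha):
    """Eq. (6): posterior minus the share of path mass going through (t, k)."""
    post, weights = path_weights(logits, target, priors, alpha)
    total = sum(weights.values())
    grad = [list(frame) for frame in post]
    for path, weight in weights.items():
        for t, k in enumerate(path):
            grad[t][k] -= weight / total
    return grad


def numeric_gradient(logits, target, priors, alpha, eps=1e-6):
    """Central finite differences of the objective with respect to each logit."""
    grad = [[0.0] * len(frame) for frame in logits]
    for t in range(len(logits)):
        for k in range(len(logits[t])):
            shifted = [list(frame) for frame in logits]
            shifted[t][k] = logits[t][k] + eps
            upper = objective(shifted, target, priors, alpha)
            shifted[t][k] = logits[t][k] - eps
            lower = objective(shifted, target, priors, alpha)
            grad[t][k] = (upper - lower) / (2 * eps)
    return grad
```

Each row of the analytic gradient sums to zero, since both the posteriors and the path-mass proportions sum to one over the symbols at a given frame.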

3 Experiments And Analysis

3.1 Datasets

We train our model using Librispeech [31], which contains 960 hours of read English speech from audio books. For evaluation, we use the Buckeye [32] and TIMIT [33] corpora. TIMIT contains 5.4 hours of read speech with time-aligned phoneme-level transcriptions. Buckeye contains 20 hours of spontaneous English conversations (interviews) from 40 speakers. It comes with forced-aligned, then manually corrected, phoneme-level timestamps. Buckeye is distributed as many recordings of roughly 10 minutes each. Following the common practice in [8, 12], we segment the long recordings into smaller chunks. We split at non-speech regions (pauses, noise, etc.) longer than one second, instead of 150 ms as in [8], to avoid overly short segments.

We randomly selected 4 speakers as development data to tune hyper parameters. Note, however, for the FA task, it is always possible to fine-tune the model on the target/test data (audio and transcription) to reduce acoustic condition mismatch, before actually producing alignments. We will provide FA results with and without fine-tuning on the target data.

3.2 Metrics

Phoneme or word boundary error (PBE/WBE) measures how close the predicted and manually labeled timestamps are. Following [8], PBE is defined as the average of $N$ utterance-level PBEs (where $p$ stands for phonemes and $u_{i}$ is the $i$-th utterance below):

$$PBE\doteq\frac{1}{N}\sum_{i}\frac{1}{|u_{i}|}\sum_{p\in u_{i}}\frac{1}{2}\left(|p_{beg}^{ref}-p_{beg}^{pred}|+|p_{end}^{ref}-p_{end}^{pred}|\right)$$

WBE is similarly defined on word-level. Ideally, PBE and WBE should be close to 0.
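The metric can be written directly from the definition above. The following sketch assumes each utterance is a list of (begin, end) pairs in milliseconds, with reference and predicted tokens already in one-to-one correspondence; the function names are hypothetical.

```python
def utterance_boundary_error(ref, pred):
    """Mean over tokens of 0.5 * (|onset error| + |offset error|) for one utterance."""
    if len(ref) != len(pred):
        raise ValueError("reference and prediction must contain the same tokens")
    errors = [
        0.5 * (abs(ref_beg - pred_beg) + abs(ref_end - pred_end))
        for (ref_beg, ref_end), (pred_beg, pred_end) in zip(ref, pred)
    ]
    return sum(errors) / len(errors)


def boundary_error(ref_utts, pred_utts):
    """Average of the utterance-level errors, as in the PBE definition."""
    per_utt = [utterance_boundary_error(r, p) for r, p in zip(ref_utts, pred_utts)]
    return sum(per_utt) / len(per_utt)
```

The same functions apply to WBE when the (begin, end) pairs describe words instead of phonemes. Note that the average is taken per utterance first, so short utterances weigh as much as long ones.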

Phoneme or word average duration (PDUR/WDUR) is the average predicted duration of phonemes or words. PDUR and WDUR should match the average duration of manual timestamps, the closer the better. PDUR, WDUR and the final label priors are used to measure the spikiness of CTC models.

3.3 Model configuration and implementation details

For the encoder of our CTC aligner, we compared different model architectures (TDNN-FFN, TDNN-BLSTM [18] and Conformer [26] of $5M$, $27M$ and $85M$ parameters, respectively), different modeling units (phonemes, characters, sentencepieces [34]) and different sub-sampling rates ($1$, $2$ and $4$). Our models take Mel-spectrogram features with 10 ms frame shift as input, and are trained by minimizing the CTC loss with or without label priors (Section 2.3). For phoneme models, the phoneme set (of $93$ phonemes), pronunciation dictionary and G2P model are taken from MFA [35]. We train our models for $20$ epochs and select the model with the best loss value on development data. We observed no further improvement for FA when training for more epochs, which is different from ASR.

It is noteworthy that, at the time of writing, the CTC loss implementation in PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) only supports inputs that are sum-to-one probabilities. However, with label priors, the inputs to CTC no longer sum to one. On the other hand, the CTC dynamic programming does not require a valid probability distribution. As alternatives to PyTorch’s CTC, we found that the CTC loss implemented in the k2 library (https://k2-fsa.github.io/k2/python_api/api.html#ctc-loss) or in another open source implementation (https://github.com/vadimkantorov/ctc) can match the gradients computed by Equation 6 when label priors are applied.

3.4 Results

3.4.1 Effectiveness of the proposed CTC model

We trained a TDNN-FFN model of $5M$ parameters with phoneme outputs, which is a stack of 3 TDNN layers (with kernel sizes $5,3,3$ and strides, i.e., sub-sampling factors of the acoustic encoder, $2,1,1$) and 5 feedforward layers. In Table 1, we compare our aligner with the aligner trained with standard CTC loss, the aligners based on Wav2Vec2 [36] (https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html) or MMS [1] (https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html) models, heuristics [3], as well as MFA based on a GMM-HMM triphone model.

Table 1: Comparing the alignment accuracy of several FA solutions on Buckeye and TIMIT data. All metrics are in milliseconds. The closer to the ground truth (the last row), the better.
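| # | System | Buckeye PBE | Buckeye WBE | Buckeye PDUR | Buckeye WDUR | TIMIT PBE | TIMIT WBE | TIMIT PDUR | TIMIT WDUR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Wav2Vec2 [36] | - | 89 | - | 177 | - | 48 | - | 229 |
| 2 | MMS [1] | - | 53 | - | 181 | - | 37 | - | 242 |
| 3 | Heuristics [3] | 42 | 55 | 60 | 212 | 31 | 41 | 30 | 239 |
| 4 | MFA [8] | 30 | 41 | 84 | 251 | 17 | 23 | 85 | 313 |
| 5 | +fine-tuning | 27 | 36 | 83 | 250 | 16 | 22 | 86 | 314 |
| 6 | Standard CTC | 44 | 58 | 21 | 169 | 32 | 42 | 21 | 229 |
| 7 | +fine-tuning | 39 | 52 | 22 | 163 | 31 | 40 | 23 | 234 |
| 8 | Our CTC | 38 | 43 | 64 | 221 | 28 | 29 | 72 | 288 |
| 9 | +fine-tuning | 30 | 34 | 74 | 232 | 27 | 28 | 79 | 301 |
| 10 | Ground truth | 0 | 0 | 82 | 241 | 0 | 0 | 76 | 305 |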

From Table 1, the CTC model trained by our proposed method (row 8, $\alpha=0.3$; when $\alpha=0$, the results are exactly the same as row 6, and when $\alpha>0.3$, the model does not converge) clearly outperforms the standard CTC model (row 6), the heuristic-based method (row 3) as well as the Wav2Vec2/MMS aligners (rows 1 and 2) in all metrics for the FA task. In particular, the predicted PDUR value of $21$ ms in row 6 matches the frame size of the model ($20$ ms with stride size $2$), which means the non-blank tokens fire for just one frame. In contrast, our model’s PDUR (row 8) is closer to the ground truth (row 10). Furthermore, the label prior probabilities for the blank token on the training data are $0.80$ and $0.32$ for rows 6 and 8, respectively (not shown in the table). This verifies that the standard CTC model is indeed peaky, and that our proposed method can effectively reduce the peakiness (Fig. 1).

Note that the MFA model we used (english_mfa) is trained on $3700$+ hours of English speech including Librispeech. To reduce acoustic condition mismatch and further improve FA, we fine-tune MFA and our aligners (for 6 epochs) on Buckeye data. From Table 1, fine-tuning is effective (rows 5, 7, 9 vs. rows 4, 6, 8), as the boundary errors decrease in every case and most metrics move closer to the ground truth. On Buckeye, our best PBE/WBE results (row 9) are close to the best of MFA (row 5), and our WBE is even slightly better than MFA’s. We conjecture that our aligner is good at modeling tokens next to silence (e.g., word boundaries), due to the larger modeling capacity of neural networks compared to GMMs. On TIMIT, our best aligner still falls behind MFA. We notice that our aligner and MFA have different frame shifts, $20$ ms and $10$ ms respectively, which is probably the reason our aligner is not able to produce even finer-grained timestamps. However, when we set our frame shift to $10$ ms, the CTC model becomes harder to train due to longer input sequences [23] and does not produce better alignments. This is left as future work.

To investigate the PBE/WBE improvement further, we break down PBE/WBE (of the models in Table 1, rows 6 and 8) into onset and offset timestamp errors. From Table 2, we find that the standard CTC makes more errors on onset prediction than offset prediction, which aligns with our expectation that a standard CTC model tends to delay token predictions. On the other hand, our CTC model improves both onset and offset predictions, with more improvement on the onset side, so that the onset/offset errors are more balanced.

Table 2: The breakdown of onset/offset timestamps errors on Buckeye for Table 1 row 6 and 8. All metrics are in milliseconds.
| | Phoneme onset | Phoneme offset | Word onset | Word offset |
|---|---|---|---|---|
| Standard CTC | 51 | 39 | 63 | 54 |
| Our CTC | 39 | 36 | 44 | 42 |

In the above experiments, it takes MFA $9.5$ minutes to finish alignment generation on the $20$-hour Buckeye data, with $8$ CPU jobs and multithreading turned on. In contrast, it takes only $3$ minutes for our model to finish on $1$ NVIDIA Titan RTX GPU, or less than $1$ minute on $4$ GPUs, thanks to the PyTorch-based implementation. This is not an apples-to-apples comparison given that MFA does not support GPUs, yet it gives a rough estimate of the best runtime of both systems.

3.4.2 Impact of network configurations

We start with the baseline configuration (TDNN-FFN, stride=2, phoneme) as in Table 3 row 1 (corresponding to row 6 and 8 in Table 1), and vary the model architecture (row 2-3), modeling unit (row 4-5), and model stride size (row 6-7) independently on top of the baseline configuration. We report results on Buckeye with and without applying our method in each cell of Table 3.

Table 3: Comparing different network configurations on Buckeye. The baseline configuration is (TDNN-FFN, stride=2, phoneme). We vary the model architecture, modeling unit and model stride size independently on top of the baseline configuration. In each cell, results for the standard/proposed CTC model are reported. All metrics are in milliseconds. The closer to ground truth (the last row), the better.
| # | Configuration | PBE | WBE | PDUR | WDUR |
|---|---|---|---|---|---|
| 1 | Baseline | 44 / 38 | 58 / 43 | 21 / 64 | 169 / 221 |
| 2 | TDNN-BLSTM | 42 / 53 | 53 / 74 | 23 / 75 | 175 / 261 |
| 3 | Conformer | 43 / 55 | 51 / 75 | 25 / 77 | 180 / 278 |
| 4 | char | - | 62 / 52 | - | 196 / 235 |
| 5 | sentencepiece | - | 101 / 52 | - | 80 / 230 |
| 6 | stride=1 | 51 / 46 | 65 / 49 | 11 / 78 | 156 / 242 |
| 7 | stride=4 | 45 / 40 | 56 / 47 | 43 / 71 | 196 / 230 |
| 8 | Ground truth | 0 | 0 | 82 | 241 |

When models with long-range memory are used (Table 3, rows 2 and 3), our method actually degrades PBE/WBE. It turns out that such models have learned to predict non-blank tokens repeatedly even when there is no speech in the audio, in order to avoid blank penalties from the priors. Only the TDNN-FFN model with $5M$ parameters works well, which is quite different from ASR where Conformer is considered better. Note that our TDNN-FFN model has a very limited perception range over the input features (13 frames), so it can only learn short-range acoustic information, not long-range language dependencies. In other words, $5M$ parameters are probably sufficient to make a good acoustic model. Thus, we conjecture that in the $85M$-parameter Conformer ASR model, only a small portion of the parameters are used for acoustic modeling, whereas the rest are used for language modeling of long-range dependencies.
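The 13-frame perception range follows from the TDNN kernel sizes (5, 3, 3) and strides (2, 1, 1) given in Section 3.4.1. The short sketch below reproduces the arithmetic, assuming undilated convolutions and assuming that the feedforward layers operate frame by frame and therefore add no temporal context; the function name is hypothetical.

```python
def receptive_field(kernel_sizes, strides):
    """Return (input frames seen by one output frame, total sub-sampling factor)."""
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        field += (kernel - 1) * jump
        jump *= stride
    return field, jump


frames, subsampling = receptive_field([5, 3, 3], [2, 1, 1])
output_frame_shift_ms = 10 * subsampling  # 10 ms input frame shift
```

Under these assumptions, one output frame sees 1 + 4 + 4 + 4 = 13 input frames, and the sub-sampling factor of 2 gives the 20 ms output frame shift.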

As shown in rows 4-7, the proposed CTC (right) works better than the standard CTC (left) in each cell, but none of the configurations in rows 4-7 works better than row 1. In the PDUR column, the standard CTC model (left) tends to predict a PDUR almost the same as the model input frame size, while the PDUR predicted by our CTC model (right) is closer to the ground truth.
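The comparisons above can be checked with simple arithmetic on the Table 3 values. The sketch below copies the numbers from the table and computes, for each row, the relative reduction in boundary error and the absolute gap between the predicted and ground-truth durations. The helper names are illustrative and not part of any released code.

```python
# (standard CTC, proposed CTC) in milliseconds, copied from Table 3.
TABLE3 = {
    "Baseline":      {"PBE": (44, 38), "WBE": (58, 43), "PDUR": (21, 64), "WDUR": (169, 221)},
    "TDNN-BLSTM":    {"PBE": (42, 53), "WBE": (53, 74), "PDUR": (23, 75), "WDUR": (175, 261)},
    "Conformer":     {"PBE": (43, 55), "WBE": (51, 75), "PDUR": (25, 77), "WDUR": (180, 278)},
    "char":          {"WBE": (62, 52), "WDUR": (196, 235)},
    "sentencepiece": {"WBE": (101, 52), "WDUR": (80, 230)},
    "stride=1":      {"PBE": (51, 46), "WBE": (65, 49), "PDUR": (11, 78), "WDUR": (156, 242)},
    "stride=4":      {"PBE": (45, 40), "WBE": (56, 47), "PDUR": (43, 71), "WDUR": (196, 230)},
}
GROUND_TRUTH = {"PDUR": 82, "WDUR": 241}  # last row of Table 3


def relative_reduction(standard: float, proposed: float) -> float:
    """Relative error reduction in percent (negative means degradation)."""
    return 100.0 * (standard - proposed) / standard


def duration_gap(predicted: float, metric: str) -> float:
    """Absolute distance in ms between a predicted duration and the ground truth."""
    return abs(predicted - GROUND_TRUTH[metric])


for name, row in TABLE3.items():
    parts = []
    for metric, (std, prop) in row.items():
        if metric in GROUND_TRUTH:
            parts.append(f"{metric} gap {duration_gap(std, metric)} -> {duration_gap(prop, metric)} ms")
        else:
            parts.append(f"{metric} {relative_reduction(std, prop):+.1f}%")
    print(f"{name}: " + ", ".join(parts))
```

For the baseline row, for example, this gives a PBE reduction of about 13.6% and shrinks the PDUR gap from 61 ms to 18 ms.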

3.4.3 Other variations of the proposed method

We experimented with the following variations of the proposed CTC model. (1) We used label priors only when decoding with the standard CTC model and got no improvement, which suggests that incorporating priors into the training loss is important. (2) Instead of applying label priors to all tokens, we applied penalties only to the blank token when training the standard CTC model, and got only minimal improvements over standard CTC. (3) We disallowed intra-word blanks during alignment for Table 1 row 8 or 9, which resembles the 1-state HMM topology [18] that disallows intra-word blanks, and got minimal improvements over standard CTC. This result agrees with Figure 3 of [23], where models with HMM topology tend to clump up non-blank tokens. (4) We fine-tuned the standard CTC model with the proposed loss on Buckeye and got results similar to our best results, suggesting that fine-tuning a standard CTC model with the proposed method is sufficient to resolve the peaky behavior.
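As a toy illustration of how label priors change path scores, the sketch below assumes that the priors enter as a scaled log-prior subtracted from each frame-level log-posterior (the usual hybrid-ASR convention). The exact formulation is the one defined earlier in the paper and in the released code. The posteriors, priors, scaling factor `alpha` and function names here are all made up for illustration. The sketch only scores two fixed alignment paths; it does not compute the CTC loss or its gradients.

```python
import math
from typing import List

BLANK = 0  # index of the blank token in this toy vocabulary {blank, 1, 2}


def apply_label_priors(log_probs: List[List[float]],
                       log_priors: List[float],
                       alpha: float) -> List[List[float]]:
    """Subtract alpha * log prior from every frame-level log-posterior."""
    return [[lp - alpha * log_priors[k] for k, lp in enumerate(frame)]
            for frame in log_probs]


def path_score(scores: List[List[float]], path: List[int]) -> float:
    """Log-domain score of one frame-level alignment path."""
    return sum(scores[t][k] for t, k in enumerate(path))


# Toy, peaky posteriors over 4 frames (assumed values, rows sum to 1).
posteriors = [
    [0.60, 0.35, 0.05],
    [0.30, 0.65, 0.05],
    [0.55, 0.40, 0.05],
    [0.70, 0.05, 0.25],
]
priors = [0.70, 0.15, 0.15]  # assumed label priors; blank dominates
alpha = 1.0                  # hypothetical scaling factor

log_probs = [[math.log(p) for p in frame] for frame in posteriors]
log_priors = [math.log(p) for p in priors]
adjusted = apply_label_priors(log_probs, log_priors, alpha)

# Two valid CTC alignments of the target sequence [1, 2].
blank_heavy_path = [BLANK, 1, BLANK, 2]
blank_free_path = [1, 1, 1, 2]

print("raw:     ", path_score(log_probs, blank_heavy_path), path_score(log_probs, blank_free_path))
print("adjusted:", path_score(adjusted, blank_heavy_path), path_score(adjusted, blank_free_path))
```

With the raw posteriors the blank-heavy path scores higher, while after the prior adjustment the path with fewer blanks scores higher. Variation (1) applies such an adjustment only at decoding time, whereas the proposed method incorporates the priors into the training loss.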

4 Conclusion and Future work

In this paper, we proposed using a CTC loss with label priors to train CTC models that produce less peaky distributions, enabling the CTC model to more accurately predict not only the onset but also the offset of the tokens. This property is especially suitable for the forced alignment (FA) task. In our benchmarks using human-transcribed phoneme-level timestamps, both the predicted onset and offset timestamps are improved by the proposed model, leading to a significant improvement in alignment accuracy (measured by phoneme/word boundary errors and phoneme/word durations) over the standard CTC model and a heuristics-based baseline method. Our proposed model also rivals the performance of the state-of-the-art MFA aligner on Buckeye, although it still falls behind MFA on TIMIT.

Building on this work, there are several directions for further improving CTC-based or other deep learning based forced aligners. First, the robustness of neural aligners to noisy audio and noisy transcriptions should be investigated, since neural networks are considered more robust than GMMs. Second, besides the lightweight TDNN-FFN model, it would be useful if the peaky behavior of TDNN-BLSTM or Conformer based CTC models could also be reduced. This would allow us to re-purpose pretrained ASR systems, which are nowadays usually based on Conformer-type architectures, for FA applications. Third, methods to effectively train FA models with smaller sub-sampling rates should be investigated. Finally, it is worthwhile to explore whether self-supervised learning features or objectives can further benefit our proposed CTC aligner.

Appendix A Appendix

There is an independent concurrent work [25] which is derived from the same theoretical foundation [23] as ours. We briefly summarize the contributions and main differences.

Both this paper and [25] adopt label priors to reduce the peakiness of CTC models on real-world FA tasks and to improve FA timestamp accuracy. Both works show that applying label priors is effective on different real-world datasets. Notably, both papers' frame-level classifiers are models without long-range memory: an FFN is used in [25] on top of the encoder's outputs and speech features, while a TDNN-FFN is used in our paper. This validates our choice of model architecture in Section 3.4.2.

First, [25] derives timestamps for the attention-based encoder-decoder architecture (LAS), while our work is primarily for CTC models. Second, [25] proposes other loss function terms and regularization techniques to improve FA, while our work focuses primarily on label priors and investigates the gradients. Third, [25] employs the TensorFlow, Lingvo and RETURNN toolkits for implementation (see their paper for details), while our solution is based on PyTorch and is publicly available (https://github.com/huangruizhe/audio/blob/aligner_label_priors/examples/asr/librispeech_alignment/loss.py). We also spotted an issue in PyTorch's native CTC implementation (https://github.com/pytorch/pytorch/issues/122243) and have addressed it in our solution. Finally, [25] uses Librispeech (whose ground-truth timestamps are automatically derived by an HMM-GMM model) and internal data for experiments. This work uses TIMIT and Buckeye, which come with manually annotated timestamps that are fairer when comparing CTC models with HMM-GMM models.

References

  • [1] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, et al., “Scaling speech technology to 1,000+ languages,” ArXiv, vol. abs/2305.13516, 2023.
  • [2] Matt Le, Apoorv Vyas, Bowen Shi, Brian Karrer, et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” ArXiv, vol. abs/2306.15687, 2023.
  • [3] Ruizhe Huang, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, et al., “Building keyword search system from end-to-end asr systems,” ICASSP, 2023.
  • [4] Chih-Wei Huang, “Automatic closed caption alignment based on speech recognition transcripts,” 2003.
  • [5] Laurel Mackenzie and Danielle Turton, “Assessing the accuracy of existing forced alignment software on varieties of british english,” Linguistics Vanguard, vol. 6, 2020.
  • [6] Hongchen Wu, Jiwon Yun, Xiang Li, Huiyi Huang, et al., “Using a forced aligner for prosody research,” Humanities and Social Sciences Communications, vol. 10, pp. 1–13, 2023.
  • [7] Jiahong Yuan, Wei Lai, Chris Cieri, and Mark Liberman, “Using forced alignment for phonetics research,” Chinese Language Resources and Processing: Text, Speech and Language Technology. Springer, 2018.
  • [8] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, et al., “Montreal forced aligner: Trainable text-speech alignment using kaldi,” in Interspeech, 2017.
  • [9] Ingrid Rosenfelder, Josef Fruehwald, Keelan Evanini, Scott Seyfarth, et al., “Fave (forced alignment and vowel extraction) suite version 1.1.3,” 2014.
  • [10] Kyle Gorman, Jonathan Howell, and Michael Wagner, “Prosodylab-aligner: A tool for forced alignment of laboratory speech,” Canadian Acoustics, vol. 39, pp. 192–193, 2011.
  • [11] Robert M Ochshorn and Max Hawkins, “Gentle: A robust yet lenient forced aligner built on kaldi,” 2015, https://github.com/lowerquality/gentle.
  • [12] Jingbei Li, Yi Meng, Zhiyong Wu, Helen M. Meng, et al., “Neufa: Neural network based end-to-end forced alignment with bidirectional attention mechanism,” ICASSP, 2022.
  • [13] Albert Zeyer, Eugen Beck, Ralf Schlüter, and Hermann Ney, “Ctc in the context of generalized full-sum hmm training,” in Interspeech, 2017.
  • [14] Ludwig Kürzinger, Dominik Winkelbauer, Lujun Li, Tobias Watzel, et al., “Ctc-segmentation of large corpora for german end-to-end speech recognition,” in International Conference on Speech and Computer, 2020.
  • [15] Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, et al., “Bayes risk ctc: Controllable ctc alignment in sequence-to-sequence tasks,” ICLR, 2023.
  • [16] Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman, “Whisperx: Time-accurate speech transcription of long-form audio,” Interspeech, 2023.
  • [17] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, et al., “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
  • [18] Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, et al., “On lattice-free boosted mmi training of hmm and ctc-based full-context asr models,” ASRU, 2021.
  • [19] Jian Zhu, Cong Zhang, and David Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” ICASSP, 2022.
  • [20] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” ICML, 2006.
  • [21] Hu Liu, Sheng Jin, and Changshui Zhang, “Connectionist temporal classification with maximum entropy regularization,” in NeurIPS, 2018.
  • [22] Hongzhu Li and Weiqiang Wang, “Reinterpreting ctc training as iterative fitting,” Pattern Recognit., vol. 105, 2020.
  • [23] Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Why does ctc result in peaky behavior?,” ArXiv, vol. abs/2105.14849, 2021.
  • [24] Ehsan Variani, Tom Bagby, Kamel Lahouel, Erik McDermott, et al., “Sampled connectionist temporal classification,” ICASSP, 2018.
  • [25] Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, et al., “Improving frame-level classifier for word timings with non-peaky ctc in end-to-end automatic speech recognition,” Interspeech, 2023.
  • [26] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • [27] Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, et al., “Torchaudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for pytorch,” ASRU, 2023.
  • [28] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, et al., “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Processing Magazine, vol. 29, pp. 82–97, 2012.
  • [29] Vimal Manohar, Daniel Povey, and Sanjeev Khudanpur, “Semi-supervised maximum mutual information training of deep neural network acoustic models,” in Interspeech, 2015.
  • [30] Alex Graves, “Supervised sequence labelling with recurrent neural networks,” in Studies in Computational Intelligence, 2012.
  • [31] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” ICASSP, 2015.
  • [32] Mark A. Pitt, Keith Johnson, Elizabeth Hume, Scott F. Kiesling, et al., “The buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability,” Speech Commun., vol. 45, pp. 89–95, 2005.
  • [33] John S Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  • [34] Taku Kudo and John Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in EMNLP, 2018.
  • [35] Michael McAuliffe and Morgan Sonderegger, “English mfa dictionary v2.0.0a,” Tech. Rep., https://mfa-models.readthedocs.io/pronunciationdictionary/English/EnglishMFAdictionaryv2_0_0a.html, May 2022.
  • [36] Alexei Baevski, Henry Zhou, Abdel rahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” ArXiv, vol. abs/2006.11477, 2020.