Tobias Weise (1,2), Philipp Klumpp (2), Kubilay Can Demir (1), Paula Andrea Pérez-Toro (2,4), Maria Schuster (3), Elmar Noeth (2), Bjoern Heismann (2), Andreas Maier (2), Seung Hee Yang (1)

Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

Abstract

This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme force aligner.

keywords:
speech inversion, attention, phoneme alignment, wav2vec 2.0, HPRC, tract variables, multi-task learning

1 Introduction

In phonetics, articulatory configurations are analyzed to understand how different sounds are produced and how they can be classified into phonemes within a particular language’s phonological system. Articulators are the various parts of the vocal tract and other structures (e.g. tongue, lips, palate) involved in the production of sounds. They are typically measured by placing sensor coils on them, in a procedure called electromagnetic articulography (EMA), and tracking their position and movement over time during speech. These sensor coordinates are naturally speaker-specific since they depend on the particular vocal tract anatomy of the recorded speaker. Tract Variables (TVs), introduced by Browman and Goldstein [1], on the other hand, combine multiple individual vocal tract articulator movements that achieve a specific linguistic objective into defined gestures relevant to articulation. Transformations were introduced by Ji [2] to convert EMA sensor coordinates into TVs, which were shown to be less speaker-dependent [3] than the original measurements.

Figure 1: Nine tract variables (TVs), used for speaker-independent articulatory speech inversion. Adapted from [4, 5].

The problem of inverting a speech signal back to its articulator positions is referred to as acoustic-to-articulatory speech inversion (AAI), which can involve TVs or EMA coordinates as targets. This task has been studied in both speaker-dependent and speaker-independent settings in the literature: multi-task learning (MTL) [6, 7], generative adversarial networks [8], the application to dysarthric speech [9] and speech therapy [10, 11, 12], the incorporation of fundamental frequency [13], and others [14, 15, 16] have been explored. A related but less studied problem is mapping a sequence of phonemes to articulator movements (PTA): gated bidirectional recurrent neural networks [17], attempts to model the entire vocal tract [18], comparative studies [19], and feed-forward transformers [20] have been applied, where the authors of the latter also applied their model to AAI in a speaker-dependent setting.

Figure 2: Proposed APTAI model, based on wav2vec2.0 fine-tuning via frame-classification and TV regression.

Phoneme recognition can be described as taking an audio signal as input and producing the corresponding frame-asynchronous phoneme sequence. However, the frame-synchronous relation [21] is required for the task of phoneme alignment [22, 23, 24], boundary detection, and segmentation [25]. This paper focuses on phoneme recognition and subsequent alignment to the individual frames, which can be beneficial e.g. during speech therapy [26, 27]. Here, we explore frame-wise classification and forced alignment. Our upper bound is a state-of-the-art (SOTA) text-dependent force aligner. This system relies on both audio and transcriptions as input, which are converted from graphemes to phonemes.

This paper introduces APTAI, a novel combination of AAI and PTA together with phoneme recognition and alignment. We require that the resulting models predict end-to-end (in a therapeutic context) while working speaker- and text-independently during inference. To this end, two different approaches are explored, with Figure 1 illustrating the TV regression targets used to model articulation.

2 Proposed Approach(es)

This paper introduces two approaches, sharing the same requirements outlined in the last paragraph of the introduction. Both make use of MTL optimization, composed of articulator movement regression and phoneme prediction paired with alignment. The main difference is the way they deal with the phoneme-related objective: APTAI is based on frame classification, whereas f-APTAI utilizes forced alignment during a two-staged training procedure. Our code is available online at https://github.com/tobwei/APTAI.

Both approaches make use of self-supervised learning (SSL) models but in different setups. Taking ASR as an example, SOTA performance has been achieved using this paradigm, which includes pre-training on large amounts of unlabeled data and fine-tuning on a smaller, labeled dataset relevant to the desired downstream task. We chose wav2vec2 [28], which optimizes a contrastive loss during pre-training to learn a finite set of speech representations. These can be fine-tuned for a broad set of applications, with ASR as the original intended use case. Thus, such embeddings are expected to capture meaningful features of speech that are relevant for phonemes, which in turn can be identified by specific articulator configurations.

Table 1: Fine-tuned phoneme recognizer results (PER [%] ↓), using CP train/dev splits, for different pre-trained models.
wav2vec2-          CP–test   HPRC–N   HPRC–F
base-960h          17.77     10.10    19.98
large-960h         18.71     11.47    24.27
large-lv60          9.75      4.96    13.76
large-960h-lv60     9.30      4.55    10.69
large-robust        8.83      4.45    10.53
xls-r-300m         11.70      7.77    19.38
xls-r-1b           18.50     12.69    27.29
large-xlsr-53      10.17      5.40    14.55

2.1 Frame Classification: APTAI

Of the two proposed approaches, APTAI follows a more classical setup; see Figure 2 for an overview. The general idea is to fine-tune wav2vec2 to make use of its pre-trained speech representations. We therefore keep the feature extractor frozen (pre-trained weights) and only train the transformer layers (pre-trained initialization) together with two added heads (randomly initialized). Furthermore, we add a convolutional layer (fixed parameters), adapted from [29], which behaves like a low-pass (sinc) filter. This enforces the smoothness of the predicted TV trajectories, which is required since frame-based signal regression typically suffers from high-frequency noise between the individual frame predictions.
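The effect of such a fixed low-pass layer can be illustrated with a windowed-sinc FIR kernel applied along the time axis of one trajectory. This is a minimal sketch and not the layer from [29]: the number of taps, the Hamming window, the 10 Hz cutoff, and the border handling are assumptions chosen purely for illustration.

```python
import math

def sinc_lowpass_kernel(cutoff_hz, frame_rate_hz, num_taps=21):
    """Build a fixed (non-trainable) Hamming-windowed sinc low-pass kernel.

    cutoff_hz, num_taps and the window type are illustrative assumptions.
    """
    assert num_taps % 2 == 1, "use an odd number of taps for a symmetric kernel"
    fc = cutoff_hz / frame_rate_hz  # normalized cutoff in cycles per frame
    mid = num_taps // 2
    kernel = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        kernel.append(ideal * window)
    total = sum(kernel)
    return [w / total for w in kernel]  # normalize to unit gain at 0 Hz

def smooth(trajectory, kernel):
    """Filter one TV trajectory; borders are handled by repeating the edge value."""
    mid = len(kernel) // 2
    padded = [trajectory[0]] * mid + list(trajectory) + [trajectory[-1]] * mid
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(trajectory))
    ]

kernel = sinc_lowpass_kernel(cutoff_hz=10.0, frame_rate_hz=49.0)
# Made-up trajectory: a slow ramp plus frame-to-frame jitter of alternating sign.
noisy = [0.1 * t + (0.5 if t % 2 else -0.5) for t in range(60)]
smoothed = smooth(noisy, kernel)
```

Because the kernel is symmetric and sums to one, the slow ramp passes through unchanged away from the borders, while the frame-to-frame jitter is strongly attenuated.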

A 16 kHz input speech signal $x(t)$ is divided into $T$ frames $\bm{x}_t \in \mathbb{R}^{512}$ at 49 Hz by the feature encoder. After passing the transformer layers, producing $\bm{h}_t \in \mathbb{R}^{1024}$, the TV head takes this output and ultimately predicts $\bm{\hat{y}}^{tv}_t \in \mathbb{R}^{TV}$, i.e. $TV = 9$ smoothed values for each frame $t$. As part of the MTL goal, this head optimizes the reconstruction mean square error (MSE) loss between the predicted $\bm{\hat{y}}^{tv}_t$ and ground truth $\bm{y}^{tv}_t$ TV values, which is expressed in the second term of Equation 1.
The phoneme head also takes $\bm{h}_t$ as input and predicts a probability distribution $\hat{p}_{t,c}$ over $\mathcal{C} = 45$ phoneme labels per frame $t$, with $c \in \mathcal{C}$. This frame-wise classification is optimized via cross-entropy (CE) loss between the predicted $\hat{p}_{t,c}$ and ground truth $p_{t,c}$ probability distribution (see first term in Equation 1). Applying softmax to the resulting logits and choosing the phoneme label $c$ that yields the maximum probability per frame $t$ will result in an alignment, whilst a phoneme sequence can be obtained by grouping over the individual frame predictions. Finally, Equation 1 shows the MTL loss $\mathcal{L}_{\text{FC}}$ for the APTAI approach, with $\lambda$ as weighting factor.

$$\mathcal{L}_{\text{FC}} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{\mathcal{C}} p_{t,c}\log(\hat{p}_{t,c}) + \lambda\frac{1}{T}\sum_{t=1}^{T}\left(\bm{y}^{tv}_{t} - \bm{\hat{y}}^{tv}_{t}\right)^{2} \tag{1}$$
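The frame-classification objective and the decoding step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the released implementation: the ground truth is taken to be one-hot (given as class indices), the squared error is summed over the TV dimensions of each frame, and all function names and toy values are hypothetical.

```python
import math
from itertools import groupby

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def frame_classification_loss(phoneme_logits, phoneme_targets, tv_pred, tv_true, lam=1.0):
    """Cross-entropy over frames plus lam times the squared TV error (Equation 1).

    phoneme_logits: T lists of C logits; phoneme_targets: T class indices
    (one-hot ground truth, an assumption); tv_pred / tv_true: T lists of TV values.
    """
    T = len(phoneme_logits)
    ce, mse = 0.0, 0.0
    for t in range(T):
        probs = softmax(phoneme_logits[t])
        ce -= math.log(probs[phoneme_targets[t]])
        # Squared error summed over TV dimensions (assumed reading of the vector square).
        mse += sum((y - y_hat) ** 2 for y, y_hat in zip(tv_true[t], tv_pred[t]))
    return ce / T + lam * mse / T

def decode(phoneme_logits):
    """Frame-wise argmax gives the alignment; grouping repeats gives the sequence."""
    # Softmax is monotonic, so the argmax of the logits equals the argmax of the probabilities.
    alignment = [max(range(len(l)), key=l.__getitem__) for l in phoneme_logits]
    sequence = [label for label, _ in groupby(alignment)]
    return alignment, sequence

# Toy example: T = 4 frames, C = 3 phoneme classes, 2 TVs (all values made up).
logits = [[2.0, 0.0, 0.0], [3.0, 1.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
targets = [0, 0, 1, 2]
tv_true = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4], [0.3, 0.6]]
tv_pred = [[0.0, 0.1], [0.1, 0.1], [0.3, 0.4], [0.3, 0.5]]
loss = frame_classification_loss(logits, targets, tv_pred, tv_true, lam=1.0)
alignment, sequence = decode(logits)
```

Note that grouping consecutive frames cannot distinguish a long phoneme from two adjacent instances of the same phoneme, which is a known limitation of purely frame-wise decoding.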

2.2 Forced Alignment: f-APTAI

The idea behind the second approach, f-APTAI, is to make use of hidden representations from a fine-tuned phoneme recognizer in combination with a forced alignment of the predicted output phoneme sequence. To this end, we use a two-staged approach during training, depicted in Figure 3. We make use of different datasets for the two stages; see Section 3.1 for details.

For the first stage, we fine-tune the same SSL architecture (wav2vec2) used in APTAI by adding a linear layer producing $\bm{l}_t \in \mathbb{R}^{\mathcal{C}_{\emptyset}}$, representing the same $\mathcal{C} = 45$ phoneme labels with the addition of a blank token $\emptyset$, per frame $t \in T$ (see Section 2.1). Similar to the ASR application, we optimize this model using the connectionist temporal classification (CTC) loss. This optimization behaves like a state machine, similar to hidden Markov models (HMMs), and only requires a phoneme sequence as additional input during training. However, CTC does not produce an alignment but rather outputs a frame-asynchronous (in our case) phoneme label sequence through a frame-synchronous decoding procedure (beam search), utilizing the blank token and multiple possible alignment paths. Given a true phoneme label sequence $\mathcal{W}$, let $\mathcal{S}$ represent all possible paths of length $T$ that map to $\mathcal{W}$ when repeated labels and blanks are removed. Then, $P(s_t \mid \bm{l}_t)$ represents the output of the model at $t$, obtained by applying softmax to $\bm{l}_t$, with $[s_{1:T}] \in \mathcal{S}$. Adapted from [21], the CTC loss can be defined as:

$$\mathcal{L}_{\text{CTC}} = -\log\sum_{\mathcal{S}}\prod_{t=1}^{T} P(s_{t} \mid \bm{l}_{t}) \tag{2}$$
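To make the sum over paths in Equation 2 concrete, the following sketch evaluates it twice on a toy input: once by enumerating every path literally, and once with the standard forward (dynamic programming) recursion. It is meant as an illustration only; the blank index, the function names, and the toy probabilities are assumptions, and practical training code would work in log space (PyTorch, for instance, provides `torch.nn.CTCLoss`).

```python
import math
from itertools import product

BLANK = 0  # assumed index of the blank token

def collapse(path):
    """CTC collapsing: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def ctc_loss_brute_force(probs, target):
    """-log of the summed probability of every length-T path that collapses to target."""
    T, num_labels = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(num_labels), repeat=T):
        if collapse(path) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return -math.log(total)

def ctc_loss_forward(probs, target):
    """The same quantity computed with the forward recursion."""
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]  # blank-interleaved target sequence
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for i, label in enumerate(ext):
            a = alpha[i]
            if i >= 1:
                a += alpha[i - 1]
            # The blank between two labels may be skipped only if the labels differ.
            if i >= 2 and label != BLANK and label != ext[i - 2]:
                a += alpha[i - 2]
            new[i] = a * probs[t][label]
        alpha = new
    total = alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)
    return -math.log(total)

# Toy softmax outputs for T = 4 frames over {blank, phoneme 1, phoneme 2} (made up).
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5], [0.7, 0.1, 0.2]]
loss_enumerated = ctc_loss_brute_force(probs, [1, 2])
loss_recursive = ctc_loss_forward(probs, [1, 2])
```

Both functions return the same value; the enumeration grows exponentially with $T$ and is only feasible for toy inputs, which is why the recursion is used in practice.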

The second stage of f-APTAI incorporates the frozen model trained during stage-1. Specifically, two parts are extracted and used during training of stage-2: the predicted CTC-based phoneme sequence (upper bound for stage-2) and the output of the last transformer layer. Here, let the former be $[p_{1:N}] \in \mathcal{P}^{N}$, where $p_n \in \mathcal{C}$ and $N$ is the maximum sequence length. The last transformer layer output can be expressed as matrix $\bm{H}$, consisting of column vectors $\bm{h}_t \in \mathbb{R}^{1024}$, with $t \in T$. This can be understood as acoustic phoneme embeddings since the stage-1 objective (see Equation 2) led to accordingly optimized weights. A principal component analysis (PCA) of these embeddings (extracted from the HPRC–N dataset, see Section 3.1) can be seen in Figure 4. The setup is similar to [30] and shows good speaker independence with phoneme clustering of exemplary chosen elongated vowels, a fricative, nasal, and plosive. The performed neural forced alignment is inspired by [23] and has the goal of producing a monotonic alignment, such that it aligns each phoneme label $p_n$ to a subset of consecutive hidden frame representations $\bm{h}_t$.
Therefore, one of the MTL optimization goals of f-APTAI is to learn a matrix $\bm{A} \in \mathbb{R}^{N \times T}$ that aligns $\mathcal{P}^{N}$ to $\bm{H}$. This objective is centered around a cross-attention computation between a learned linear projection of $\bm{h}_t$ to $\bm{h}^{p}_t \in \mathbb{R}^{128}$, resulting in $\bm{H}_p \in \mathbb{R}^{T \times 128}$, and a learned embedding of $\mathcal{P}^{N}$. This embedding is created via projection of each $p_n \in \mathcal{P}^{N}$ to $\mathbb{R}^{128}$ and the addition of a sinusoidal positional encoding [31], ultimately resulting in matrix $\bm{P} \in \mathbb{R}^{128 \times N}$. Finally, the cross-attention layer computes the alignment matrix $\bm{A} = \mathrm{softmax}(\bm{H}_p \cdot \bm{P})$.
We constrain $\bm{A}$ to be monotonic and diagonal, which is inspired by the forward-sum (FS) loss used in HMM systems, and adapted from [22, 24]. See the first term in Equation 3, where $\mathcal{O}$ is the optimal alignment.

$$\mathcal{L}_{\text{FA}} = -\sum_{\bm{H}_{p}, \bm{P} \in \mathcal{O}} \log P(\bm{P} \mid \bm{H}_{p}) + \lambda\frac{1}{T}\sum_{t=1}^{T}\left(\bm{y}^{tv}_{t} - \bm{\hat{y}}^{tv}_{t}\right)^{2} \tag{3}$$
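The attention computation $\mathrm{softmax}(\bm{H}_p \cdot \bm{P})$ behind the alignment matrix can be illustrated with toy dimensions. The sketch below is not the authors' cross-attention layer: it uses 2 instead of 128 embedding dimensions, normalizes over the phoneme positions of each frame (the softmax axis is an assumption), omits the forward-sum loss, and reads off a hard alignment by a simple per-frame argmax. The product has shape $T \times N$; its transpose corresponds to the $N \times T$ shape stated for $\bm{A}$.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_alignment(H_p, P):
    """Soft alignment from frame projections H_p (T x d) and phoneme embeddings P (d x N).

    Returns a T x N matrix whose rows are distributions over the N phoneme
    positions (normalizing per frame is an assumption of this sketch).
    """
    T, d, N = len(H_p), len(P), len(P[0])
    scores = [
        [sum(H_p[t][k] * P[k][n] for k in range(d)) for n in range(N)]
        for t in range(T)
    ]
    return [softmax(row) for row in scores]

def is_monotonic(hard_alignment):
    """True if the phoneme index never decreases and advances by at most one per frame."""
    return all(0 <= b - a <= 1 for a, b in zip(hard_alignment, hard_alignment[1:]))

# Toy example: T = 4 frames, N = 2 phonemes, d = 2 embedding dimensions (made up).
P = [[1.0, 0.0],
     [0.0, 1.0]]            # columns are the two phoneme embeddings
H_p = [[3.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 3.0]]
A = attention_alignment(H_p, P)
hard = [max(range(len(row)), key=row.__getitem__) for row in A]
```

In this toy case the first two frames attend to the first phoneme and the last two to the second, so the hard alignment is monotonic. In general a plain softmax gives no such guarantee, which is why an additional constraint such as the forward-sum loss is needed.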

Additionally, the cross-attention layer produces a hidden representation matrix $\in \mathbb{R}^{256 \times T}$. This sequence of column vectors over $T$ frames serves as input for the TV regression part of the f-APTAI model. Initially, it is passed through a single bi-directional long short-term memory (LSTM) layer, the output of which is ultimately projected to $\mathbb{R}^{TV}$. Moreover, the same fixed-parameter convolutional low-pass (sinc) filter as in APTAI is used to ensure the prediction of smooth TV trajectories $\bm{\hat{y}}^{tv}_t$. Consequently, the same MSE loss is also optimized, see the second term in Equation 3.

Figure 3: Proposed f-APTAI model, based on TV regression and a two-staged forced alignment via cross-attention.

3 Experimental Setup

It should be noted that our upper bound for both approaches, in terms of phoneme recognition and alignment, is a SOTA [23] text-dependent force aligner from WebMAUS [32], because we produce our ground truth phoneme labels and time steps via this web API. We make use of Common Phone (see Section 3.1) for its robustness. Since this dataset was labeled using the same process, we also apply it to HPRC, the second dataset that we use, to guarantee compatibility.

Figure 4: PCA of the embeddings from the best-performing fine-tuned phoneme recognition model from Table 1.

3.1 Datasets

One of the two datasets that we use during experiments is Common Phone (CP) [33], which is based on the crowd-sourced Common Voice [34]. Here, we utilize the English subset (45 phoneme labels). The main motivation behind using CP is that we want to build a robust system. When comparing CP to e.g. TIMIT [35], this robustness becomes evident: TIMIT is recorded in the same acoustically controlled environment with professional equipment, whereas CP is based on recordings from people’s smartphones in many different uncontrolled environments.

Table 2: Leave-one-speaker-out results (mean and deviation across eight test speakers) for the two proposed approaches.
Model, Test Data    PCC ↑          RMSE [mm] ↓    PER [%] ↓       Overlap [%] ↑
APTAI, HPRC–N       0.73 ± 0.03    0.67 ± 0.03     6.25 ± 1.30    87.38 ± 1.16
APTAI, HPRC–F       0.69 ± 0.03    0.72 ± 0.03     6.41 ± 1.76    84.91 ± 1.93
f-APTAI, HPRC–N     0.71 ± 0.03    0.68 ± 0.03     4.36 ± 0.07    76.18 ± 1.59
f-APTAI, HPRC–F     0.65 ± 0.03    0.74 ± 0.03    10.29 ± 3.62    72.93 ± 2.92

The second dataset we use contains articulator-related information in the form of EMA sensor data. This dataset is the Haskins Production Rate Comparison (HPRC) [36], which contains recordings from four female and four male subjects reciting 720 phonetically balanced IEEE sentences at "normal" (HPRC–N) and "fast" (HPRC–F) speaking rates. The speakers in this dataset repeat utterances; however, we randomly select only one repetition per utterance and speaker. Furthermore, we used the MAUS aligner to create our ground truth phoneme labels and time steps. This dataset comes with labels from another aligner, but we wanted to make it compatible with the CP dataset. Next, we performed pre-processing on the EMA data: some of the coordinates contained NaN values, which we replaced via linear interpolation before low-pass (Butterworth) filtering the sensor data at 20 Hz to eliminate recording-related noise. After this, the EMA coordinates were transformed into nine TVs (see Figure 1) and some final processing was applied to them. The original EMA data was sampled at 100 Hz, resulting in TVs at the same rate. We resampled them to 49 Hz to synchronize them with the output frame rate of wav2vec2. Finally, we applied utterance-wise z-score normalization based on the individual TVs.
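Three of the pre-processing steps listed above (NaN interpolation, resampling from 100 Hz to 49 Hz, and utterance-wise z-score normalization) can be sketched for a single trajectory as follows. This is a simplified illustration and not the pipeline used for the experiments: the Butterworth filtering and the EMA-to-TV transformation are omitted, the resampling method is not stated in the text and is replaced here by plain linear interpolation, and all function names are hypothetical.

```python
import math

def interpolate_nans(values):
    """Replace NaN samples by linear interpolation between the nearest valid neighbours.

    Assumes at least one valid sample; gaps at the borders hold the nearest valid value.
    """
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    out = list(values)
    for i, v in enumerate(values):
        if math.isnan(v):
            left = max((j for j in valid if j < i), default=None)
            right = min((j for j in valid if j > i), default=None)
            if left is None:
                out[i] = values[right]
            elif right is None:
                out[i] = values[left]
            else:
                w = (i - left) / (right - left)
                out[i] = (1 - w) * values[left] + w * values[right]
    return out

def resample_linear(values, src_hz, dst_hz):
    """Resample by linear interpolation (a stand-in; needs at least two samples)."""
    duration = (len(values) - 1) / src_hz
    n_out = int(math.floor(duration * dst_hz)) + 1
    out = []
    for k in range(n_out):
        pos = k / dst_hz * src_hz           # fractional index into the source signal
        i = min(int(pos), len(values) - 2)
        frac = pos - i
        out.append((1 - frac) * values[i] + frac * values[i + 1])
    return out

def zscore(values):
    """Normalize one trajectory of one utterance to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Made-up data: a short trajectory with gaps, and a one-second ramp sampled at 100 Hz.
raw = [0.0, float("nan"), 2.0, float("nan"), float("nan"), 5.0]
filled = interpolate_nans(raw)
ramp_100hz = [i / 100 for i in range(101)]
ramp_49hz = resample_linear(ramp_100hz, 100, 49)
normalized = zscore(ramp_49hz)
```

When reducing the sampling rate, low-pass filtering before interpolation helps to avoid aliasing; in the described pipeline the 20 Hz filter is applied before the resampling step.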

Figure 5: Example model prediction (APTAI) and ground truth for an unseen speaker (only a selection of all TVs is shown).

3.2 Model Evaluation

We evaluate the APTAI task in terms of the two MTL sub-objectives. The articulation regression performance is evaluated using two well-known metrics: the root mean square error (RMSE) based on the normalized values and the Pearson correlation coefficient (PCC). To evaluate the phoneme recognition performance, we use the phoneme error rate (PER), where the ground truth is based on the WebMAUS grapheme-to-phoneme conversion. Phoneme alignment is also evaluated with respect to this text-dependent upper bound, using the frame-wise overlap (percentage of correctly predicted frames).
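For reference, the four metrics can be written compactly in plain Python. The sketch below follows the standard definitions and is not taken from the evaluation code; in particular, PER is assumed to be the Levenshtein distance between predicted and reference phoneme sequences divided by the reference length, and the frame-wise overlap assumes that both label sequences have the same number of frames.

```python
import math

def pcc(pred, true):
    """Pearson correlation coefficient between two trajectories."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)

def rmse(pred, true):
    """Root mean square error between two trajectories."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def frame_overlap(pred_frames, true_frames):
    """Percentage of frames whose predicted phoneme label matches the reference."""
    hits = sum(p == t for p, t in zip(pred_frames, true_frames))
    return 100.0 * hits / len(true_frames)

def per(pred_seq, true_seq):
    """Phoneme error rate in percent: edit distance divided by reference length."""
    prev = list(range(len(true_seq) + 1))
    for i, p in enumerate(pred_seq, start=1):
        cur = [i]
        for j, t in enumerate(true_seq, start=1):
            cur.append(min(prev[j] + 1,              # extra predicted phoneme
                           cur[j - 1] + 1,           # missing reference phoneme
                           prev[j - 1] + (p != t)))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(true_seq)

# Made-up example: one of four frames is mislabeled, one of four phonemes is missing.
overlap = frame_overlap(["k", "k", "ae", "t"], ["k", "ae", "ae", "t"])
error_rate = per(["k", "ae", "t"], ["k", "ae", "t", "s"])
```

The two phoneme metrics capture different aspects: PER ignores timing entirely, whereas the frame-wise overlap penalizes every frame on the wrong side of a phoneme boundary.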

3.3 Model Training

The following setup was used to train/validate our two proposed approaches, using the PyTorch framework. For CP, we used the official train/dev/test splits. To test the performance of our models, we used HPRC. Here, we applied leave-one-speaker-out testing, i.e., data from seven speakers was used for training/validation (90%/10%), and the data of the remaining speaker was used to test (separated by speaking rates). Additionally, we performed the training split in such a way that only unseen utterances were used for validation. The same optimizer (Adam), learning rate (1e-5), learning-rate scheduler (warm-up, static, and decaying epochs), batch size of 5, and model selection metric (TV RMSE) were used for both proposed approaches. We experimented with MTL strategies (e.g. alternating epochs) but with no improvement in performance.
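The leave-one-speaker-out protocol with a 90%/10% training/validation split can be sketched as a small generator. The code reflects one possible reading of the description and is not the released training script: validation data is held out by sentence, so that no validation sentence also appears in the training portion, and the speaker and sentence identifiers are hypothetical.

```python
import random

def loso_splits(utterances, val_fraction=0.1, seed=0):
    """Yield (test_speaker, train, val, test) for leave-one-speaker-out evaluation.

    utterances: list of (speaker_id, sentence_id) pairs. Validation sentences are
    held out by sentence_id (an assumed reading of "only unseen utterances").
    """
    speakers = sorted({spk for spk, _ in utterances})
    for test_speaker in speakers:
        test = [u for u in utterances if u[0] == test_speaker]
        rest = [u for u in utterances if u[0] != test_speaker]
        sentences = sorted({sent for _, sent in rest})
        random.Random(seed).shuffle(sentences)
        n_val = max(1, round(val_fraction * len(sentences)))
        val_sentences = set(sentences[:n_val])
        train = [u for u in rest if u[1] not in val_sentences]
        val = [u for u in rest if u[1] in val_sentences]
        yield test_speaker, train, val, test

# Hypothetical corpus: 8 speakers, each reading the same 20 sentences.
utterances = [(f"S{s}", i) for s in range(8) for i in range(20)]
folds = list(loso_splits(utterances))
```

Each of the eight folds trains and validates on seven speakers and tests on the remaining one; the reported means and deviations are then taken across the folds.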

APTAI, utilizing wav2vec2-large-robust (see Table 1), was trained for 20 epochs, with 20% dropout, and combined HPRC–N and –F for training/validation. In terms of the MTL loss optimization, we set $\lambda = 1$, thus weighting both tasks equally, which resulted in the best performance.

Fine-tuning of the phoneme recognizer for stage-1 of f-APTAI was based on wav2vec2-large-robust (best performance, see Table 1) with a batch size of 2, 160 epochs, learning rate of 5e-6, a final dropout of 10%, and model selection based on validation PER. For stage-2, we trained for 60 epochs, used only HPRC–N (since including F would negatively impact the PER of stage-1), set $\lambda = 0.4$, and $N = 60$, with shorter phoneme sequences being padded. Finally, the implementation of the FS loss was taken from [24].

4 Results and Discussion

Table 1 reveals that CP is a noisy dataset, while HPRC is not. Within HPRC, the PER is better for the "normal" speaking rate, while the "fast" rate is more challenging (also for human listeners). Across all test sets, wav2vec2-large-robust performs best.

Table 2 shows the main evaluation test results of the introduced APTAI task, conducted in a speaker-independent (LOSO) setting. Figure 5 illustrates prediction performance, showing a selection of TVs for improved readability, whilst Figure 6 shows all TVs individually. In terms of TV metrics, both models perform similarly, with APTAI achieving the best mean PCC of 0.73. Comparing this result to other works is difficult since setups are not uniform (e.g. trimming of silence), and reproduced results do not match originally reported ones [6, 16]. However, reported speaker-independent PCC results on HPRC roughly range from 35% to 76%, so we achieve competitive performance. In terms of phoneme recognition and alignment, frame classification outperforms the forced alignment approach by 11.20 percentage points on HPRC–N, achieving a frame overlap of 87.38%. Shih et al. [24] reported that, in their experiments, a wider receptive field led to alignment instability. The fact that we use hidden transformer representations, capturing weighted global sequence dependencies, might explain the reduced alignment performance, which requires future research. Overall, the work of Siriwardena et al. [7] is similar; however, they report a PER of approx. 27% (and no alignment metric) since they see the phoneme-related objective as an auxiliary task to improve TV-related performance, while we see both tasks as equally important.

Table 3 and Figure 6 show that the regression of TMCD and TBCD in particular performs considerably worse than that of the other TVs, hampering the overall mean PCC. This needs further investigation since other papers do not seem to suffer from this problem.

Figure 6: Ground truth and model prediction (APTAI) in red color, of an unseen speaker (refer to Figure 1 as legend).
Table 3: Individual TV metrics, in terms of mean and deviation across the leave-one-speaker-out experiments (APTAI model).
        HPRC–N                         HPRC–F
TV      PCC ↑          RMSE [mm] ↓     PCC [%] ↑        RMSE [mm] ↓
LA      0.87 ± 0.03    0.49 ± 0.06     81.76 ± 4.89     0.57 ± 0.07
LP      0.75 ± 0.08    0.66 ± 0.10     66.93 ± 8.57     0.75 ± 0.10
JA      0.82 ± 0.04    0.57 ± 0.06     73.97 ± 4.19     0.67 ± 0.06
TTCL    0.84 ± 0.04    0.54 ± 0.06     81.85 ± 3.25     0.56 ± 0.05
TTCD    0.79 ± 0.04    0.61 ± 0.06     74.14 ± 5.48     0.67 ± 0.06
TMCL    0.82 ± 0.03    0.57 ± 0.04     79.38 ± 2.47     0.60 ± 0.04
TMCD    0.37 ± 0.11    1.07 ± 0.09     27.94 ± 11.34    1.13 ± 0.09
TBCL    0.77 ± 0.04    0.64 ± 0.05     74.36 ± 4.36     0.67 ± 0.06
TBCD    0.54 ± 0.15    0.88 ± 0.14     56.57 ± 14.53    0.85 ± 0.14

5 Conclusion

This paper introduced APTAI, a novel combination of two tasks previously viewed separately. We investigated two different approaches, sharing the same robust requirements but differing mainly in their method of phoneme prediction and alignment. Here, the frame-classification-based APTAI model performed better, especially in terms of phoneme-related metrics. However, f-APTAI, based on forced alignment, has potentially more room for improvement in future work. An example of this, applicable to both models and requiring new pre-training, is changing the output frame shift of wav2vec2 to 10 ms instead of 20 ms by changing the stride of the feature extractor, to improve alignment performance [23] and enable 100 Hz TV regression.
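The relation between the feature-extractor strides and the output frame rate can be checked with a short calculation. The kernel sizes and strides below are those of the standard wav2vec 2.0 convolutional feature encoder; the modified stride tuple is only a hypothetical example of how a 10 ms frame shift could be obtained and is not a configuration evaluated in this paper.

```python
def conv_output_length(n_samples, kernels, strides):
    """Number of output frames after a stack of unpadded 1-D convolutions."""
    length = n_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

# Standard wav2vec 2.0 feature encoder: total stride 5 * 2**6 = 320 samples (20 ms at 16 kHz).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
frames_20ms = conv_output_length(16000, KERNELS, STRIDES)   # one second of 16 kHz audio

# Hypothetical variant: stride 1 in the last layer, total stride 160 samples (10 ms).
STRIDES_10MS = (5, 2, 2, 2, 2, 2, 1)
frames_10ms = conv_output_length(16000, KERNELS, STRIDES_10MS)
```

One second of audio yields 49 frames with the standard strides, which matches the 49 Hz rate used throughout, and 98 frames with the hypothetical variant, i.e. close to the 100 Hz sampling rate of the original EMA data.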

6 Acknowledgements

Suppressed due to anonymous submission to INTERSPEECH 2024.

References

  • [1] C. P. Browman and L. Goldstein, “Gestural specification using dynamically-defined articulatory structures,” Journal of Phonetics, vol. 18, no. 3, pp. 299–320, 1990.
  • [2] A. Ji, “Speaker independent acoustic-to-articulatory inversion,” Ph.D. dissertation, Marquette University, 2014.
  • [3] R. S. McGowan, “Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests,” Speech Communication, vol. 14, no. 1, pp. 19–48, 1994.
  • [4] J. Chartier, G. K. Anumanchipalli, K. Johnson et al., “Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex,” Neuron, vol. 98, no. 5, pp. 1042–1054, 2018.
  • [5] P. Wu, L.-W. Chen, C. J. Cho, S. Watanabe, L. Goldstein, A. W. Black, and G. K. Anumanchipalli, “Speaker-independent acoustic-to-articulatory speech inversion,” 2023.
  • [6] J. Wang, J. Liu, L. Zhao, S. Wang, R. Yu, and L. Liu, “Acoustic-to-articulatory inversion based on speech decomposition and auxiliary feature,” in ICASSP 2022. IEEE, 2022, pp. 4808–4812.
  • [7] Y. M. Siriwardena, G. Sivaraman, and C. Espy-Wilson, “Acoustic-to-articulatory speech inversion with multi-task learning,” arXiv preprint arXiv:2205.13755, 2022.
  • [8] G. Beguš, A. Zhou, P. Wu, and G. K. Anumanchipalli, “Articulation gan: Unsupervised modeling of articulatory learning,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  • [9] S. K. Maharana, A. Illa et al., “Acoustic-to-articulatory inversion for dysarthric speech by using cross-corpus acoustic-articulatory data,” in ICASSP 2021. IEEE, pp. 6458–6462.
  • [10] N. R. Benway, Y. M. Siriwardena et al., “Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders,” in Proc. INTERSPEECH 2023, pp. 4568–4572.
  • [11] C. Haldin et al., “Speech rehabilitation in post-stroke aphasia using visual illustration of speech articulators: A case report study,” Clinical Linguistics & Phonetics, vol. 35, no. 3, pp. 253–276, 2021.
  • [12] T. Sweeney, F. Hegarty et al., “Randomized controlled trial comparing parent led therapist supervised articulation therapy (plat) with routine intervention for children with speech disorders associated with cleft palate,” International Journal of Language & Communication Disorders, vol. 55, no. 5, pp. 639–660, 2020.
  • [13] Y. M. Siriwardena and C. Espy-Wilson, “The secret source: Incorporating source features to improve acoustic-to-articulatory speech inversion,” in ICASSP 2023. IEEE, pp. 1–5.
  • [14] N. Seneviratne, G. Sivaraman, and C. Espy-Wilson, “Multi-Corpus Acoustic-to-Articulatory Speech Inversion,” in Proc. Interspeech 2019, 2019, pp. 859–863.
  • [15] G. Sivaraman, V. Mitra, H. Nam, M. Tiede, and C. Espy-Wilson, “Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion,” The Journal of the Acoustical Society of America, vol. 146, no. 1, pp. 316–329, 2019.
  • [16] A. S. Shahrebabaki, S. M. Siniscalchi et al., “Sequence-to-sequence articulatory inversion through time convolution of sub-band frequency signals,” Proc. Interspeech 2020.
  • [17] T. Biasutto-Lervat and S. Ouni, “Phoneme-to-articulatory mapping using bidirectional gated rnn,” in Interspeech 2018.
  • [18] V. Ribeiro, K. Isaieva et al., “Automatic generation of the complete vocal tract shape from the sequence of phonemes to be articulated,” Speech Communication, vol. 141, pp. 1–13, 2022.
  • [19] A. Singh, A. Illa, and P. K. Ghosh, “A comparative study of estimating articulatory movements from phoneme sequences and acoustic features,” in ICASSP 2020, pp. 7334–7338.
  • [20] S. Udupa, A. Roy, A. Singh, A. Illa, and P. K. Ghosh, “Estimating articulatory movements in speech production with transformer networks,” Proc. Interspeech 2021.
  • [21] Q. Li, C. Zhang, and P. C. Woodland, “Combining frame-synchronous and label-synchronous systems for speech recognition,” arXiv preprint arXiv:2107.00764, 2021.
  • [22] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B. Catanzaro, “One tts alignment to rule them all,” in ICASSP 2022. IEEE, 2022, pp. 6092–6096.
  • [23] J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” in ICASSP 2022. IEEE, pp. 8167–8171.
  • [24] K. J. Shih, R. Valle et al., “Rad-tts: Parallel flow-based tts with robust alignment learning and diverse synthesis,” in ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
  • [25] F. Kreuk, J. Keshet, and Y. Adi, “Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation,” in Proc. Interspeech 2020, pp. 3700–3704.
  • [26] Y. Li, B. J. Wohlan, D.-S. Pham, K. Y. Chan, R. Ward, N. Hennessey, and T. Tan, “Improving text-independent forced alignment to support speech-language pathologists with phonetic transcription,” Sensors, vol. 23, no. 24, p. 9650, 2023.
  • [27] J. Lian, C. Feng, N. Farooqi, S. Li et al., “Unconstrained dysfluency modeling for dysfluent speech transcription and detection,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1–8.
  • [28] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020.
  • [29] M. Parrot, J. Millet, and E. Dunbar, “Independent and automatic evaluation of acoustic-to-articulatory inversion models,” Proc. Interspeech 2020.
  • [30] T. tom Dieck, P.-A. Pérez-Toro, T. Arias-Vergara, E. Nöth, and P. Klumpp, “Wav2vec behind the scenes: How end2end models learn phonetics,” Proc. Interspeech 2022, pp. 5130–5134.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [32] T. Kisler, U. Reichel, and F. Schiel, “Multilingual processing of speech via web services,” Computer Speech & Language, vol. 45, pp. 326–347, 2017.
  • [33] P. Klumpp et al., “Common phone: A multilingual dataset for robust acoustic modelling,” Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 763–768, 2022.
  • [34] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer et al., “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
  • [35] J. S. Garofolo, “Timit acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
  • [36] M. Tiede, C. Y. Espy-Wilson et al., “Quantifying kinematic aspects of reduction in a contrasting rate production task,” The Journal of the Acoustical Society of America, vol. 141, no. 5_Supplement, pp. 3580–3580, 2017.