Angelo Ortiz Tandazo (1, 2), Thomas Schatz (3), Thomas Hueber (2), Emmanuel Dupoux (1, 4)

Simulating articulatory trajectories with phonological feature interpolation

Abstract

As a first step towards a complete computational model of speech learning involving perception-production loops, we investigate the forward mapping between pseudo-motor commands and articulatory trajectories. Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence. Different interpolation techniques are compared to generate smooth trajectories in these feature spaces, with a potential optimisation of the target value and timing to capture co-articulation effects. We report the Pearson correlation between a linear projection of the generated trajectories and articulatory data derived from a multi-speaker dataset of electromagnetic articulography (EMA) recordings. A correlation of 0.67 is obtained with an extended feature set based on generative phonology and a linear interpolation technique. We discuss the implications of our results for our understanding of the dynamics of biological motion.

keywords:
speech production, computational modelling, phonological features, articulatory-to-acoustic mapping

1 Introduction

Recent advances in self-supervised learning (SSL) have led to progress in various speech processing tasks [1, 2] and language modelling from speech units [3]. These SSL models require ever larger amounts of (unlabelled) data to capture as much acoustic variance as possible. Moreover, their capacity to learn high-level language representations hinges on the quality of the underlying speech units [4], which are not linguistically interpretable [2, 5]. Importantly, these representations remain sensitive to contextual effects such as co-articulation [6], making them sub-optimal for efficiently coding context-invariant phonological units.

According to the motor [7] or perceptuo-motor theories [8] of speech perception, humans’ ‘quest for invariance’ [9] is carried out by recovering motor or articulatory representations from an auditory input. These representations are supposed to be less variable than their acoustic counterparts. Several studies have found neuro-physiological correlates of this mental ‘sensory-to-motor’ inverse mapping in speech perception [10]. Incorporating such motor representations into SSL speech models could potentially improve their performance (e.g. noise robustness, low-resource downstream tasks, etc.) and lead to more plausible computational models of speech and language acquisition.

In a simplified perception-production loop of speech motor control [11] (see the bottom left of Figure 1), the motor commands are derived from the sensory, acoustic signal and used to generate the underlying articulatory trajectories (via a so-called forward model).

Figure 1: Simplified diagram of a speech perception-production loop (to the left). The focus of this work lies in the forward model and the linear probing (to the right).

Several computational models of such speech perception-production loops have been proposed in the literature [12, 13, 14, 15]. However, in most studies, the motor or articulatory representations are derived from a specific speaker or a specific articulatory model. As a result, these models, while enabling the study by simulation of some of the processes underlying speech acquisition, perception and motor control, are not designed to scale up to large numbers of speakers or languages.

As a first step towards a more universal SSL speech model integrating motor or articulatory representations, we focus in the present study on the forward model, i.e. the mapping from motor commands to articulatory trajectories. First, we investigate different feature sets to encode a given phonetic target sequence, relying on either generative phonology (GP, [16]) or articulatory phonology (AP, [17]). Importantly, a phonetic target is here encoded in terms of phonologically-motivated and articulatory-related categories (e.g. the place of articulation for a consonant in GP, the location and degree of a constriction in AP).

To generate continuous (and smooth) trajectories in these feature spaces (with a forward model), we test different interpolation methods, which differ in the dynamic properties desired at each phonetic target (e.g. zero and/or continuous velocity at each target). To account for the uncertainty of our timing heuristic and the (potential) target undershoot phenomenon, we also consider variants performing target optimisation both in space (find the offset from the ideal feature value, e.g. lips only partly closed) and time (reach a target sooner or later). Moreover, the proposed approach can deal with unspecified features for which the value depends on the context (for instance, the position of a constriction modulated by the vocalic context). The generated trajectories in the GP or AP feature spaces are evaluated using a linear probing technique. A linear model is learnt between the generated trajectories on the one hand, and the parameters of an articulatory model [18] built from a multi-speaker electromagnetic articulography (EMA) dataset on the other. Such an evaluation was also used in [19] to probe HuBERT representations, among other SSL models.

The main contributions and findings of the paper are the following: (i) we propose a general methodology to probe pseudo-motor commands and forward models in a computational model of a speech perception-production loop; (ii) we show that features derived from generative phonology (GP) correlate better with real articulatory recordings than articulatory phonology (AP) ones; (iii) somewhat surprisingly, we show that a linear interpolation between these features captures the dynamics of real articulatory data better than a more complex (spline-based) one with constraints on the velocity and/or acceleration at each phonetic target; (iv) we show that the use of unspecified (context-dependent) phonological features improves performance, probably by allowing the forward model to better account for natural co-articulation patterns.

The code with the data processing and interpolation methods can be found at [to ensure author anonymity, the link to the resource will be added after the review process].

2 Methodology

2.1 Phonological feature set

Two phonological feature sets were used. The first one, based on generative phonology and referred to as the GP feature set, was proposed by [20, Chapter 4]. It describes phonetic targets in terms of 26 manner, laryngeal and place features. The GP feature set is considered under two variants: one supporting unknown values and one binary. Following [20, p. 91], some phonemes have zero-value features, notably because their values depend on their local context within an utterance or simply because they are irrelevant to the underlying phoneme. Hence, the ternary-valued (including the zero values) GP feature set is considered as is (to be used by interpolating methods handling unknown values) but also in a binary form, in which a zero value is considered as being the ‘absence’ of the given feature (thus, negatively valued).
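The relation between the two GP variants can be illustrated with a minimal sketch. The feature names and values below are hypothetical placeholders, not entries of the feature table in [20]; only the conversion rule (a zero value becomes a negative value in the binary form) comes from the text.

```python
# Hypothetical ternary GP vectors: +1 (present), -1 (absent), 0 (unspecified).
# Feature names and values are illustrative only, not taken from [20].
FEATURE_NAMES = ["voice", "nasal", "round"]
TERNARY_GP = {
    "m": [+1, +1, 0],
    "p": [-1, -1, 0],
    "u": [+1, -1, +1],
}


def to_binary(ternary_vector):
    """Map unspecified (0) values to -1, i.e. 'absence' of the feature."""
    return [-1 if value == 0 else value for value in ternary_vector]


def unknown_mask(ternary_vector):
    """Flag the features whose value must be found from the context."""
    return [value == 0 for value in ternary_vector]


binary_gp = {phoneme: to_binary(vec) for phoneme, vec in TERNARY_GP.items()}
```

In this toy table, the binary variant of "m" is `[1, 1, -1]`, while the unknown-supporting variant keeps the third feature flagged as unspecified.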

The second feature set is based on articulatory phonology (AP), and the location and degrees of constriction of 5 major articulators in the vocal tract. AP-based features have been successfully used in automatic speech recognition, first within a Bayesian framework [21] to deal with pronunciation variability in spontaneous speech, and then in a DNN-based system [22] to increase the robustness to noise. The AP feature values come in the form of categorical distributions over totally ordered categories [21, p. 126]. Feature values are typically Dirac distributions, except for some phoneme features that depend on the phonemic context. Similarly to the GP feature set, we have an AP unknown variant by considering the non-Dirac distributed feature values as unknown values to be found contextually. This feature set is then used in a scalar version, in which each feature-value category is mapped to a real value (scalar AP: 8 features); and a one-hot version, in which the feature-value categories are one-hot encoded (one-hot AP: 32 features).
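The difference between the scalar and one-hot AP versions can be sketched as follows. The category names and the mapping of ordered categories onto evenly spaced values in [0, 1] are assumptions made for illustration; the text only states that each category is mapped to a real value.

```python
# Hypothetical ordered categories for a single AP feature (constriction
# degree). The names are illustrative and not taken from [21].
DEGREE_CATEGORIES = ["closed", "critical", "narrow", "mid", "wide"]


def scalar_encoding(category, categories):
    """Map an ordered category to an evenly spaced value in [0, 1]."""
    return categories.index(category) / (len(categories) - 1)


def one_hot_encoding(category, categories):
    """Map a category to a one-hot vector with one slot per category."""
    return [1.0 if c == category else 0.0 for c in categories]


scalar_value = scalar_encoding("narrow", DEGREE_CATEGORIES)
one_hot_value = one_hot_encoding("narrow", DEGREE_CATEGORIES)
```

Under the scalar version each feature contributes one dimension, whereas under the one-hot version it contributes as many dimensions as it has categories, which is why the dimensionality grows from 8 to 32.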

Table 1: Articulatory score: average Pearson correlation coefficients of the 6 articulatory parameters and 6 speakers. The scores correspond to each interpolation method’s best configuration. Default: binary features without optimisation. Variants: unknown features (μ), timing optimisation (†), timing and position optimisation (‡). (Standard error across the 6 different speakers was found to be 0.01 on average.)

| Feature set | # Features | Method | Score ↑ |
|---|---|---|---|
| GP + one-hot phoneme | 73 | piecewise-cst | 0.595 |
| | | linear^μ | **0.679** |
| | | cubic Hermite^μ | 0.668 |
| | | natural cubic | 0.663 |
| one-hot AP + one-hot phoneme | 94 | linear^μ | 0.663 |
| | | cubic Hermite^μ | 0.648 |
| | | natural cubic^μ | 0.628 |
| scalar AP + one-hot phoneme | 70 | linear^μ | 0.656 |
| | | cubic Hermite^μ | 0.642 |
| | | natural cubic^μ | 0.624 |
| one-hot phoneme | 47¹ | piecewise-cst | 0.589 |
| | | linear | 0.645 |
| | | cubic Hermite | 0.642 |
| | | natural cubic | 0.630 |
| GP | 26 | piecewise-cst | 0.559 |
| | | linear^μ | 0.630 |
| | | cubic Hermite^μ(†) | 0.622 |
| | | natural cubic | 0.629 |
| one-hot AP | 32 | linear^μ | 0.608 |
| | | cubic Hermite^μ‡ | 0.596 |
| | | natural cubic^μ | 0.538 |
| scalar AP | 8 | linear^μ | 0.511 |
| | | cubic Hermite^μ‡ | 0.506 |
| | | natural cubic^μ | 0.479 |

¹ British long vowels and the silence are included.

To ensure that the feature-level information for phonemes is relevant, we also use a feature set with one-hot phoneme encodings. In total, we evaluate seven feature sets: one-hot phonemes, scalar AP (also enriched with one-hot phonemes), one-hot AP (also enriched with one-hot phonemes), binary GP and unknown-supporting GP (also enriched with one-hot phonemes).

2.2 Dataset

The articulatory data comes from the publicly available MOCHA-TIMIT dataset (available at https://www.cstr.ed.ac.uk/research/projects/artic/mocha.html). It provides electromagnetic articulography (EMA) recordings for 460 short sentences read by 8 British English speakers along 12 dimensions (2D midsagittal coordinates for 6 articulators: tongue tip, tongue body, tongue dorsum, lower incisor, upper lip and lower lip). In this study, we consider 6 speakers, namely fsew0, msak0, ffes0, mjjn0, faet0 and maps0, because a sequence of waveform files did not match the given transcriptions for the other two speakers. For each speaker, the 460 utterances are split into 410 for training (out of which 20 are randomly drawn for development) and 50 for testing.

The EMA data is low-pass filtered at 50 Hz and down-sampled from 500 Hz to 100 Hz. As with [23], raw EMA data is then converted into an easier-to-interpret and lower-dimensional set of 6 ‘articulatory parameters’ (jaw height, tongue body, tongue back, tongue tip, lip protrusion and lip height) using a linear decomposition technique, which is often referred to as ‘guided PCA’. It aims to decouple the jaw from tongue and lip movements and extract independent degrees of freedom from the vocal tract.

Each audio recording was segmented at the phonetic level using the Montreal Forced Aligner (https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner). Based on the resulting phonetic segmentation, the initial and final utterance silences were removed and the utterances with non-silence boundary phones were discarded. The filtered phonetic segmentations were later mapped into featural segmentations by replacing the phones with the phonological features of their underlying phonemes (lookup table mapping). Finally, we inferred timings for the phonological targets from the time midpoint of each phoneme in the featural segmentation.
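The lookup-table mapping and the midpoint-timing heuristic can be sketched as below. The lookup table, its feature values and the segment boundaries are hypothetical; real boundaries would come from the forced alignment.

```python
# Hypothetical phoneme-to-feature lookup table (values are placeholders).
FEATURES = {"m": [1.0, 1.0], "a": [1.0, -1.0]}


def featural_segmentation(phone_segments, lookup):
    """Replace each (phone, start, end) triple by (features, start, end)."""
    return [(lookup[phone], start, end) for phone, start, end in phone_segments]


def midpoint_timings(segments):
    """Place each phonological target at the midpoint of its segment."""
    return [(start + end) / 2.0 for _, start, end in segments]


# Segment boundaries in seconds, as a forced aligner might return them.
phones = [("m", 0.0, 0.08), ("a", 0.08, 0.24)]
segments = featural_segmentation(phones, FEATURES)
timings = midpoint_timings(segments)
```

Here the two targets fall at roughly 0.04 s and 0.16 s, the midpoints of the "m" and "a" segments.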

2.3 Forward model

Different forward mapping techniques are tested to generate continuous trajectories from the discrete pseudo-motor commands provided by GP and AP (Section 2.1). As a first baseline, we use the piecewise-constant interpolation, which keeps all the phonological features constant for the duration given by the phonetic segmentation.
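A minimal sketch of the piecewise-constant baseline, assuming segments are given as (features, start, end) triples in seconds and that trajectories are sampled at 100 Hz. The segment values are hypothetical.

```python
def piecewise_constant(tau, segments):
    """Return the feature vector of the segment that contains time tau."""
    for features, start, end in segments:
        if start <= tau < end:
            return features
    # At or beyond the end of the utterance, hold the last segment.
    return segments[-1][0]


# Two hypothetical segments with two features each.
segments = [([1.0, -1.0], 0.0, 0.08), ([-1.0, 1.0], 0.08, 0.24)]

# Sample the trajectory every 10 ms (100 Hz).
frames = [piecewise_constant(k / 100, segments) for k in range(24)]
```

Every feature jumps instantaneously at each segment boundary, which is why this method requires fully specified feature values.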

To obtain a degree of smoothness that better fits the articulatory space, we test linear and cubic interpolation methods. Specifically, we consider two cubic methods: the cubic Hermite spline and the natural cubic spline. The former enforces zero velocity at all targets and continuity of both position and velocity (so the acceleration is possibly discontinuous at the targets), whereas the latter enforces continuity of position, velocity and acceleration, with zero acceleration at the initial and final targets.
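The linear and zero-velocity cubic Hermite methods can be sketched for a single feature as follows. The function names and the example targets are assumptions for illustration; the natural cubic spline is omitted because it requires solving a tridiagonal system over all targets.

```python
import bisect


def _locate(tau, times):
    """Return the segment index k and the normalised position s in [0, 1]."""
    k = bisect.bisect_right(times, tau) - 1
    k = max(0, min(k, len(times) - 2))
    s = (tau - times[k]) / (times[k + 1] - times[k])
    return k, min(max(s, 0.0), 1.0)


def linear_interp(tau, times, values):
    """Straight line between consecutive targets."""
    k, s = _locate(tau, times)
    return values[k] + (values[k + 1] - values[k]) * s


def hermite_interp(tau, times, values):
    """Cubic Hermite segment with zero velocity at both of its targets."""
    k, s = _locate(tau, times)
    blend = 3.0 * s ** 2 - 2.0 * s ** 3  # slope is 0 at s = 0 and s = 1
    return values[k] + (values[k + 1] - values[k]) * blend


times = [0.0, 0.1, 0.3]    # hypothetical target timings in seconds
values = [0.0, 1.0, -1.0]  # hypothetical target values of one feature
```

Both functions pass through every target. Between targets, the Hermite curve leaves and approaches each target more slowly than the straight line, since its velocity is zero there.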

Formally, let $d \in \mathbf{N}_{>0}$ be the number of phonological features. For a given utterance, let $K \in \mathbf{N}_{>0}$ be its number of non-boundary (or intermediate) targets. Then, its featural segmentation is denoted by $(\mathbf{X}, \mathbf{Y}) \in \mathbf{R}^{(K+2) \times d} \times \mathbf{R}^{(K+2) \times 2}$, where $\mathbf{X}$ contains the $K+2$ target positions, and $\mathbf{Y}$ the targets’ time intervals. In this work, we remove the boundary silences, so $y_{1,\ast} = \mathbf{0}_2$, $y_{K+2,\ast} = y_{K+1,2} \mathbf{1}_2$, and $x_{1,\ast} = x_{K+2,\ast} = \mathbf{0}_d$. From this, we deduce a vector of midpoint target timings $\mathbf{t} \triangleq \frac{1}{2} \mathbf{Y} \mathbf{1}_2$.
The (base) interpolating function for the utterance $(\mathbf{X}, \mathbf{Y})$, expressed as $f(\tau; \mathbf{X}, \mathbf{t})$, $0 \leq \tau \leq t_{K+2}$, thus satisfies

$$f(t_k; \mathbf{X}, \mathbf{t}) = x_{k,\ast}, \tag{1}$$

for all $1 \leq k \leq K+2$.

To tackle the uncertainty of the midpoint-timing heuristic assumed for the base interpolating functions and to allow for target undershoot (which occurs when there is not enough time for the forward model to reach some targets), we include two additive optimisations over time and space. The optimised interpolating function $f$ learns the target positions and timings $(\mathbf{X}', \mathbf{t}')$ such that

$$f(\tau; \mathbf{X}, \mathbf{t}) = g(\tau; \mathbf{X}', \mathbf{t}'), \quad 0 \leq \tau \leq t_{K+2}, \tag{2}$$

where the base interpolating function $g$ satisfies Equation 1, by minimising the objective function

$$L_{\lambda}(\mathbf{X}', \mathbf{t}') \triangleq \int_{0}^{t_{K+2}} \lVert g''(\tau; \mathbf{X}', \mathbf{t}') \rVert_2^2 \, d\tau + \lambda \sum_{k=2}^{K+1} \lVert g(t_k'; \mathbf{X}', \mathbf{t}') - x_{k,\ast} \rVert_2^2. \tag{3}$$

We optimise the objective function $L_{\lambda}$ by on-utterance gradient descent with the initialisation $(\mathbf{X}', \mathbf{t}') = (\mathbf{X}, \mathbf{t})$.
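To make the trade-off in Equation 3 concrete, the sketch below optimises target positions only, for a single feature and with fixed timings, under the zero-velocity cubic Hermite spline. It relies on a closed form that holds for that spline: on a segment of duration $h$ going from value $a$ to value $b$, the integral of the squared second derivative equals $12(b-a)^2/h^3$. The fixed boundary targets, the plain gradient-descent update and all numerical values are assumptions chosen so that this toy problem converges; they are not the settings used in the experiments.

```python
def objective(x_opt, x_ref, times, lam):
    """Equation 3 for one feature of a zero-velocity cubic Hermite spline."""
    smooth = sum(
        12.0 * (x_opt[k + 1] - x_opt[k]) ** 2 / (times[k + 1] - times[k]) ** 3
        for k in range(len(times) - 1)
    )
    spatial = sum((x_opt[k] - x_ref[k]) ** 2 for k in range(1, len(times) - 1))
    return smooth + lam * spatial


def position_gradient(x_opt, x_ref, times, lam):
    """Gradient of the objective w.r.t. the intermediate target positions."""
    grad = [0.0] * len(x_opt)  # boundary targets stay fixed (assumption)
    for k in range(1, len(x_opt) - 1):
        left = (x_opt[k] - x_opt[k - 1]) / (times[k] - times[k - 1]) ** 3
        right = (x_opt[k + 1] - x_opt[k]) / (times[k + 1] - times[k]) ** 3
        grad[k] = 24.0 * (left - right) + 2.0 * lam * (x_opt[k] - x_ref[k])
    return grad


def optimise_positions(x_ref, times, lam, lr, steps):
    """Plain gradient descent initialised at the ideal targets."""
    x_opt = list(x_ref)
    for _ in range(steps):
        grad = position_gradient(x_opt, x_ref, times, lam)
        x_opt = [x - lr * g for x, g in zip(x_opt, grad)]
    return x_opt


times = [0.0, 0.1, 0.2, 0.3]    # hypothetical target timings in seconds
x_ref = [0.0, 1.0, -1.0, 0.0]   # hypothetical ideal target values
x_opt = optimise_positions(x_ref, times, lam=1e5, lr=1e-6, steps=200)
```

In this toy setting the two intermediate targets are pulled towards each other, which is a simple instance of target undershoot: the smoothness term is lowered at the cost of a small deviation from the ideal values, and a larger $\lambda$ keeps the optimised targets closer to them.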

2.4 Linear probing

To evaluate the different interpolation methods described above, we adopted the same metric as used in [19], referred to as the articulatory score.

Let $S_i = \{(\mathbf{X}_j, \mathbf{t}_j, \mathbf{Z}_j)_j\}$ be the set of utterances for the $i$th speaker, with $\mathbf{Z}_j \in \mathbf{R}^{n_{\mathbf{Z}_j} \times 6}$ being the $n_{\mathbf{Z}_j}$ frames of articulatory parameters of the $j$th utterance. Then, for each interpolation method $f$ and speaker $S_i$, we learn a linear transformation $h_i$ that minimises the reconstruction loss between the interpolated trajectories and the expected articulatory parameters as follows

$$\mathcal{L}_i = \frac{1}{\lvert S_i \rvert} \sum_{(\mathbf{X}, \mathbf{t}, \mathbf{Z}) \in S_i} \frac{1}{n_{\mathbf{Z}}} \sum_{k=1}^{n_{\mathbf{Z}}} \left\lVert h_i\left(f\left(\frac{k}{100}; \mathbf{X}, \mathbf{t}\right)\right) - z_{k,\ast} \right\rVert_2^2. \tag{4}$$

This is done with the Adam optimiser, using a learning rate of 0.001 and $\beta = (0.9, 0.999)$. We run the learning procedure for 100 epochs, unless it is stopped early because the validation loss has stopped decreasing (patience fixed at 5).
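The stopping rule can be sketched independently of the training code. The criterion used here (strict improvement over the best validation loss so far, with no minimum delta) is an assumption, as the text only gives the patience and the maximum number of epochs.

```python
def stopping_epoch(val_losses, patience=5, max_epochs=100):
    """Return the 1-based epoch at which training stops.

    Training halts once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs, or after `max_epochs`.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical validation curve: improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 10
stopped_at = stopping_epoch(losses)
```

With this hypothetical curve, the best loss is reached at epoch 3 and training stops at epoch 8, after five epochs without improvement.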

For the optimised cubic interpolations, we first do a grid search of the hyper-parameters used in the on-utterance target optimisation: (i) timing learning rate within $\{1 \times 10^{-6}, 5 \times 10^{-6}, 1 \times 10^{-5}, 5 \times 10^{-5}, 1 \times 10^{-4}\}$, (ii) position learning rate within $\{1 \times 10^{-3}, 1 \times 10^{-2}, 1 \times 10^{-1}\}$, and (iii) loss weight parameter $\lambda \in \{0, 1 \times 10^{3}, 1 \times 10^{4}, 1 \times 10^{5}, 1 \times 10^{6}, 1 \times 10^{7}\}$ (the spatial term of the loss in Equation 3 is very small compared with the smoothness term). For each interpolation method, the hyper-parameter configuration with the highest articulatory score on the development set was selected. Finally, we compute the Pearson correlation coefficient (PCC) between the learnt linear projections and the articulatory parameters. The final articulatory score of an interpolation method is then the average PCC of all articulatory parameters and speakers.
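The final score can be sketched as below. How frames are pooled before computing each PCC (per utterance or over the whole test set) is not specified in the text, so the nested-dictionary layout and the toy values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)  # undefined for constant inputs


def articulatory_score(predicted, measured):
    """Average the PCC over all speakers and articulatory parameters.

    Both arguments map speaker -> parameter name -> list of frame values.
    """
    pccs = [
        pearson(predicted[spk][param], measured[spk][param])
        for spk in measured
        for param in measured[spk]
    ]
    return sum(pccs) / len(pccs)


# Toy data: one speaker, two parameters, four frames each.
predicted = {"spk": {"jaw": [0.0, 1.0, 2.0, 3.0], "tip": [0.0, 1.0, 2.0, 3.0]}}
measured = {"spk": {"jaw": [1.0, 3.0, 5.0, 7.0], "tip": [3.0, 2.0, 1.0, 0.0]}}
score = articulatory_score(predicted, measured)
```

In this toy case one parameter is perfectly correlated and the other perfectly anti-correlated, so the average is zero. Because the PCC is invariant to scale and offset, the score reflects the shape of the projected trajectories rather than their amplitude.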

3 Results

We run the linear, cubic Hermite spline and natural cubic spline interpolation methods (Section 2.3) on the seven feature sets derived from one-hot phoneme encoding, AP and GP theories (Section 2.1). The piecewise-constant interpolation can only be run on fully specified feature sets, here the one-hot phoneme and binary GP feature sets.

Table 2: Articulatory score for each speaker on the GP feature set enriched with one-hot phonemes.

| Method | msak0 | fsew0 | ffes0 | mjjn0 | maps0 | faet0 | Average |
|---|---|---|---|---|---|---|---|
| piecewise-cst | 0.634 | 0.616 | 0.590 | 0.591 | 0.537 | 0.604 | 0.595 |
| linear | **0.729** | **0.704** | **0.666** | **0.658** | **0.623** | **0.693** | **0.679** |
| cubic Hermite | 0.718 | 0.695 | 0.655 | 0.649 | 0.611 | 0.681 | 0.668 |
| natural cubic | 0.711 | 0.686 | 0.652 | 0.632 | 0.613 | 0.685 | 0.663 |

Table 3: Articulatory score for each articulatory parameter on the GP feature set enriched with one-hot phonemes.
Method         Jaw height      Tongue body     Tongue dorsum   Tongue tip      Lip protrusion  Lip height      Average
piecewise-cst  0.646           0.653           0.543           0.539           0.532           0.658           0.595
linear         \mathbf{0.715}  \mathbf{0.750}  \mathbf{0.625}  \mathbf{0.627}  \mathbf{0.621}  \mathbf{0.736}  \mathbf{0.679}
cubic Hermite  0.703           0.742           0.618           0.616           0.610           0.722           0.668
natural cubic  0.708           0.733           0.604           0.606           0.615           0.713           0.663

Table 1 reports the articulatory score by feature space and interpolation method on a held-out test set. The scores correspond to each interpolation method’s best configuration on each feature set. We observe that the scalar AP features prove difficult to interpolate on, but replacing their fixed values (which probably need to be learned) with equidistant values in the form of one-hot encodings helps close the gap to the GP features. Surprisingly, the one-hot phoneme encodings are the best single feature set.
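As an illustration of how such an articulatory score can be computed, the following is a minimal sketch, assuming that the linear projection is fitted by ordinary least squares with a bias term on training frames and that the Pearson correlation is then measured per articulatory parameter on held-out frames. The function name `articulatory_score`, the bias term and the synthetic data are assumptions made for the example, not the exact procedure used in our experiments.

```python
import numpy as np


def articulatory_score(train_feats, train_art, test_feats, test_art):
    """Fit a linear projection on training frames, then return the Pearson
    correlation per articulatory parameter on held-out frames."""

    def add_bias(x):
        # Constant column so that the projection includes an offset (assumption).
        return np.hstack([x, np.ones((x.shape[0], 1))])

    weights, *_ = np.linalg.lstsq(add_bias(train_feats), train_art, rcond=None)
    pred = add_bias(test_feats) @ weights

    # Pearson correlation, computed column by column.
    pred_c = pred - pred.mean(axis=0)
    true_c = test_art - test_art.mean(axis=0)
    num = (pred_c * true_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
    return num / den


# Synthetic stand-ins for interpolated feature trajectories and EMA-derived data.
rng = np.random.default_rng(0)
feats = rng.normal(size=(400, 5))     # 400 frames, 5 feature dimensions
mixing = rng.normal(size=(5, 3))      # hypothetical feature-to-articulator map
art = feats @ mixing + 0.5 * rng.normal(size=(400, 3))

scores = articulatory_score(feats[:300], art[:300], feats[300:], art[300:])
print("per-parameter correlation:", scores)
print("average:", scores.mean())
```

Averaging the per-parameter correlations, as in the last line, gives a single number comparable in kind to the "Average" columns of the tables.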

Given the potential complementarity of information between the GP/AP feature sets and the one-hot phoneme encodings, we probe the GP and AP features enriched with the latter. The mix turns out to be beneficial for all the interpolation methods, although it sacrifices the small number of interpretable features that we seek for the inverse models envisaged in future work.

Tables 2 and 3 show the articulatory scores per speaker and per articulatory parameter, respectively, on the best feature space, namely GP features enriched with one-hot phonemes. In both cases, the ranking induced by the average score (linear $\succ$ cubic Hermite $\succ$ natural cubic $\succ$ piecewise constant) holds across conditions, except for the two speakers maps0 and faet0, the jaw height and the lip protrusion (natural cubic $\succ$ cubic Hermite). Interestingly, Tables 1, 2 and 3 show that linear interpolation exploits the given phonological spaces best, regardless of the feature set, speaker or articulatory parameter.
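To make the four compared interpolation families concrete, the sketch below generates a trajectory through a short sequence of targets for one feature dimension. It is only an approximation under stated assumptions: the piecewise-constant variant holds each target until the next one, the cubic Hermite variant uses zero slopes at the targets, and the target times and values are invented for the example. The helper `interpolate_targets` is hypothetical; the NumPy and SciPy calls are standard.

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline, CubicSpline


def interpolate_targets(times, values, query, method):
    """Evaluate one feature trajectory at `query` times from sparse targets."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    if method == "piecewise-cst":
        # Hold each target until the next target time (assumption).
        idx = np.searchsorted(times, query, side="right") - 1
        return values[np.clip(idx, 0, len(times) - 1)]
    if method == "linear":
        return np.interp(query, times, values)
    if method == "cubic Hermite":
        # Zero slope at every target (assumption): C1-continuous trajectory.
        return CubicHermiteSpline(times, values, np.zeros_like(values))(query)
    if method == "natural cubic":
        # C2-continuous spline with zero second derivative at both ends.
        return CubicSpline(times, values, bc_type="natural")(query)
    raise ValueError(f"unknown method: {method}")


# Invented targets (seconds, feature value) for a single feature dimension.
times = [0.0, 0.1, 0.25, 0.4]
values = [0.0, 1.0, -1.0, 0.5]
query = np.linspace(0.0, 0.4, 41)

for method in ("piecewise-cst", "linear", "cubic Hermite", "natural cubic"):
    track = interpolate_targets(times, values, query, method)
    print(f"{method:14s} min={track.min():+.3f} max={track.max():+.3f}")
```

All four variants pass through the targets at the target times; they differ only in how they move between them, which is the property the ranking above compares.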

In Table 1, we see that most of the best scores reported were obtained on features with unknown support. The results in Table 4 support the hypothesis that, in general, keeping and interpolating unknown feature values is better than associating them with a fixed value. On the other hand, the effect of the target timing and/or position optimisation depends on the interpolation method. For instance, when we optimise the timings on the cubic Hermite spline, the articulatory score does not improve, and the spatial optimisation has the same (negative) impact. This is why we see few cubic Hermite interpolations with target optimisations in Table 1.

Table 4: Comparison of GP + one-hot phoneme feature set variants and the effect of target optimisation. The two leftmost scores correspond to the binary and unknown-supporting feature-set variants without optimisation, and the right scores to the timing-only and the timing-and-position optimisations on the underlined feature sets.
Method         Non-optimised                                  Optimised
               Binary             Unknown                     Time   Time & space
linear         0.659              \mathbf{0.679}
cubic Hermite  0.645              \underline{\mathbf{0.668}}  0.660  0.649
natural cubic  \underline{0.623}  0.638                       0.624  \mathbf{0.663}
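The difference between the binary and the unknown-supporting variants can be illustrated with a minimal sketch for linear interpolation on a single feature dimension. We assume here that an unknown value is encoded as NaN, that the unknown-supporting variant simply skips unspecified targets and interpolates between the neighbouring specified ones, and that the fixed-value variant substitutes a constant (0.0 in the example). The helper `linear_track`, the NaN encoding and the numerical values are illustrative assumptions.

```python
import numpy as np


def linear_track(times, targets, query, fixed_value=None):
    """Linearly interpolate one feature dimension whose targets may be unknown (NaN).

    fixed_value=None : skip unknown targets and interpolate across them.
    fixed_value=x    : replace unknown targets by the constant x first.
    """
    times = np.asarray(times, dtype=float)
    targets = np.asarray(targets, dtype=float)
    if fixed_value is None:
        keep = ~np.isnan(targets)  # only the specified targets constrain the track
        return np.interp(query, times[keep], targets[keep])
    filled = np.where(np.isnan(targets), fixed_value, targets)
    return np.interp(query, times, filled)


# A feature specified for the first and last segments, unknown for the middle one.
times = [0.0, 0.1, 0.2]
targets = [1.0, np.nan, 1.0]
query = np.linspace(0.0, 0.2, 5)

print("interpolated unknown:", linear_track(times, targets, query))
print("fixed value 0.0     :", linear_track(times, targets, query, fixed_value=0.0))
```

In the first case the feature stays at the value of its specified neighbours throughout the unspecified segment, whereas in the second it is pulled towards the constant, which imposes a movement that the phonological description does not require.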

4 Conclusion

In this study, we have analysed phonological features as potential pseudo-motor commands in a computational model of a speech perception-production loop. We found that: (i) smooth trajectories on generative phonology features correlate better with articulatory parameters than those on articulatory phonology ones, with a correlation coefficient of 0.67 when GP features are enriched with one-hot phoneme encodings, (ii) a linear forward model better captures the dynamics of real articulatory data, but target optimisation (in terms of timing and/or position) helps a cubic model to reduce the gap, (iii) with the AP features, a better correlation coefficient is obtained with a one-hot encoding, in which all the values for a given feature are equidistant to one another, rather than with a fixed, scalar continuous one, (iv) interpolating unknown (or context-dependent) features is better than associating a fixed value with them.

Since interpolating under-specified dimensions of articulatory targets appears to lead to a better fit, future work could try to push this strategy further by incorporating more under-specified dimensions in featural segmentations, thus enabling the smoothness of the forward model’s trajectories to better model co-articulation.

Further work should also investigate the reason why linear interpolation of articulatory targets outperforms smoother cubic spline interpolation. This is surprising since articulatory trajectories are not linear, cubic splines are excellent interpolators and the biological motion literature suggests that smooth trajectories should fit articulatory data well [24]. A possible interpretation is that unwarranted assumptions in our analyses cause the observed advantage of linear interpolation methods. Based on our results, the advantage of linear interpolation does not appear sensitive to assumptions regarding target definition (generative vs articulatory, specification, timing, position). It may be the case, however, that the assumption of a fixed inventory of targets at the phonemic level is too optimistic, even for a single speaker, at least without controlling further variables that may modulate target parameters, such as prosodic effects. To test this hypothesis, simulated trajectories could be used to determine if ignoring prosodic effects would predict an advantage for linear interpolation even when using a cubic spline model for the dynamics. Alternatively, our results may indicate that classical results on biological motion, obtained in highly controlled settings, do not accurately characterize the dynamics of biological motion in less restricted environments. To test this hypothesis, our methodology could be applied to more controlled trajectories (e.g. isolated syllables), where it should find that smoother trajectories provide a better fit than linear interpolation. This would validate our methodology, which could then be leveraged to better understand the dynamics of biological motion in the wild.

References

  • [1] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K.-t. Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H.-y. Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” in Proc. Interspeech, 2021, pp. 1194–1198.
  • [2] E. Dunbar, N. Hamilakis, and E. Dupoux, “Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1211–1226, 2022.
  • [3] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., “AudioLM: a Language Modeling Approach to Audio Generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [4] K. Lakhotia, E. Kharitonov, W.-N. Hsu, Y. Adi, A. Polyak, B. Bolte, T.-A. Nguyen, J. Copet, A. Baevski, A. Mohamed et al., “On generative spoken language modeling from raw audio,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1336–1354, 2021.
  • [5] M. Lavechin, M. de Seyssel, M. Métais, F. Metze, A. Mohamed, H. Bredin, E. Dupoux, and A. Cristia, “Modeling early phonetic acquisition from child-centered audio data,” Cognition, vol. 245, p. 105734, 2024.
  • [6] M. Hallap, E. Dupoux, and E. Dunbar, “Evaluating context-invariance in unsupervised speech representations,” in Proc. Interspeech, 2023, pp. 2973–2977.
  • [7] A. M. Liberman and I. G. Mattingly, “The motor theory of speech perception revised,” Cognition, vol. 21, no. 1, pp. 1–36, Oct 1985.
  • [8] J.-L. Schwartz, A. Basirat, L. Ménard, and M. Sato, “The Perception-for-Action-Control Theory (PACT): A perceptuo-motor theory of speech perception,” Journal of Neurolinguistics, vol. 25, no. 5, pp. 336–354, 2012.
  • [9] J. S. Perkell and D. H. Klatt, “Invariance and variability in speech processes,” in Symposium on Invariance and Variability of Speech Processes, Oct, 1983, Massachusetts Inst. of Technology, Cambridge, MA, US. Lawrence Erlbaum Associates, Inc, 1986.
  • [10] J. I. Skipper, J. T. Devlin, and D. R. Lametti, “The hearing ear is always found close to the speaking tongue: Review of the role of the motor system in speech perception,” Brain and Language, vol. 164, pp. 77–105, 2017.
  • [11] M. I. Jordan and D. M. Wolpert, “Computational motor control,” M. Gazzaniga, Ed.   MIT Press, 1999.
  • [12] J.-F. Patri, J. Diard, and P. Perrier, “Optimal speech motor control and token-to-token variability: a Bayesian modeling approach,” Biological Cybernetics, vol. 109, no. 6, pp. 611–626, 2015.
  • [13] A. K. Philippsen, R. F. Reinhart, and B. Wrede, “Learning how to speak: Imitation-based refinement of syllable production in an articulatory-acoustic model,” in Proc. International Conference on Development and Learning and on Epigenetic Robotics, 2014, pp. 195–200.
  • [14] H. Rasilo and O. Räsänen, “An online model for vowel imitation learning,” Speech Communication, vol. 86, pp. 1–23, 2017.
  • [15] M.-A. Georges, J. Diard, L. Girin, J.-L. Schwartz, and T. Hueber, “Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation,” in Proc. ICASSP, 2022, pp. 8252–8256.
  • [16] N. Chomsky and M. Halle, The Sound Pattern of English.   Harper and Row, 1968.
  • [17] C. P. Browman and L. Goldstein, “Articulatory phonology: An overview,” Phonetica, vol. 49, no. 3-4, pp. 155–180, 1992.
  • [18] S. Maeda, “Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model,” in Speech production and speech modelling, W. J. Hardcastle and A. Marchal, Eds.   Springer Science & Business Media, 2012, vol. 55, pp. 131–149.
  • [19] C. J. Cho, P. Wu, A. Mohamed, and G. K. Anumanchipalli, “Evidence of vocal tract articulation in self-supervised learning of speech,” in Proc. ICASSP, 2023, pp. 1–5.
  • [20] B. Hayes, Introductory Phonology.   John Wiley & Sons, 2008, vol. 7.
  • [21] K. Livescu, “Feature-based pronunciation modeling for automatic speech recognition,” Ph.D. dissertation, Massachusetts Institute of Technology, 2005.
  • [22] L. Badino, C. Canevari, L. Fadiga, and G. Metta, “Integrating articulatory data in deep neural network-based acoustic modeling,” Computer Speech & Language, vol. 36, pp. 173–195, 2016.
  • [23] A. Serrurier, P. Badin, A. Barney, L.-J. Boë, and C. Savariaux, “The tongue in speech and feeding: Comparative articulatory modelling,” Journal of Phonetics, vol. 40, no. 6, pp. 745–763, 2012.
  • [24] T. Flash and N. Hogan, “The coordination of arm movements: an experimentally confirmed mathematical model,” Journal of Neuroscience, vol. 5, no. 7, pp. 1688–1703, 1985.