Learning and controlling the source-filter representation of speech with a variational autoencoder

Sadok, Samir; Leglaive, Simon; Girin, Laurent; Alameda-Pineda, Xavier; Séguier, Renaud

doi:10.1016/j.specom.2023.02.005


arXiv:2204.07075 (cs)

[Submitted on 14 Apr 2022 (v1), last revised 21 Mar 2023 (this version, v3)]

Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies. We show that these subspaces are orthogonal and, based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$.
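
The abstract does not spell out how the subspaces are identified or how a factor is moved within them. As a rough illustration only, the sketch below assumes that each subspace is estimated by PCA on latent codes in which a single factor varies, and that a factor is changed by replacing the coordinates of a latent vector inside that subspace. It is not the authors' implementation: the toy "encoder outputs" are simulated with a random orthogonal matrix, and every name here (`fit_subspace`, `move_in_subspace`, `LATENT_DIM`, `N`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 16  # hypothetical latent size
N = 200          # hypothetical number of labeled synthetic frames per factor


def fit_subspace(latents, var_kept=0.95):
    """PCA on latent codes where only one factor varies.

    Returns the mean and an orthonormal basis (LATENT_DIM, k) that keeps
    at least `var_kept` of the variance.
    """
    mean = latents.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, s, vt = np.linalg.svd(latents - mean, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(ratio, var_kept)) + 1
    return mean, vt[:k].T


def move_in_subspace(z, mean, basis, target_coords):
    """Set the coordinates of z inside the subspace, leave the rest untouched."""
    coords = basis.T @ (z - mean)
    return z + basis @ (target_coords - coords)


# Toy stand-in for VAE encoder outputs (an assumption, not real speech data):
# each factor traces a 2-D curve in its own pair of orthogonal directions.
q, _ = np.linalg.qr(rng.standard_normal((LATENT_DIM, LATENT_DIM)))
base = rng.standard_normal(LATENT_DIM)
t = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
curve = np.stack([np.cos(t), np.sin(t)], axis=1)  # shape (N, 2)

z_f0 = base + curve @ q[:, 0:2].T + 1e-3 * rng.standard_normal((N, LATENT_DIM))
z_f1 = base + curve @ q[:, 2:4].T + 1e-3 * rng.standard_normal((N, LATENT_DIM))

mean_f0, b_f0 = fit_subspace(z_f0)
mean_f1, b_f1 = fit_subspace(z_f1)

# Orthogonality check: the cross-Gram matrix of the two bases should be near zero.
overlap = np.linalg.norm(b_f0.T @ b_f1)

# Move one latent vector inside the first subspace only.
z = z_f1[10]
target = np.array([0.5, -0.5])
z_new = move_in_subspace(z, mean_f0, b_f0, target)
```

In this toy setting, the coordinates of `z_new` in the second subspace stay essentially unchanged because the two bases are close to orthogonal, which is the property that makes independent control of each factor possible. A real pipeline would additionally need a mapping from a target $f_0$ or formant value to subspace coordinates, which this sketch leaves out.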

Comments:	23 pages, 7 figures, companion website: this https URL
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2204.07075 [cs.SD]
	(or arXiv:2204.07075v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2204.07075
Journal reference:	Speech Communication, vol. 148, 2023
Related DOI:	https://doi.org/10.1016/j.specom.2023.02.005

Submission history

From: Simon Leglaive
[v1] Thu, 14 Apr 2022 16:13:06 UTC (9,293 KB)
[v2] Wed, 4 May 2022 14:26:11 UTC (9,293 KB)
[v3] Tue, 21 Mar 2023 10:41:12 UTC (10,264 KB)
