On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models

Varshavsky-Hassid, Miri; Hirsch, Roy; Cohen, Regev; Golany, Tomer; Freedman, Daniel; Rivlin, Ehud

arXiv:2402.12423 (cs)

[Submitted on 19 Feb 2024 (v1), last revised 4 Jun 2024 (this version, v2)]

Abstract: The incorporation of Denoising Diffusion Models (DDMs) in the Text-to-Speech (TTS) domain is rising, providing great value in synthesizing high-quality speech. Although they exhibit impressive audio quality, the extent of their semantic capabilities is unknown, and controlling their synthesized speech's vocal properties remains a challenge. Inspired by recent advances in image synthesis, we explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser. We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised. We then demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements. We present evidence of the semantic and acoustic qualities of the edited audio, and provide supplemental samples: this https URL.

Comments:	Accepted to ACL 2024
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2402.12423 [cs.SD]
	(or arXiv:2402.12423v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2402.12423

Submission history

From: Miri Varshavsky-Hassid
[v1] Mon, 19 Feb 2024 16:22:21 UTC (8,031 KB)
[v2] Tue, 4 Jun 2024 11:03:57 UTC (8,942 KB)
