CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Kim, Jaehyeon; Lee, Keon; Chung, Seungjun; Cho, Jaewoong


arXiv:2404.02781 (eess)

[Submitted on 3 Apr 2024]


Abstract: With the emergence of neural audio codecs, which encode audio into multiple streams of discrete tokens, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, owing to the long sequence length and the complexity of modeling multiple token sequences. To mitigate these issues, we present CLaM-TTS, which employs probabilistic residual vector quantization to (1) achieve superior compression in token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models in naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performance.
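To make the phrase "multiple streams of discrete tokens" concrete, the following is a minimal sketch of plain residual vector quantization (RVQ), not the probabilistic variant proposed in the paper. The codebooks are random rather than learned, and every name and size here (`rvq_encode`, `rvq_decode`, `DIM`, `DEPTH`, `CODEBOOK_SIZE`) is an illustrative assumption rather than anything taken from the authors' implementation. It shows only why a single audio frame yields one token per quantizer depth, which is the set of tokens a language model would need to emit for that frame.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantize vec with a stack of codebooks; returns one token per depth."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # The next depth only sees what this depth failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(codebooks, tokens):
    """Sum the selected entries across depths to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Illustrative sizes and random (untrained) codebooks; not values from the paper.
rng = random.Random(0)
DIM, DEPTH, CODEBOOK_SIZE = 4, 3, 16
codebooks = [
    [[rng.gauss(0.0, 1.0 / (d + 1)) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for d in range(DEPTH)
]
frame = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
tokens = rvq_encode(codebooks, frame)   # DEPTH tokens for a single frame
recon = rvq_decode(codebooks, tokens)   # approximate reconstruction of frame
```

In this sketch each frame produces `DEPTH` tokens, so a sequence of T frames becomes T × `DEPTH` tokens in total. The abstract's two claims address both factors: fewer frames through better compression, and all tokens of a frame generated in one step.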

Comments:	ICLR 2024
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2404.02781 [eess.AS]
	(or arXiv:2404.02781v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2404.02781

Submission history

From: Jaehyeon Kim
[v1] Wed, 3 Apr 2024 14:52:20 UTC (1,197 KB)
