Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration

Lou, Haowei; Paik, Helen; Hu, Wen; Yao, Lina

arXiv:2412.08112 (cs)

[Submitted on 11 Dec 2024]

Abstract: Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on durations generated by external tools such as the Montreal Forced Aligner, which can be time-consuming and inflexible. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.
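
For readers unfamiliar with the word error rate (WER) metric cited above, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; this is not the authors' evaluation code, and the abstract does not say which tool was used to compute WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; tokenization is a plain
    whitespace split, which is an assumption rather than the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```

In a TTS evaluation the hypothesis would typically come from running a speech recognizer on the synthesized audio, so a lower WER suggests more intelligible speech.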

Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.08112 [cs.SD]
	(or arXiv:2412.08112v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2412.08112

Submission history

From: Haowei Lou
[v1] Wed, 11 Dec 2024 05:39:12 UTC (685 KB)
