High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram
by
Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy
2019
Abstract
In speech synthesis and speech enhancement systems, mel-spectrograms need to
be precise acoustic representations. However, the generated spectrograms are
over-smoothed and cannot produce high-quality synthesized speech. Inspired by
image-to-image translation, we address this problem with a learning-based
post-filter that combines Pix2PixHD and ResUnet to reconstruct
mel-spectrograms with super-resolution. From the resulting super-resolution
spectrogram networks, we generate enhanced spectrograms that yield
high-quality synthesized speech. Our proposed model achieves improved mean
opinion scores (MOS) of 3.71 and 4.01 over baseline results of 3.29 and 3.84,
using the Griffin-Lim and WaveNet vocoders, respectively.
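To make the representation concrete: a mel-spectrogram is obtained by projecting an STFT power spectrogram through a bank of triangular filters spaced on the mel scale; this lossy compression is one reason generated spectrograms end up over-smoothed. Below is a minimal numpy sketch of such a filterbank (HTK mel formula; the sample rate, FFT size, and 80 mel bands are illustrative assumptions, not parameters taken from the paper):

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel scale: perceptually motivated warping of frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=None):
    """Triangular filterbank mapping n_fft//2+1 STFT bins to n_mels bands."""
    fmax = fmax or sr / 2.0
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    # n_mels + 2 points: each filter spans (left, center, right) neighbors
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

# Apply to a (n_bins, n_frames) power spectrogram (random stand-in here):
power_spec = np.abs(np.random.randn(513, 100)) ** 2
mel_spec = mel_filterbank() @ power_spec  # shape (80, 100)
```

Because the projection discards fine spectral detail, inverting a mel-spectrogram back to a waveform (e.g. with Griffin-Lim or WaveNet, as in the paper) is under-determined, which motivates enhancing the spectrogram before vocoding.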
arXiv:1912.01167v1