DataScience Lab, May 13, 2017
Recent deep learning approaches for speech generation
Dmitry Belevtsov (Tech Lead at IBDI)
Over the past six months, several important models based on deep neural networks have appeared that can successfully synthesize human speech at the level of individual samples. This has made it possible to sidestep many shortcomings of classical spectral approaches. In this talk I will give a brief overview of the architectures of the most popular networks, such as WaveNet and SampleRNN.
All materials are available at: http://datascience.in.ua/report2017
6. Frame-based business
1. Split the waveform into overlapping frames
2. Extract spectral features from each frame
3. Model the distribution of these parameters
4. Generate parameters
5. Convert parameters back to the waveform
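The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline of any particular system: the 16 kHz sampling rate, the 25 ms frame with a 10 ms shift, the log-magnitude spectrum as the feature, the diagonal Gaussian as the model, and the zero-phase overlap-add used for the final step are all assumptions chosen to keep the example short.

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Step 1: cut the waveform into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_magnitude_spectrum(frames):
    """Step 2: one spectral feature vector per frame (phase is discarded)."""
    window = np.hanning(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(frames * window, axis=1)) + 1e-8)

def overlap_add(frames, hop):
    """Step 5 helper: sum the frames back into one signal at their offsets."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out

sample_rate = 16000                        # assumed sampling rate, Hz
frame_len, hop = 400, 160                  # assumed 25 ms frames, 10 ms shift
t = np.arange(sample_rate) / sample_rate   # one second of audio
waveform = np.sin(2 * np.pi * 220.0 * t)   # toy stand-in for speech

frames = split_into_frames(waveform, frame_len, hop)
features = log_magnitude_spectrum(frames)

# Steps 3-4: fit a distribution over the features and sample from it.
# A diagonal Gaussian is only a placeholder for a real acoustic model.
mean, std = features.mean(axis=0), features.std(axis=0)
rng = np.random.default_rng(0)
generated = rng.normal(mean, std, size=features.shape)

# Step 5: crude zero-phase reconstruction. The phase was thrown away in
# step 2, so it has to be invented here, which is one source of artifacts.
synth_frames = np.fft.irfft(np.exp(generated), n=frame_len, axis=1)
reconstructed = overlap_add(synth_frames, hop)
```

A real system would replace the last step with a vocoder or a phase-reconstruction algorithm; the zero-phase shortcut is used here only so that every step is visible in one runnable script.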
8. Frame-based business
Pros:
● 100x lower time resolution
● Phase-invariant
● Naturally motivated
● Highly compressed
● Separated from pitch
Cons:
● Highly compressed
● Synthesis introduces unnaturalness
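Two of the points above can be checked numerically. The sketch below assumes a 16 kHz signal with a 10 ms frame shift, in which case the model sees 100 frames per second instead of 16000 samples, so the "100x" figure is an order of magnitude rather than an exact ratio. It also shows phase invariance on a toy example: two frames of the same sinusoid with different phase differ as waveforms but have the same magnitude spectrum. The sinusoid is placed exactly on an FFT bin and no window is applied, so the match is exact here; with real speech it is only approximate.

```python
import numpy as np

sample_rate = 16000   # assumed sampling rate, Hz
hop = 160             # assumed frame shift: 10 ms at 16 kHz
frame_len = 400       # assumed frame length: 25 ms at 16 kHz

# Time resolution: frames per second versus samples per second.
frames_per_second = sample_rate / hop
resolution_ratio = sample_rate / frames_per_second   # equals the hop size

# Phase invariance: same sinusoid, shifted by a quarter period.
n = np.arange(frame_len)
freq_bin = 5          # 5 * 16000 / 400 = 200 Hz, whole periods per frame
frame_a = np.cos(2 * np.pi * freq_bin * n / frame_len)
frame_b = np.cos(2 * np.pi * freq_bin * n / frame_len + np.pi / 2)

mag_a = np.abs(np.fft.rfft(frame_a))
mag_b = np.abs(np.fft.rfft(frame_b))

waveforms_differ = not np.allclose(frame_a, frame_b)
magnitudes_match = np.allclose(mag_a, mag_b, atol=1e-6)
print(frames_per_second, resolution_ratio, waveforms_differ, magnitudes_match)
```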
24. So how is it?
Pros:
● Direct waveform generation
● State-of-the-art timbre quality
● CNN-like training
Cons:
● Slow generation (40x slower than realtime on commodity CPU) *
● Sensitive to local conditioning
● Large memory footprint
● Hard to interpret
● Missing details
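The slow-generation point follows from the sample-by-sample nature of direct waveform generation: each new sample is drawn from a distribution conditioned on the samples generated before it, so the steps cannot run in parallel at synthesis time. The loop below is a toy sketch of that procedure, not WaveNet itself: the "network" is an untrained random linear layer, and the 256 quantization levels, the 64-sample context, and the 16 kHz rate are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 256   # assumed number of quantization levels per sample
context = 64      # hypothetical context length of the toy model
weights = rng.normal(0.0, 0.1, size=(context, n_classes))  # untrained placeholder

def next_sample_distribution(history):
    """Toy stand-in for the network: softmax over the next sample value."""
    x = history[-context:] / (n_classes - 1) - 0.5   # scale to [-0.5, 0.5]
    logits = x @ weights
    p = np.exp(logits - logits.max())
    return p / p.sum()

def generate(n_samples):
    """Draw samples one at a time, feeding each one back as input."""
    history = np.full(context, n_classes // 2, dtype=np.int64)  # silence seed
    out = []
    for _ in range(n_samples):
        p = next_sample_distribution(history)
        s = rng.choice(n_classes, p=p)
        out.append(s)
        history = np.append(history[1:], s)   # slide the context window
    return np.array(out)

audio = generate(200)

# What "40x slower than realtime" means per sample at an assumed 16 kHz:
sample_rate = 16000
slowdown = 40
seconds_per_sample = slowdown / sample_rate   # 2.5 ms for every sample
```

Training does not have this bottleneck, because all target samples are known in advance and every position can be evaluated in one pass, which is what "CNN-like training" refers to.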
29. SampleRNN
Pros:
● Direct waveform generation
● Great long-range dependencies modelling
● Reference impl. available
● Clear distinction between slow and fast time scales
Cons:
● Training RNN on very long sequences can be tricky
● ???
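The split between slow and fast time scales can be illustrated with a toy two-tier generator: a recurrent state that is updated once per frame, and a per-sample predictor that is conditioned on that state. This is only a sketch of the idea and not the reference implementation: the weights are random and untrained, a plain tanh recurrence stands in for the recurrent unit, and the frame size, state size, and 256 quantization levels are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_size = 16   # assumed number of samples per slow-tier step
hidden = 32       # assumed size of the slow-tier state
n_classes = 256   # assumed number of quantization levels per sample

# Untrained placeholder weights (hypothetical names).
W_in = rng.normal(0.0, 0.1, (frame_size, hidden))
W_h = rng.normal(0.0, 0.1, (hidden, hidden))
W_ctx = rng.normal(0.0, 0.1, (hidden, n_classes))
W_prev = rng.normal(0.0, 0.1, (frame_size, n_classes))

def scale(samples):
    """Map integer sample values to roughly [-0.5, 0.5]."""
    return np.asarray(samples) / (n_classes - 1) - 0.5

def generate(n_frames):
    h = np.zeros(hidden)
    samples = [n_classes // 2] * frame_size   # seed frame of silence
    slow_updates = 0
    for _frame in range(n_frames):
        # Slow tier: one recurrent update per frame of samples.
        h = np.tanh(scale(samples[-frame_size:]) @ W_in + h @ W_h)
        slow_updates += 1
        for _step in range(frame_size):
            # Fast tier: one prediction per sample, conditioned on the
            # slow state and on the most recent samples.
            logits = h @ W_ctx + scale(samples[-frame_size:]) @ W_prev
            p = np.exp(logits - logits.max())
            p /= p.sum()
            samples.append(int(rng.choice(n_classes, p=p)))
    return np.array(samples[frame_size:]), slow_updates

audio, slow_updates = generate(10)   # 160 samples from 10 slow-tier updates
```

The recurrent state here changes 16 times less often than the output, which is how the slow tier can carry long-range context while the fast tier handles sample-level detail.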
30. Papers to check out
● WaveNet: A Generative Model for Raw Audio (Oord et al. 2016)
● Fast Wavenet Generation Algorithm (Paine et al. 2016)
● Deep Voice: Real-time Neural Text-to-Speech (Arik et al. 2017)
● A Neural Parametric Singing Synthesizer (Blaauw et al. 2017)
● SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (Mehri et al. 2017)
● Char2wav: End-To-End Speech Synthesis (Sotelo et al. 2017)