Sample-based generative models for speech synthesis
Dmytro Bielievtsov @ IBDI
Frame-based business
1. Split the waveform into overlapping frames
2. Extract spectral features from each frame
3. Model the distribution of these parameters
4. Generate parameters
5. Convert parameters back to the waveform
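Steps 1 and 2 can be sketched in plain NumPy. The frame length, hop size, Hann window, and magnitude-spectrum features below are illustrative assumptions (any spectral parameterization works); steps 3–5, the statistical model and the inverse transform, are where the actual synthesis method lives and are omitted here:

```python
import numpy as np

def split_into_frames(waveform, frame_len=1024, hop=256):
    """Step 1: split the waveform into overlapping frames."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    return np.stack([waveform[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def extract_spectral_features(frames):
    """Step 2: magnitude spectrum of each windowed frame."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 220 * t)        # 1 s of a 220 Hz tone

frames = split_into_frames(waveform)          # (n_frames, 1024)
features = extract_spectral_features(frames)  # (n_frames, 513)
```

Note the "100x lower time resolution" claim on the next slide: with a 256-sample hop, the model sees ~62 frames per second instead of 16000 samples per second.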
Frame-based business
Pros:
● 100x lower time resolution
● Phase-invariant
● Naturally motivated
● Highly compressed
● Separated from pitch
Cons:
● Highly compressed (lossy)
● Synthesis introduces unnaturalness
WaveNet
● Deep
● Residual
● Convolutional
● Sample-based
● Probabilistic
● Conditional
● Generative
● Auto-regressive
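"Sample-based" and "probabilistic" here mean that each raw audio sample is modelled as a categorical variable: WaveNet mu-law compands the waveform to 256 discrete levels and predicts a softmax over them. A minimal sketch of the companding step (the exact constants follow the G.711 mu-law used in the paper):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand samples in [-1, 1] and quantize to mu + 1 discrete levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Map a quantized level back to an amplitude in [-1, 1]."""
    compressed = 2 * q.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 1001)
q = mu_law_encode(x)          # integers in [0, 255]
x_hat = mu_law_decode(q)      # lossy round trip
```

The logarithmic companding spends more of the 256 levels near zero amplitude, where speech energy concentrates, which is why 8 bits suffice.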
How does it work?
Dilated causal convolutions
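A dilated causal convolution only looks at past samples, with taps spaced `dilation` steps apart; stacking layers with dilations 1, 2, 4, …, 512 gives a receptive field of 1024 samples with 2-tap filters. A minimal NumPy sketch (the 0.5/0.5 filter weights are arbitrary, just to check causality and receptive field):

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation]; left-padding keeps it causal."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w[i] * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i in range(len(w)))

dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)      # 1024 samples for 2-tap filters

# Causality check: an impulse may only influence the present and the future.
x = np.zeros(2048)
x[1000] = 1.0
y = x
for d in dilations:
    y = causal_dilated_conv1d(y, np.array([0.5, 0.5]), d)
```

Doubling the dilation per layer makes the receptive field grow exponentially in depth while the parameter count grows only linearly.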
How does it work?
Trained like a CNN
Generates like an RNN
(with limited memory)
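Generation is sequential: one forward pass per sample, feeding each sampled value back in, with the model's "memory" limited to the receptive field. A toy sketch of that loop — `toy_next_sample_logits` is a hypothetical stand-in for a trained WaveNet, not the real network:

```python
import numpy as np

rng = np.random.default_rng(0)
receptive_field = 1024
n_levels = 256

def toy_next_sample_logits(buffer):
    """Stand-in (assumption) for a trained WaveNet: any function mapping the
    last `receptive_field` samples to logits over 256 quantized levels."""
    levels = np.arange(n_levels)
    return -0.05 * (levels - buffer[-1]) ** 2   # toy: stay near the last value

buffer = np.full(receptive_field, 128, dtype=np.int64)   # mu-law "silence"
generated = []
for _ in range(100):                     # one forward pass per sample
    logits = toy_next_sample_logits(buffer)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    sample = rng.choice(n_levels, p=p)   # sample from the softmax
    generated.append(sample)
    buffer = np.roll(buffer, -1)         # slide the limited-memory window
    buffer[-1] = sample
```

This per-sample loop is exactly why generation is slow: at 16 kHz, 16000 forward passes are needed per second of audio.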
So how is it?
Pros:
● Direct waveform generation
● State-of-the-art timbre quality
● CNN-like training
Cons:
● Slow generation (40x slower than realtime on commodity CPU) *
● Sensitive to local conditioning
● Large memory footprint
● Hard to interpret
● Missing details
Top layer activation
SampleRNN
Pros:
● Direct waveform generation
● Great modelling of long-range dependencies
● Reference impl. available
● Clear distinction between slow and fast time scales
Cons:
● Training an RNN on very long sequences can be tricky
● ???
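The "slow and fast time scales" point refers to SampleRNN's tiered structure: a frame-level module that updates once per frame conditions a sample-level module that runs at every sample. A toy sketch of that timescale split — the frame size, the scalar state, and both `tanh` update rules are illustrative assumptions standing in for the real RNN tiers:

```python
import numpy as np

FRAME = 16                               # assumed frame size for the sketch
rng = np.random.default_rng(0)

def frame_tier(prev_state, last_frame):
    """Slow tier: stand-in for the frame-level RNN, fires once per frame."""
    return np.tanh(0.5 * prev_state + last_frame.mean())

def sample_tier(conditioning, prev_sample):
    """Fast tier: stand-in for the sample-level module, fires every sample."""
    return np.tanh(conditioning + 0.9 * prev_sample)

samples = np.zeros(8 * FRAME)
samples[:FRAME] = rng.uniform(-0.1, 0.1, FRAME)   # seed the first frame
state = 0.0
for t in range(FRAME, len(samples)):
    if t % FRAME == 0:                   # slow timescale: 1 update / 16 samples
        state = frame_tier(state, samples[t - FRAME : t])
    samples[t] = sample_tier(state, samples[t - 1])
```

The slow tier takes long strides over the waveform, so long-range structure is carried by far fewer recurrent steps than in a flat per-sample RNN.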
Papers to check out
● WaveNet: A Generative Model for Raw Audio (Oord et al. 2016)
● Fast Wavenet Generation Algorithm (Paine et al. 2016)
● Deep Voice: Real-time Neural Text-to-Speech (Arik et al. 2017)
● A Neural Parametric Singing Synthesizer (Blaauw et al. 2017)
● SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (Mehri et al. 2017)
● Char2Wav: End-to-End Speech Synthesis (Sotelo et al. 2017)
Thanks!