Sample-based generative models for speech synthesis
Dmytro Bielievtsov @ IBDI
Frame-based business
1. Split the waveform into overlapping frames
2. Extract spectral features from each frame
3. Model the distribution of these parameters
4. Generate parameters
5. Convert parameters back to the waveform
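Steps 1 and 2 can be sketched in plain NumPy. The frame length, hop size, Hann window, and magnitude-spectrum features below are illustrative assumptions (any spectral parameterization works); steps 3–5, the statistical model and the inverse transform, are where the actual synthesis method lives and are omitted here:

```python
import numpy as np

def split_into_frames(waveform, frame_len=1024, hop=256):
    """Step 1: split the waveform into overlapping frames."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    return np.stack([waveform[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def extract_spectral_features(frames):
    """Step 2: magnitude spectrum of each windowed frame."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 220 * t)        # 1 s of a 220 Hz tone

frames = split_into_frames(waveform)          # (n_frames, 1024)
features = extract_spectral_features(frames)  # (n_frames, 513)
```

Note the "100x lower time resolution" claim on the next slide: with a 256-sample hop, the model sees ~62 frames per second instead of 16000 samples per second.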
Frame-based business
Pros:
● 100x lower time resolution
● Phase-invariant
● Naturally motivated
● Highly compressed
● Separated from pitch
Cons:
● Highly compressed (lossy)
● Synthesis introduces unnaturalness
WaveNet
● Deep
● Residual
● Convolutional
● Sample-based
● Probabilistic
● Conditional
● Generative
● Auto-regressive
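"Sample-based" and "probabilistic" here mean that each raw audio sample is modelled as a categorical variable: WaveNet mu-law compands the waveform to 256 discrete levels and predicts a softmax over them. A minimal sketch of the companding step (the exact constants follow the G.711 mu-law used in the paper):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand samples in [-1, 1] and quantize to mu + 1 discrete levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Map a quantized level back to an amplitude in [-1, 1]."""
    compressed = 2 * q.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

x = np.linspace(-1, 1, 1001)
q = mu_law_encode(x)          # integers in [0, 255]
x_hat = mu_law_decode(q)      # lossy round trip
```

The logarithmic companding spends more of the 256 levels near zero amplitude, where speech energy concentrates, which is why 8 bits suffice.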
How does it work?
Dilated causal convolutions
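A dilated causal convolution only looks at past samples, with taps spaced `dilation` steps apart; stacking layers with dilations 1, 2, 4, …, 512 gives a receptive field of 1024 samples with 2-tap filters. A minimal NumPy sketch (the 0.5/0.5 filter weights are arbitrary, just to check causality and receptive field):

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_i w[i] * x[t - i*dilation]; left-padding keeps it causal."""
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return sum(w[i] * xp[pad - i * dilation : pad - i * dilation + len(x)]
               for i in range(len(w)))

dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)      # 1024 samples for 2-tap filters

# Causality check: an impulse may only influence the present and the future.
x = np.zeros(2048)
x[1000] = 1.0
y = x
for d in dilations:
    y = causal_dilated_conv1d(y, np.array([0.5, 0.5]), d)
```

Doubling the dilation per layer makes the receptive field grow exponentially in depth while the parameter count grows only linearly.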
How does it work?
Trained like a CNN
Generates like an RNN
(with limited memory)
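Generation is sequential: one forward pass per sample, feeding each sampled value back in, with the model's "memory" limited to the receptive field. A toy sketch of that loop — `toy_next_sample_logits` is a hypothetical stand-in for a trained WaveNet, not the real network:

```python
import numpy as np

rng = np.random.default_rng(0)
receptive_field = 1024
n_levels = 256

def toy_next_sample_logits(buffer):
    """Stand-in (assumption) for a trained WaveNet: any function mapping the
    last `receptive_field` samples to logits over 256 quantized levels."""
    levels = np.arange(n_levels)
    return -0.05 * (levels - buffer[-1]) ** 2   # toy: stay near the last value

buffer = np.full(receptive_field, 128, dtype=np.int64)   # mu-law "silence"
generated = []
for _ in range(100):                     # one forward pass per sample
    logits = toy_next_sample_logits(buffer)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    sample = rng.choice(n_levels, p=p)   # sample from the softmax
    generated.append(sample)
    buffer = np.roll(buffer, -1)         # slide the limited-memory window
    buffer[-1] = sample
```

This per-sample loop is exactly why generation is slow: at 16 kHz, 16000 forward passes are needed per second of audio.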
So how is it?
Pros:
● Direct waveform generation
● State-of-the-art timbre quality
● CNN-like training
Cons:
● Slow generation (40x slower than realtime on commodity CPU) *
● Sensitive to local conditioning
● Large memory footprint
● Hard to interpret
● Missing details
Top layer activation
SampleRNN
Pros:
● Direct waveform generation
● Great modelling of long-range dependencies
● Reference impl. available
● Clear distinction between slow and fast time scales
Cons:
● Training an RNN on very long sequences can be tricky
● ???
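The "slow and fast time scales" point refers to SampleRNN's tiered structure: a frame-level module that updates once per frame conditions a sample-level module that runs at every sample. A toy sketch of that timescale split — the frame size, the scalar state, and both `tanh` update rules are illustrative assumptions standing in for the real RNN tiers:

```python
import numpy as np

FRAME = 16                               # assumed frame size for the sketch
rng = np.random.default_rng(0)

def frame_tier(prev_state, last_frame):
    """Slow tier: stand-in for the frame-level RNN, fires once per frame."""
    return np.tanh(0.5 * prev_state + last_frame.mean())

def sample_tier(conditioning, prev_sample):
    """Fast tier: stand-in for the sample-level module, fires every sample."""
    return np.tanh(conditioning + 0.9 * prev_sample)

samples = np.zeros(8 * FRAME)
samples[:FRAME] = rng.uniform(-0.1, 0.1, FRAME)   # seed the first frame
state = 0.0
for t in range(FRAME, len(samples)):
    if t % FRAME == 0:                   # slow timescale: 1 update / 16 samples
        state = frame_tier(state, samples[t - FRAME : t])
    samples[t] = sample_tier(state, samples[t - 1])
```

The slow tier takes long strides over the waveform, so long-range structure is carried by far fewer recurrent steps than in a flat per-sample RNN.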
Papers to check out
● WaveNet: A Generative Model for Raw Audio (Oord et al. 2016)
● Fast Wavenet Generation Algorithm (Paine et al. 2016)
● Deep Voice: Real-time Neural Text-to-Speech (Arik et al. 2017)
● A Neural Parametric Singing Synthesizer (Blaauw et al. 2017)
● SampleRNN: An Unconditional End-to-End Neural Audio Generation Model (Mehri et al. 2017)
● Char2Wav: End-to-End Speech Synthesis (Sotelo et al. 2017)
Thanks!