
Zhizheng Wu

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

Jul 07, 2024

PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Jul 03, 2024

AudioTime: A Temporally-aligned Audio-text Benchmark Dataset

Jul 03, 2024

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Jul 01, 2024

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Jun 19, 2024

Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis

Jun 16, 2024

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

Apr 26, 2024

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Mar 05, 2024

SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

Feb 20, 2024

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

Jan 22, 2024