Showing 1–15 of 15 results for author: Xin, D

Searching in archive eess.
  1. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  2. arXiv:2403.13353  [pdf, other]

    cs.SD eess.AS

    Building speech corpus with diverse voice characteristics for its prompt-based representation

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored the prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limit…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509

  3. arXiv:2403.03100  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di…

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  4. JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

    Authors: Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari

    Abstract: We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to…

    Submitted 9 October, 2023; originally announced October 2023.

  5. arXiv:2309.13509  [pdf, other]

    cs.SD eess.AS

    Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form d…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: Submitted to ASRU2023

  6. arXiv:2306.00697  [pdf, other]

    cs.CL cs.AI eess.AS

    How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

    Authors: Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari

    Abstract: We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents the finding…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  7. arXiv:2305.12445  [pdf, other]

    cs.SD eess.AS

    JNV Corpus: A Corpus of Japanese Nonverbal Vocalizations with Diverse Phrases and Emotions

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and support downstream tasks like emotion recognition. We first propose a corpus-design method that contains two phases: (1) collecting NVs phrases based…

    Submitted 21 May, 2023; originally announced May 2023.

    Comments: 4 pages, 3 figures

  8. arXiv:2305.12442  [pdf, other]

    cs.SD eess.AS

    Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

    Authors: Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari

    Abstract: We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild corpus comprising $3.5$ hours of laughter, which is to our best knowledge the largest laughter corpus designed for laughter synthesis. We then propo…

    Submitted 26 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  9. arXiv:2302.13652  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

    Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-spe…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  10. arXiv:2211.02336  [pdf, other]

    cs.SD eess.AS

    Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

    Authors: Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder an…

    Submitted 4 November, 2022; originally announced November 2022.

  11. arXiv:2210.09916  [pdf, other]

    cs.SD eess.AS

    Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's naturally sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) condition…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023. Demo: https://sarulab-speech.github.io/demo_mid-attribute-speaker-generation

  12. arXiv:2206.10695  [pdf, other]

    cs.SD eess.AS

    Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present an emotion recognition system for nonverbal vocalizations (NVs) submitted to the ExVo Few-Shot track of the ICML Expressive Vocalizations Competition 2022. The proposed method uses self-supervised learning (SSL) models to extract features from NVs and uses a classifier chain to model the label dependency between emotions. Experimental results demonstrate that the proposed method can sig…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Accepted by the ICML Expressive Vocalizations Workshop and Competition 2022

  13. arXiv:2204.10561  [pdf, other]

    cs.SD eess.AS

    Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

    Authors: Detai Xin, Shinnosuke Takamichi, Takuma Okamoto, Hisashi Kawai, Hiroshi Saruwatari

    Abstract: This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A s…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  14. arXiv:2204.02152  [pdf, other]

    cs.SD eess.AS

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

    Authors: Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tes…

    Submitted 29 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  15. arXiv:2103.06116  [pdf, other]

    eess.IV cs.CV cs.MM

    Spatial Attention-based Non-reference Perceptual Quality Prediction Network for Omnidirectional Images

    Authors: Li Yang, Mai Xu, Deng Xin, Bo Feng

    Abstract: Due to the strong correlation between visual attention and perceptual quality, many methods attempt to use human saliency information for image quality assessment. Although this mechanism can get good performance, the networks require human saliency labels, which is not easily accessible for omnidirectional images (ODI). To alleviate this issue, we propose a spatial attention-based perceptual qual…

    Submitted 10 March, 2021; originally announced March 2021.

    Comments: Accepted by IEEE ICME 2021