Showing 1–50 of 84 results for author: Yu, K

Searching in archive eess. Search in all archives.
  1. arXiv:2407.13198  [pdf, other]

    cs.SD eess.AS

    DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

    Authors: Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

    Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assist…

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.04219  [pdf, other]

    eess.AS

    Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter

    Authors: Yu Xi, Wen Ding, Kai Yu, Junjie Lai

    Abstract: Code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this paper, we propose to enhance CS-ASR systems by utilizing rich unsupervised monolingual speech data within a semi-supervised learning framework, particularly whe…

    Submitted 4 July, 2024; originally announced July 2024.

  3. arXiv:2407.03892  [pdf, other]

    cs.SD cs.AI eess.AS

    On the Effectiveness of Acoustic BPE in Decoder-Only TTS

    Authors: Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 5 pages, 3 tables, 1 figure. Accepted to Interspeech 2024

  4. arXiv:2406.12447  [pdf, other]

    eess.AS

    Text-aware Speech Separation for Multi-talker Keyword Spotting

    Authors: Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu

    Abstract: For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To ad…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  5. arXiv:2406.11546  [pdf, other]

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  6. arXiv:2406.10358  [pdf, other]

    cs.CR eess.SY

    I Still See You: Why Existing IoT Traffic Reshaping Fails

    Authors: Su Wang, Keyang Yu, Qi Li, Dong Chen

    Abstract: The Internet traffic data produced by the Internet of Things (IoT) devices are collected by Internet Service Providers (ISPs) and device manufacturers, and often shared with their third parties to maintain and enhance user services. Unfortunately, on-path adversaries could infer and fingerprint users' sensitive privacy information such as occupancy and user activities by analyzing these network tr…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: EWSN'24 paper accepted, to appear

  7. arXiv:2406.09317  [pdf, other]

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources…

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  8. arXiv:2406.08052  [pdf, other]

    cs.SD eess.AS

    FakeSound: Deepfake General Audio Detection

    Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

    Abstract: With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset n…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

    MSC Class: 68Txx

    ACM Class: I.2

  9. arXiv:2405.16062  [pdf, other]

    cs.IT eess.SP

    Movable Antenna Empowered Physical Layer Security Without Eve's CSI: Joint Optimization of Beamforming and Antenna Positions

    Authors: Zhiyong Feng, Yujia Zhao, Kan Yu, Dong Li

    Abstract: Physical layer security (PLS) technology based on the fixed-position antenna (FPA) has attracted widespread attention. Due to the fixed feature of the antennas, current FPA-based PLS schemes cannot fully utilize the spatial degree of freedom, and thus a weakened secure gain in the desired/undesired direction may exist. Different from the concept of FPA, mobile antenna (MA) is a novel technology th…

    Submitted 25 May, 2024; originally announced May 2024.

  10. arXiv:2405.06339  [pdf, other]

    eess.SP

    Performance Analysis of Uplink/Downlink Decoupled Access in Cellular-V2X Networks

    Authors: Luofang Jiao, Kai Yu, Jiacheng Chen, Tingting Liu, Haibo Zhou, Lin Cai

    Abstract: This paper firstly develops an analytical framework to investigate the performance of uplink (UL) / downlink (DL) decoupled access in cellular vehicle-to-everything (C-V2X) networks, in which a vehicle's UL/DL can be connected to different macro/small base stations (MBSs/SBSs) separately. Using the stochastic geometry analytical tool, the UL/DL decoupled access C-V2X is modeled as a Cox process, a…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: 15 pages, 10 figures

    Journal ref: Jiao L, Yu K, Chen J, et al. Performance Analysis of Uplink/Downlink Decoupled Access in Cellular-V2X Networks[J]. IEEE Transactions on Mobile Computing, 2023

  11. arXiv:2404.19723  [pdf, other]

    eess.AS cs.SD

    Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

    Authors: Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Recent popular decoder-only text-to-speech models are known for their ability of generating natural-sounding speech. However, such models sometimes suffer from word skipping and repeating due to the lack of explicit monotonic alignment constraints. In this paper, we notice from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech…

    Submitted 30 April, 2024; originally announced April 2024.

  12. arXiv:2404.19477  [pdf, other]

    eess.SP

    Hybrid Bit and Semantic Communications

    Authors: Kaiwen Yu, Renhe Fan, Gang Wu, Zhijin Qin

    Abstract: Semantic communication technology is regarded as a method surpassing the Shannon limit of bit transmission, capable of effectively enhancing transmission efficiency. However, current approaches that directly map content to transmission symbols are challenging to deploy in practice, imposing significant limitations on the development of semantic communication. To address this challenge, we propose…

    Submitted 30 April, 2024; originally announced April 2024.

  13. arXiv:2404.14946  [pdf, other]

    cs.SD cs.CL eess.AS

    StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

    Authors: Sen Liu, Yiwei Guo, Xie Chen, Kai Yu

    Abstract: While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly ETTS dataset that contains rich expressiveness both in acoustic and textual perspective, from the recording of a Mandarin storytelling show. A systematic and com…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted by ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11521-11525

  14. arXiv:2404.06079  [pdf, other]

    eess.AS cs.AI

    The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

    Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

    Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challen…

    Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: 5 pages, 3 figures. Report of a challenge

  15. arXiv:2404.05538  [pdf, other]

    cs.IT cs.LG eess.SP

    Cell-Free Multi-User MIMO Equalization via In-Context Learning

    Authors: Matteo Zecchin, Kai Yu, Osvaldo Simeone

    Abstract: Large pre-trained sequence models, such as transformers, excel as few-shot learners capable of in-context learning (ICL). In ICL, a model is trained to adapt its operation to a new task based on limited contextual information, typically in the form of a few training examples for the given task. Previous work has explored the use of ICL for channel equalization in single-user multi-input and multip…

    Submitted 11 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  16. arXiv:2403.13332  [pdf, other]

    eess.AS cs.SD

    TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

    Authors: Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

    Abstract: Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP2024

  17. arXiv:2403.04594  [pdf, other]

    cs.SD eess.AS

    A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds

    Authors: Xuenan Xu, Xiaohang Xu, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu

    Abstract: Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound even…

    Submitted 7 March, 2024; originally announced March 2024.

  18. arXiv:2403.01278  [pdf, other]

    cs.SD eess.AS

    Enhancing Audio Generation Diversity with Visual Information

    Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Audio and sound generation has garnered significant attention in recent years, with a primary focus on improving the quality of generated audios. However, there has been limited research on enhancing the diversity of generated audio, particularly when it comes to audio generation within specific categories. Current models tend to produce homogeneous audio samples within a category. This work aims…

    Submitted 2 March, 2024; originally announced March 2024.

    ACM Class: I.2

  19. Complete and Near-Optimal Robotic Crack Coverage and Filling in Civil Infrastructure

    Authors: Vishnu Veeraraghavan, Kyle Hunte, Jingang Yi, Kaiyan Yu

    Abstract: We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack mapping and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for mapping) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., crack) and employ classic…

    Submitted 1 March, 2024; originally announced March 2024.

    Journal ref: in IEEE Transactions on Robotics, vol. 40, pp. 2850-2867, 2024

  20. arXiv:2401.14321  [pdf, other]

    eess.AS cs.SD

    VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

    Authors: Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu

    Abstract: Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping and repeating. To address this limitation,…

    Submitted 29 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

  21. arXiv:2401.06485  [pdf, other]

    eess.AS cs.SD

    Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech

    Authors: Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, Kai Yu

    Abstract: Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-art…

    Submitted 12 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP2024

  22. arXiv:2401.02584  [pdf, other]

    cs.SD eess.AS

    Towards Weakly Supervised Text-to-Audio Grounding

    Authors: Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

    Abstract: Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable, and only the caption of a whole audio clip can be utilized for…

    Submitted 17 July, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

  23. arXiv:2312.09455  [pdf, other]

    cs.RO cs.CV eess.SP

    Integration of Robotics, Computer Vision, and Algorithm Design: A Chinese Poker Self-Playing Robot

    Authors: Kuan-Huang Yu

    Abstract: This paper presents Chinese Poker Self-Playing Robot, an integrated system enabling a TM5-900 robotic arm to independently play the four-person card game Chinese poker. The robot uses a custom sucker mechanism to pick up and play cards. An object detection model based on YOLOv5 is utilized to recognize the suit and number of 13 cards dealt to the robot. A greedy algorithm is developed to divide th…

    Submitted 28 November, 2023; originally announced December 2023.

    Comments: 7 pages, 9 figures

  24. arXiv:2312.08676  [pdf, other]

    cs.SD cs.CL eess.AS

    SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

    Authors: Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

    Abstract: Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the speaker embedding of the target speaker, the speaker similarity still lags behind the ground truth recordings. In this paper, we propose SEF-VC, a speaker embed…

    Submitted 30 January, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: 5 pages, 2 figures, accepted to ICASSP 2024

  25. arXiv:2311.01260  [pdf, other]

    eess.AS cs.AI cs.HC cs.SD

    Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

    Authors: Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu

    Abstract: Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2024

  26. arXiv:2310.14580  [pdf, other]

    cs.SD eess.AS

    Acoustic BPE for Speech Generation with Discrete Tokens

    Authors: Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling proces…

    Submitted 15 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: 5 pages, 2 figures; accepted to ICASSP 2024

  27. arXiv:2309.07377  [pdf, other]

    eess.AS cs.SD

    Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

    Authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

    Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speec…

    Submitted 14 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Accepted in ICASSP 2024

  28. arXiv:2309.05027  [pdf, other]

    eess.AS cs.AI cs.HC cs.SD

    VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

    Authors: Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

    Abstract: Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the…

    Submitted 16 January, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

    Comments: 4 figures, 5 pages, accepted to ICASSP 2024

  29. arXiv:2309.04655  [pdf]

    cs.RO cs.LG eess.SP eess.SY

    Intelligent upper-limb exoskeleton integrated with soft wearable bioelectronics and deep-learning for human intention-driven strength augmentation based on sensory feedback

    Authors: Jinwoo Lee, Kangkyu Kwon, Ira Soltis, Jared Matthews, Yoonjae Lee, Hojoong Kim, Lissette Romero, Nathan Zavanelli, Youngjin Kwon, Shinjae Kwon, Jimin Lee, Yewon Na, Sung Hoon Lee, Ki Jun Yu, Minoru Shinohara, Frank L. Hammond, Woon-Hong Yeo

    Abstract: The age and stroke-associated decline in musculoskeletal strength degrades the ability to perform daily human tasks using the upper extremities. Although there are a few examples of exoskeletons, they need manual operations due to the absence of sensor feedback and no intention prediction of movements. Here, we introduce an intelligent upper-limb exoskeleton system that uses cloud-based deep learn…

    Submitted 26 January, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: 15 pages, 6 figures, 1 table, published in npj Flexible Electronics

    MSC Class: 68T40 (Primary) 92C55; 68T99 (Secondary)

  30. arXiv:2306.14145  [pdf, other]

    cs.SD cs.CL eess.AS

    DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

    Authors: Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres (i.e. speaker similarity) and eliminate the accents from their first language (i.e. nativeness). In this paper, we demonstrated that vector-quantized (VQ) acoustic feature contains less speaker i…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  31. arXiv:2306.10090  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Improving Audio Caption Fluency with Automatic Error Correction

    Authors: Hanxue Zhang, Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced "false-repetition" errors due to the training objective. In such scenarios, we propose a new task of AAC error correction and hope to reduce such errors by post-processing AAC outputs. To tackle this pro…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted by NCMMSC 2022

  32. arXiv:2306.08903  [pdf, other]

    eess.SP

    Two-Way Semantic Transmission of Images without Feedback

    Authors: Kaiwen Yu, Qi He, Gang Wu

    Abstract: As a competitive technology for 6G, semantic communications can significantly improve transmission efficiency. However, many existing semantic communication systems require information feedback during the training coding process, resulting in a significant communication overhead. In this article, we consider a two-way semantic communication (TW-SC) system, where information feedback can be omitted…

    Submitted 15 June, 2023; originally announced June 2023.

  33. arXiv:2306.08588  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

    Authors: Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen

    Abstract: Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition. However, there remain several challenging scenarios that E2E models are not competent in, such as code-switching and named entity recognition (NER). Data augmentation is a common and effective practice for these two scenarios. However, the cu…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  34. UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu

    Abstract: The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted…

    Submitted 28 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted to AAAI 2024

  35. Enhance Temporal Relations in Audio Captioning with Sound Event Detection

    Authors: Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Automated audio captioning aims at generating natural language descriptions for given audio clips, not only detecting and classifying sounds, but also summarizing the relationships between audio events. Recent research advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events hav…

    Submitted 18 July, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Interspeech 2023

  36. arXiv:2305.01980  [pdf, other]

    cs.SD eess.AS

    Diverse and Vivid Sound Generation from Text Descriptions

    Authors: Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu

    Abstract: Previous audio generation mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language description as a clue to generate broad sounds. Unlike visual information, a text description is concise by its nature but has rich hidden meanings beneath, which poses a higher po…

    Submitted 3 May, 2023; originally announced May 2023.

  37. arXiv:2304.13121  [pdf, other]

    cs.SD eess.AS

    Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

    Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu

    Abstract: In this paper, we describe the systems developed by the SJTU X-LANCE team for LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted by ICASSP 2023 Special Session for Grand Challenges

  38. arXiv:2304.11750  [pdf, other]

    eess.AS cs.AI cs.HC cs.LG cs.SD

    DiffVoice: Text-to-Speech with Latent Diffusion

    Authors: Zhijun Liu, Yiwei Guo, Kai Yu

    Abstract: In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate…

    Submitted 23 April, 2023; originally announced April 2023.

    Comments: Accepted to ICASSP2023

  39. arXiv:2303.12479   

    eess.SP eess.SY

    Distributed Two-tier DRL Framework for Cell-Free Network: Association, Beamforming and Power Allocation

    Authors: Kaiwen Yu, Chonghao Zhao, Gang Wu, Geoffrey Ye Li

    Abstract: Intelligent wireless networks have long been expected to have self-configuration and self-optimization capabilities to adapt to various environments and demands. In this paper, we develop a novel distributed hierarchical deep reinforcement learning (DHDRL) framework with two-tier control networks in different timescales to optimize the long-term spectrum efficiency (SE) of the downlink cell-free m…

    Submitted 5 December, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

    Comments: The paper has some updates

  40. arXiv:2303.05322  [pdf, other]

    cs.SD cs.MM eess.AS

    Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation

    Authors: Qi Chen, Ziyang Ma, Tao Liu, Xu Tan, Qu Lu, Xie Chen, Kai Yu

    Abstract: Audio-driven talking face has attracted broad interest from academia and industry recently. However, data acquisition and labeling in audio-driven talking face are labor-intensive and costly. The lack of data resource results in poor synthesis effect. To alleviate this issue, we propose to use TTS (Text-To-Speech) for data augmentation to improve few-shot ability of the talking face system. The mi…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: 4 pages. Accepted by ICASSP 2023

  41. arXiv:2212.02164  [pdf, other]

    eess.SP

    Spectral Efficiency Analysis of Uplink-Downlink Decoupled Access in C-V2X Networks

    Authors: Luofang Jiao, Kai Yu, Yunting Xu, Tianqi Zhang, Haibo Zhou, Xuemin Shen

    Abstract: The uplink (UL)/downlink (DL) decoupled access has been emerging as a novel access architecture to improve the performance gains in cellular networks. In this paper, we investigate the UL/DL decoupled access performance in cellular vehicle-to-everything (C-V2X). We propose a unified analytical framework for the UL/DL decoupled access in C-V2X from the perspective of spectral efficiency (SE). By mo…

    Submitted 12 December, 2022; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: 6 pages, 5 figures, GLOBECOM 2022

  42. arXiv:2212.00330   

    eess.IV cs.CV

    Reliable Joint Segmentation of Retinal Edema Lesions in OCT Images

    Authors: Meng Wang, Kai Yu, Chun-Mei Feng, Ke Zou, Yanyu Xu, Qingquan Meng, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

    Abstract: Focusing on the complicated pathological features, such as blurred boundaries, severe scale differences between symptoms, background noise interference, etc., in the task of retinal edema lesions joint segmentation from OCT images and enabling the segmentation results more reliable. In this paper, we propose a novel reliable multi-scale wavelet-enhanced transformer network, which can provide accur…

    Submitted 1 January, 2024; v1 submitted 1 December, 2022; originally announced December 2022.

    Comments: Improving algorithm

  43. arXiv:2211.09496  [pdf, other]

    eess.AS cs.AI cs.HC cs.LG cs.SD

    EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

    Authors: Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

    Abstract: Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a p…

    Submitted 16 February, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  44. arXiv:2211.04304  [pdf, other]

    cs.SD cs.CL eess.AS

    BER: Balanced Error Rate For Speaker Diarization

    Authors: Tao Liu, Kai Yu

    Abstract: DER is the primary metric to evaluate diarization performance while facing a dilemma: the errors in short utterances or segments tend to be overwhelmed by longer ones. Short segments, e.g., 'yes' or 'no', still have semantic information. Besides, DER overlooks errors in less-talked speakers. Although JER balances speaker errors, it still suffers from the same dilemma. Considering all those aspects…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 5 pages, 2 figures

  45. arXiv:2208.01221  [pdf]

    cs.NI cs.AI cs.LG eess.SY

    Generative Adversarial Learning for Intelligent Trust Management in 6G Wireless Networks

    Authors: Liu Yang, Yun Li, Simon X. Yang, Yinzhi Lu, Tan Guo, Keping Yu

    Abstract: The emerging sixth generation (6G) is an integration of heterogeneous wireless networks that can seamlessly support anywhere and anytime networking. However, 6G must offer a high Quality-of-Trust to meet mobile user expectations. Artificial intelligence (AI) is considered one of the most important components of 6G, so AI-based trust management is a promising paradigm to provide trusted and re…

    Submitted 1 August, 2022; originally announced August 2022.

  46. arXiv:2207.09057  [pdf]

    cs.NI cs.AI eess.SY

    An Intelligent Trust Cloud Management Method for Secure Clustering in 5G enabled Internet of Medical Things

    Authors: Liu Yang, Keping Yu, Simon X. Yang, Chinmay Chakraborty, Yinzhi Lu, Tan Guo

    Abstract: 5G edge computing enabled Internet of Medical Things (IoMT) is an efficient technology to provide decentralized medical services, while device-to-device (D2D) communication is a promising paradigm for future 5G networks. To assure secure and reliable communication in 5G edge computing and D2D enabled IoMT systems, this paper presents an intelligent trust cloud management method. Firstly, an active…

    Submitted 19 July, 2022; originally announced July 2022.

  47. arXiv:2207.08226  [pdf]

    cs.NI cs.AI cs.SI eess.SY

    An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things

    Authors: Yinzhi Lu, Liu Yang, Simon X. Yang, Qiaozhi Hua, Arun Kumar Sangaiah, Tan Guo, Keping Yu

    Abstract: The edge enabled Industrial Internet of Things (IIoT) platform is of great significance to accelerate the development of smart industry. However, with the dramatic increase in real-time IIoT applications, it is a great challenge to support fast response time, low latency, and efficient bandwidth utilization. To address this issue, Time Sensitive Network (TSN) has recently been studied to realize low late…

    Submitted 17 July, 2022; originally announced July 2022.

  48. arXiv:2207.02957  [pdf, other]

    eess.IV cs.CV cs.LG

    Context-aware Self-supervised Learning for Medical Images Using Graph Neural Network

    Authors: Li Sun, Ke Yu, Kayhan Batmanghelich

    Abstract: Although self-supervised learning enables us to bootstrap the training by exploiting unlabeled data, the generic self-supervised methods for natural images do not sufficiently incorporate the context. For medical images, a desirable method should be sensitive enough to detect deviation from normal-appearing tissue of each anatomical region; here, anatomy is the context. We introduce a novel approa…

    Submitted 6 July, 2022; originally announced July 2022.

    Comments: Accepted by NeurIPS workshop 2020. arXiv admin note: substantial text overlap with arXiv:2012.06457

  49. arXiv:2205.13160  [pdf, other]

    cs.CR eess.SP

    Integration of Blockchain and Edge Computing in Internet of Things: A Survey

    Authors: He Xue, Dajiang Chen, Ning Zhang, Hong-Ning Dai, Keping Yu

    Abstract: As an important technology to ensure data security, consistency, traceability, etc., blockchain has been increasingly used in Internet of Things (IoT) applications. The integration of blockchain and edge computing can further improve the resource utilization in terms of network, computing, storage, and security. This paper aims to present a survey on the integration of blockchain and edge computin…

    Submitted 26 May, 2022; originally announced May 2022.

  50. arXiv:2205.05357  [pdf, other]

    cs.SD eess.AS

    Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning

    Authors: Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

    Abstract: Automated audio captioning (AAC), a task that mimics human perception as well as innovatively links audio processing and natural language processing, has seen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events and the temporal relationships between sound events and describing these elements with a fluent sentence. Currently, an encode…

    Submitted 15 November, 2023; v1 submitted 11 May, 2022; originally announced May 2022.