Search | arXiv e-print repository

DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Authors: Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assist… ▽ More Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a textaudio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase of diversity with the help of the guidance of visual information. △ Less

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.04219 [pdf, other]

Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter

Authors: Yu Xi, Wen Ding, Kai Yu, Junjie Lai

Abstract: Code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this paper, we propose to enhance CS-ASR systems by utilizing rich unsupervised monolingual speech data within a semi-supervised learning framework, particularly whe… ▽ More Code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this paper, we propose to enhance CS-ASR systems by utilizing rich unsupervised monolingual speech data within a semi-supervised learning framework, particularly when access to CS data is limited. To achieve this, we establish a general paradigm for applying noisy student training (NST) to the CS-ASR task. Specifically, we introduce the LLM-Filter, which leverages well-designed prompt templates to activate the correction capability of large language models (LLMs) for monolingual data selection and pseudo-labels refinement during NST. Our experiments on the supervised ASRU-CS and unsupervised AISHELL-2 and LibriSpeech datasets show that our method not only achieves significant improvements over supervised and semi-supervised learning baselines for the CS task, but also attains better performance compared with the fully-supervised oracle upper-bound on the CS English part. Additionally, we further investigate the influence of accent on AESRC dataset and demonstrate that our method can get achieve additional benefits when the monolingual data contains relevant linguistic characteristic. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03892 [pdf, other]

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

Authors: Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Abstract: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But… ▽ More Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But the gain in TTS has not been fully investigated, and the proper choice of acoustic BPE remains unclear. In this work, we conduct a comprehensive study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens. Experiments on LibriTTS verify that acoustic BPE uniformly increases the intelligibility and diversity of synthesized speech, while showing different features across BPE settings. Hence, acoustic BPE is a favorable tool for decoder-only TTS. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: 5 pages, 3 tables, 1 figures. accepted to Interspeech 2024

arXiv:2406.12447 [pdf, other]

Text-aware Speech Separation for Multi-talker Keyword Spotting

Authors: Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu

Abstract: For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To ad… ▽ More For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address it, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance the effectiveness. TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech results in further performance improvement. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH2024

arXiv:2406.11546 [pdf, other]

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee… ▽ More The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline uses Whisper for initial transcription and TorchAudio for forced alignment, combined with multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thus enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% model parameters. Furthermore, our ASR models trained on Gigaspeech 2 yield superior performance compared to commercial services. We believe that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.10358 [pdf, other]

I Still See You: Why Existing IoT Traffic Reshaping Fails

Authors: Su Wang, Keyang Yu, Qi Li, Dong Chen

Abstract: The Internet traffic data produced by the Internet of Things (IoT) devices are collected by Internet Service Providers (ISPs) and device manufacturers, and often shared with their third parties to maintain and enhance user services. Unfortunately, on-path adversaries could infer and fingerprint users' sensitive privacy information such as occupancy and user activities by analyzing these network tr… ▽ More The Internet traffic data produced by the Internet of Things (IoT) devices are collected by Internet Service Providers (ISPs) and device manufacturers, and often shared with their third parties to maintain and enhance user services. Unfortunately, on-path adversaries could infer and fingerprint users' sensitive privacy information such as occupancy and user activities by analyzing these network traffic traces. While there's a growing body of literature on defending against this side-channel attack-malicious IoT traffic analytics (TA), there's currently no systematic method to compare and evaluate the comprehensiveness of these existing studies. To address this problem, we design a new low-cost, open-source system framework-IoT Traffic Exposure Monitoring Toolkit (ITEMTK) that enables people to comprehensively examine and validate prior attack models and their defending approaches. In particular, we also design a novel image-based attack capable of inferring sensitive user information, even when users employ the most robust preventative measures in their smart homes. Researchers could leverage our new image-based attack to systematize and understand the existing literature on IoT traffic analysis attacks and preventing studies. Our results show that current defending approaches are not sufficient to protect IoT device user privacy. IoT devices are significantly vulnerable to our new image-based user privacy inference attacks, posing a grave threat to IoT device user privacy. We also highlight potential future improvements to enhance the defending approaches. ITEMTK's flexibility allows other researchers for easy expansion by integrating new TA attack models and prevention methods to benchmark their future work. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: EWSN'24 paper accepted, to appear

arXiv:2406.09317 [pdf, other]

Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham , et al. (24 additional authors not shown)

Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources… ▽ More Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered. △ Less

Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08052 [pdf, other]

FakeSound: Deepfake General Audio Detection

Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

Abstract: With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset n… ▽ More With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset named FakeSound for deepfake general audio detection is proposed, and samples can be viewed on website https://FakeSoundData.github.io. The average binary accuracy of humans on all test sets is consistently below 0.6, which indicates the difficulty humans face in discerning deepfake audio and affirms the efficacy of the FakeSound dataset. A deepfake detection model utilizing a general audio pre-trained model is proposed as a benchmark system. Experimental results demonstrate that the performance of the proposed model surpasses the state-of-the-art in deepfake speech detection and human testers. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

MSC Class: 68Txx ACM Class: I.2

arXiv:2405.16062 [pdf, other]

Movable Antenna Empowered Physical Layer Security Without Eve's CSI: Joint Optimization of Beamforming and Antenna Positions

Authors: Zhiyong Feng, Yujia Zhao, Kan Yu, Dong Li

Abstract: Physical layer security (PLS) technology based on the fixed-position antenna (FPA) has {attracted widespread attention}. Due to the fixed feature of the antennas, current FPA-based PLS schemes cannot fully utilize the spatial degree of freedom, and thus a weaken secure gain in the desired/undesired direction may exist. Different from the concept of FPA, mobile antenna (MA) is a novel technology th… ▽ More Physical layer security (PLS) technology based on the fixed-position antenna (FPA) has {attracted widespread attention}. Due to the fixed feature of the antennas, current FPA-based PLS schemes cannot fully utilize the spatial degree of freedom, and thus a weaken secure gain in the desired/undesired direction may exist. Different from the concept of FPA, mobile antenna (MA) is a novel technology that {reconfigures} the wireless channels and enhances the corresponding capacity through the flexible movement of antennas on a minor scale. MA-empowered PLS enjoys huge potential and deserves further investigation. In this paper, we, for the first time, investigate the secrecy performance of MA-enabled PLS system where a MA-based Alice transmits the confidential information to multiple single-antenna Bobs, in the presence of the single-antenna eavesdropper (Eve) {in the absence} of perfect channel state information (CSI). For the purpose of the secrecy rate maximization of the worst Bob, we jointly design the transmit beamforming and antenna positions at the Alice, subject to the minimum moving distance of the antenna, uncertainty CSI of Eve, and maximum transmit power. Furthermore, the projected gradient ascent (PGA), alternating optimization (AO), and simulated annealing (SA) {are} adopted to solve the non-convex characteristics of the problem of the secrecy rate maximization. Simulation results demonstrate the effectiveness and correctness of the proposed method. In particular, MA-enabled PLS scheme can significantly enhance the secrecy rate compared to the conventional FPA-based ones for different settings of key system parameters. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.06339 [pdf, other]

Performance Analysis of Uplink/Downlink Decoupled Access in Cellular-V2X Networks

Authors: Luofang Jiao, Kai Yu, Jiacheng Chen, Tingting Liu, Haibo Zhou, Lin Cai

Abstract: This paper firstly develops an analytical framework to investigate the performance of uplink (UL) / downlink (DL) decoupled access in cellular vehicle-to-everything (C-V2X) networks, in which a vehicle's UL/DL can be connected to different macro/small base stations (MBSs/SBSs) separately. Using the stochastic geometry analytical tool, the UL/DL decoupled access C-V2X is modeled as a Cox process, a… ▽ More This paper firstly develops an analytical framework to investigate the performance of uplink (UL) / downlink (DL) decoupled access in cellular vehicle-to-everything (C-V2X) networks, in which a vehicle's UL/DL can be connected to different macro/small base stations (MBSs/SBSs) separately. Using the stochastic geometry analytical tool, the UL/DL decoupled access C-V2X is modeled as a Cox process, and we obtain the following theoretical results, i.e., 1) the probability of different UL/DL joint association cases i.e., both the UL and DL are associated with the different MBSs or SBSs, or they are associated with different types of BSs; 2) the distance distribution of a vehicle to its serving BSs in each case; 3) the spectral efficiency of UL/DL in each case; and 4) the UL/DL coverage probability of MBS/SBS. The analyses reveal the insights and performance gain of UL/DL decoupled access. Through extensive simulations, \textcolor{black}{the accuracy of the proposed analytical framework is validated.} Both the analytical and simulation results show that UL/DL decoupled access can improve spectral efficiency. The theoretical results can be directly used for estimating the statistical performance of a UL/DL decoupled access C-V2X network. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: 15 pages, 10 figures

Journal ref: Jiao L, Yu K, Chen J, et al. Performance Analysis of Uplink/Downlink Decoupled Access in Cellular-V2X Networks[J]. IEEE Transactions on Mobile Computing, 2023

arXiv:2404.19723 [pdf, other]

Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

Authors: Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Abstract: Recent popular decoder-only text-to-speech models are known for their ability of generating natural-sounding speech. However, such models sometimes suffer from word skipping and repeating due to the lack of explicit monotonic alignment constraints. In this paper, we notice from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech… ▽ More Recent popular decoder-only text-to-speech models are known for their ability of generating natural-sounding speech. However, such models sometimes suffer from word skipping and repeating due to the lack of explicit monotonic alignment constraints. In this paper, we notice from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech and text. We call the attention maps of those heads Alignment-Emerged Attention Maps (AEAMs). Based on this discovery, we propose a novel inference method without altering the training process, named Attention-Constrained Inference (ACI), to facilitate monotonic synthesis. It first identifies AEAMs using the Attention Sweeping algorithm and then applies constraining masks on AEAMs. Our experimental results on decoder-only TTS model VALL-E show that the WER of synthesized speech is reduced by up to 20.5% relatively with ACI while the naturalness and speaker similarity are comparable. △ Less

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.19477 [pdf, other]

Hybrid Bit and Semantic Communications

Authors: Kaiwen Yu, Renhe Fan, Gang Wu, Zhijin Qin

Abstract: Semantic communication technology is regarded as a method surpassing the Shannon limit of bit transmission, capable of effectively enhancing transmission efficiency. However, current approaches that directly map content to transmission symbols are challenging to deploy in practice, imposing significant limitations on the development of semantic communication. To address this challenge, we propose… ▽ More Semantic communication technology is regarded as a method surpassing the Shannon limit of bit transmission, capable of effectively enhancing transmission efficiency. However, current approaches that directly map content to transmission symbols are challenging to deploy in practice, imposing significant limitations on the development of semantic communication. To address this challenge, we propose a hybrid bit and semantic communication system, named HybridBSC, in which encoded semantic information is inserted into bit information for transmission via conventional digital communication systems utilizing same spectrum resources. The system can be easily deployed using existing communication architecture to achieve bit and semantic information transmission. Particularly, we design a semantic insertion and extraction scheme to implement this strategy. Furthermore, we conduct experimental validation based on the pluto-based software defined radio (SDR) platform in a real wireless channel, demonstrating that the proposed strategy can simultaneously transmit semantic and bit information. △ Less

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.14946 [pdf, other]

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

Authors: Sen Liu, Yiwei Guo, Xie Chen, Kai Yu

Abstract: While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly ETTS dataset that contains rich expressiveness both in acoustic and textual perspective, from the recording of a Mandarin storytelling show. A systematic and com… ▽ More While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly ETTS dataset that contains rich expressiveness both in acoustic and textual perspective, from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. We analyze and define speech-related textual expressiveness in StoryTTS to include five distinct dimensions through linguistics, rhetoric, etc. Then we employ large language models and prompt them with a few manual annotation examples for batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations. Therefore, StoryTTS can aid future ETTS research to fully mine the abundant intrinsic textual and acoustic features. Experiments are conducted to validate that TTS models can generate speech with improved expressiveness when integrating with the annotated textual labels in StoryTTS. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: Accepted by ICASSP 2024

Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11521-11525

arXiv:2404.06079 [pdf, other]

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challen… ▽ More Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we achieved 1st rank on the leaderboard in the TTS track both with the whole training set and only 1h training data, with the highest UTMOS score and lowest bitrate among all submissions. △ Less

Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

Comments: 5 pages, 3 figures. Report of a challenge

arXiv:2404.05538 [pdf, other]

Cell-Free Multi-User MIMO Equalization via In-Context Learning

Authors: Matteo Zecchin, Kai Yu, Osvaldo Simeone

Abstract: Large pre-trained sequence models, such as transformers, excel as few-shot learners capable of in-context learning (ICL). In ICL, a model is trained to adapt its operation to a new task based on limited contextual information, typically in the form of a few training examples for the given task. Previous work has explored the use of ICL for channel equalization in single-user multi-input and multip… ▽ More Large pre-trained sequence models, such as transformers, excel as few-shot learners capable of in-context learning (ICL). In ICL, a model is trained to adapt its operation to a new task based on limited contextual information, typically in the form of a few training examples for the given task. Previous work has explored the use of ICL for channel equalization in single-user multi-input and multiple-output (MIMO) systems. In this work, we demonstrate that ICL can be also used to tackle the problem of multi-user equalization in cell-free MIMO systems with limited fronthaul capacity. In this scenario, a task is defined by channel statistics, signal-to-noise ratio, and modulation schemes. The context encompasses the users' pilot sequences, the corresponding quantized received signals, and the current received data signal. Different prompt design strategies are proposed and evaluated that encompass also large-scale fading and modulation information. Experiments demonstrate that ICL-based equalization provides estimates with lower mean squared error as compared to the linear minimum mean squared error equalizer, especially in the presence of limited fronthaul capacity and pilot contamination. △ Less

Submitted 11 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2403.13332 [pdf, other]

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Authors: Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

Abstract: Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT… ▽ More Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS. △ Less

Submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted by ICASSP2024

arXiv:2403.04594 [pdf, other]

A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds

Authors: Xuenan Xu, Xiaohang Xu, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu

Abstract: Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound even… ▽ More Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on the analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the property that sounds can be mixed and concatenated in the time domain, we control details in four aspects: temporal relationship, loudness, speaker identity, and occurrence number, in simulating audio mixtures. Corresponding details are transformed into captions by large language models. Audio-text pairs with rich details in text descriptions are thereby obtained. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.01278 [pdf, other]

Enhancing Audio Generation Diversity with Visual Information

Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Mengyue Wu, Kai Yu

Abstract: Audio and sound generation has garnered significant attention in recent years, with a primary focus on improving the quality of generated audios. However, there has been limited research on enhancing the diversity of generated audio, particularly when it comes to audio generation within specific categories. Current models tend to produce homogeneous audio samples within a category. This work aims… ▽ More Audio and sound generation has garnered significant attention in recent years, with a primary focus on improving the quality of generated audios. However, there has been limited research on enhancing the diversity of generated audio, particularly when it comes to audio generation within specific categories. Current models tend to produce homogeneous audio samples within a category. This work aims to address this limitation by improving the diversity of generated audio with visual information. We propose a clustering-based method, leveraging visual information to guide the model in generating distinct audio content within each category. Results on seven categories indicate that extra visual input can largely enhance audio generation diversity. Audio samples are available at https://zeyuxie29.github.io/DiverseAudioGeneration. △ Less

Submitted 2 March, 2024; originally announced March 2024.

ACM Class: I.2

arXiv:2403.00613 [pdf, other]

doi 10.1109/TRO.2024.3392077

Complete and Near-Optimal Robotic Crack Coverage and Filling in Civil Infrastructure

Authors: Vishnu Veeraraghavan, Kyle Hunte, Jingang Yi, Kaiyan Yu

Abstract: We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack mapping and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for mapping) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., crack) and employ classic… ▽ More We present a simultaneous sensor-based inspection and footprint coverage (SIFC) planning and control design with applications to autonomous robotic crack mapping and filling. The main challenge of the SIFC problem lies in the coupling of complete sensing (for mapping) and robotic footprint (for filling) coverage tasks. Initially, we assume known target information (e.g., crack) and employ classic cell decomposition methods to achieve complete sensing coverage of the workspace and complete robotic footprint coverage using the least-cost route. Subsequently, we generalize the algorithm to handle unknown target information, allowing the robot to scan and incrementally construct the target graph online while conducting robotic footprint coverage. The online polynomial-time SIFC planning algorithm minimizes the total robot traveling distance, guarantees complete sensing coverage of the entire workspace, and achieves near-optimal robotic footprint coverage, as demonstrated through empirical experiments. For the demonstrated application, we design coordinated nozzle motion control with the planned robot trajectory to efficiently fill all cracks within the robot's footprint. Experimental results are presented to illustrate the algorithm's design, performance, and comparisons. The SIFC algorithm offers a high-efficiency motion planning solution for various robotic applications requiring simultaneous sensing and actuation coverage. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Journal ref: in IEEE Transactions on Robotics, vol. 40, pp. 2850-2867, 2024

arXiv:2401.14321 [pdf, other]

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

Authors: Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu

Abstract: Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping and repeating. To address this limitation,… ▽ More Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot adaptation given a speech prompt. However, such decoder-only TTS models lack monotonic alignment constraints, sometimes leading to hallucination issues such as mispronunciation, word skipping and repeating. To address this limitation, we propose VALL-T, a generative Transducer model that introduces shifting relative position embeddings for input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the architecture of decoder-only Transformer. Consequently, VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations with a relative reduction of 28.3% in the word error rate. Furthermore, the controllability of alignment in VALL-T during decoding facilitates the use of untranscribed speech prompts, even in unknown languages. It also enables the synthesis of lengthy speech by utilizing an aligned context window. △ Less

Submitted 29 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

arXiv:2401.06485 [pdf, other]

Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech

Authors: Yu Xi, Baochen Yang, Hao Li, Jiaqi Guo, Kai Yu

Abstract: Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-art… ▽ More Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may consequently trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach to learning keyword representation with both audio-text matching and audio-audio discrimination ability. Here, an InfoNCE loss considering both audio-audio and audio-text CL data pairs is employed for each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the use of sliding-window level InfoNCE loss yields comparable performance compared to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves significant performance gain over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, the end-to-end KWS with CLAD achieves not only better performance, but also significant speed-up. △ Less

Submitted 12 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP2024

arXiv:2401.02584 [pdf, other]

Towards Weakly Supervised Text-to-Audio Grounding

Authors: Xuenan Xu, Ziyang Ma, Mengyue Wu, Kai Yu

Abstract: Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable, and only the caption of a whole audio clip can be utilized for… ▽ More Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable, and only the caption of a whole audio clip can be utilized for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG to use matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms the previous WSTAG SOTA. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG. The code and model is available at https://github.com/wsntxxn/TextToAudioGrounding. △ Less

Submitted 17 July, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

arXiv:2312.09455 [pdf, other]

Integration of Robotics, Computer Vision, and Algorithm Design: A Chinese Poker Self-Playing Robot

Authors: Kuan-Huang Yu

Abstract: This paper presents Chinese Poker Self-Playing Robot, an integrated system enabling a TM5-900 robotic arm to independently play the four-person card game Chinese poker. The robot uses a custom sucker mechanism to pick up and play cards. An object detection model based on YOLOv5 is utilized to recognize the suit and number of 13 cards dealt to the robot. A greedy algorithm is developed to divide th… ▽ More This paper presents Chinese Poker Self-Playing Robot, an integrated system enabling a TM5-900 robotic arm to independently play the four-person card game Chinese poker. The robot uses a custom sucker mechanism to pick up and play cards. An object detection model based on YOLOv5 is utilized to recognize the suit and number of 13 cards dealt to the robot. A greedy algorithm is developed to divide the 13 cards into optimal hands of 3, 5, and 5 cards to play. Experiments demonstrate that the robot can successfully obtain the cards, identify them using computer vision, strategically select hands to play using the algorithm, and physically play the selected cards in the game. The system showcases effective integration of mechanical design, computer vision, algorithm design, and robotic control to accomplish the complex task of independently playing cards. △ Less

Submitted 28 November, 2023; originally announced December 2023.

Comments: 7 pages, 9 figures

arXiv:2312.08676 [pdf, other]

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

Authors: Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

Abstract: Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the speaker embedding of the target speaker, the speaker similarity still lags behind the ground truth recordings. In this paper, we propose SEF-VC, a speaker embed… ▽ More Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the speaker embedding of the target speaker, the speaker similarity still lags behind the ground truth recordings. In this paper, we propose SEF-VC, a speaker embedding free voice conversion model, which is designed to learn and incorporate speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism, and then reconstruct waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate the superiority of SEF-VC to generate high-quality speech with better similarity to target reference than strong zero-shot VC baselines, even for very short reference speeches. △ Less

Submitted 30 January, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: 5 pages, 2 figures, accepted to ICASSP 2024

arXiv:2311.01260 [pdf, other]

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

Authors: Hanglei Zhang, Yiwei Guo, Sen Liu, Xie Chen, Kai Yu

Abstract: Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire… ▽ More Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even artistic attributes. Recent advancements in expressive TTS empower users with the ability to directly control synthesis style through natural language prompts. However, these methods often require excessive training with a significant amount of style-annotated data, which can be challenging to acquire. Moreover, they may have limited adaptability due to fixed style annotations. In this work, we present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations. Our approach utilizes a large language model (LLM) to transform expressive TTS into a style retrieval task. The LLM selects the best-matching style references from annotated utterances based on external style prompts, which can be raw input text or natural language style descriptions. The selected reference guides the TTS pipeline to synthesize speeches with the intended style. This innovative approach provides flexible, versatile, and precise style control with minimal human workload. Experiments on a Mandarin storytelling corpus demonstrate FS-TTS's proficiency in leveraging LLM's semantic inference ability to retrieve desired styles from either input text or user-defined descriptions. This results in synthetic speeches that are closely aligned with the specified styles. △ Less

Submitted 2 November, 2023; originally announced November 2023.

Comments: 5 pages,3 figures, submitted to ICASSP 2024

arXiv:2310.14580 [pdf, other]

Acoustic BPE for Speech Generation with Discrete Tokens

Authors: Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Abstract: Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling proces… ▽ More Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by rich-diversity TTS system. Experiments prove that rescore selection aligns closely with human preference, which highlights acoustic BPE's potential to other speech generation tasks. △ Less

Submitted 15 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: 5 pages, 2 figures; accepted to ICASSP 2024

arXiv:2309.07377 [pdf, other]

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Authors: Yifan Yang, Feiyu Shen, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

Abstract: Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speec… ▽ More Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available at https://github.com/k2-fsa/icefall. △ Less

Submitted 14 December, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Accepted in ICASSP 2024

arXiv:2309.05027 [pdf, other]

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Authors: Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

Abstract: Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the… ▽ More Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow. △ Less

Submitted 16 January, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

Comments: 4 figure, 5 pages, accepted to ICASSP 2024

arXiv:2309.04655 [pdf]

Intelligent upper-limb exoskeleton integrated with soft wearable bioelectronics and deep-learning for human intention-driven strength augmentation based on sensory feedback

Authors: Jinwoo Lee, Kangkyu Kwon, Ira Soltis, Jared Matthews, Yoonjae Lee, Hojoong Kim, Lissette Romero, Nathan Zavanelli, Youngjin Kwon, Shinjae Kwon, Jimin Lee, Yewon Na, Sung Hoon Lee, Ki Jun Yu, Minoru Shinohara, Frank L. Hammond, Woon-Hong Yeo

Abstract: The age and stroke-associated decline in musculoskeletal strength degrades the ability to perform daily human tasks using the upper extremities. Although there are a few examples of exoskeletons, they need manual operations due to the absence of sensor feedback and no intention prediction of movements. Here, we introduce an intelligent upper-limb exoskeleton system that uses cloud-based deep learn… ▽ More The age and stroke-associated decline in musculoskeletal strength degrades the ability to perform daily human tasks using the upper extremities. Although there are a few examples of exoskeletons, they need manual operations due to the absence of sensor feedback and no intention prediction of movements. Here, we introduce an intelligent upper-limb exoskeleton system that uses cloud-based deep learning to predict human intention for strength augmentation. The embedded soft wearable sensors provide sensory feedback by collecting real-time muscle signals, which are simultaneously computed to determine the user's intended movement. The cloud-based deep-learning predicts four upper-limb joint motions with an average accuracy of 96.2% at a 200-250 millisecond response rate, suggesting that the exoskeleton operates just by human intention. In addition, an array of soft pneumatics assists the intended movements by providing 897 newton of force and 78.7 millimeter of displacement at maximum. Collectively, the intent-driven exoskeleton can augment human strength by 5.15 times on average compared to the unassisted exoskeleton. This report demonstrates an exoskeleton robot that augments the upper-limb joint movements by human intention based on a machine-learning cloud computing and sensory feedback. △ Less

Submitted 26 January, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

Comments: 15 pages, 6 figures, 1 table, published in npj flexible electronics journals

MSC Class: 68T40 (Primary) 92C55; 68T99 (Secondary)

arXiv:2306.14145 [pdf, other]

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

Authors: Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Abstract: Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres(i.e. speaker similarity) and eliminate the accents from their first language(i.e. nativeness). In this paper, we demonstrated that vector-quantized(VQ) acoustic feature contains less speaker i… ▽ More Although high-fidelity speech can be obtained for intralingual speech synthesis, cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres(i.e. speaker similarity) and eliminate the accents from their first language(i.e. nativeness). In this paper, we demonstrated that vector-quantized(VQ) acoustic feature contains less speaker information than mel-spectrogram. Based on this finding, we propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style. Here, one embedding is fed to the acoustic model to learn the linguistic speaking style, while the other one is integrated into the vocoder to mimic the target speaker's timbre. Experiments show that by combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis, especially in terms of nativeness. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: Accepted to Interspeech 2023

arXiv:2306.10090 [pdf, other]

Improving Audio Caption Fluency with Automatic Error Correction

Authors: Hanxue Zhang, Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

Abstract: Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced ``false-repetition'' errors due to the training objective. In such scenarios, we propose a new task of AAC error correction and hope to reduce such errors by post-processing AAC outputs. To tackle this pro… ▽ More Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced ``false-repetition'' errors due to the training objective. In such scenarios, we propose a new task of AAC error correction and hope to reduce such errors by post-processing AAC outputs. To tackle this problem, we use observation-based rules to corrupt captions without errors, for pseudo grammatically-erroneous sentence generation. One pair of corrupted and clean sentences can thus be used for training. We train a neural network-based model on the synthetic error dataset and apply the model to correct real errors in AAC outputs. Results on two benchmark datasets indicate that our approach significantly improves fluency while maintaining semantic information. △ Less

Submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted by NCMMSC 2022

arXiv:2306.08903 [pdf, other]

Two-Way Semantic Transmission of Images without Feedback

Authors: Kaiwen Yu, Qi He, Gang Wu

Abstract: As a competitive technology for 6G, semantic communications can significantly improve transmission efficiency. However, many existing semantic communication systems require information feedback during the training coding process, resulting in a significant communication overhead. In this article, we consider a two-way semantic communication (TW-SC) system, where information feedback can be omitted… ▽ More As a competitive technology for 6G, semantic communications can significantly improve transmission efficiency. However, many existing semantic communication systems require information feedback during the training coding process, resulting in a significant communication overhead. In this article, we consider a two-way semantic communication (TW-SC) system, where information feedback can be omitted by exploiting the weight reciprocity in the transceiver. Particularly, the channel simulator and semantic transceiver are implemented on both TW-SC nodes and the channel distribution is modeled by a conditional generative adversarial network. Simulation results demonstrate that the proposed TW-SC system performs closing to the state-of-the-art one-way semantic communication systems but requiring no feedback between the transceiver in the training process. △ Less

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.08588 [pdf, other]

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Authors: Zheng Liang, Zheshu Song, Ziyang Ma, Chenpeng Du, Kai Yu, Xie Chen

Abstract: Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition. However, there remain several challenging scenarios that E2E models are not competent in, such as code-switching and named entity recognition (NER). Data augmentation is a common and effective practice for these two scenarios. However, the cu… ▽ More Recently, end-to-end (E2E) automatic speech recognition (ASR) models have made great strides and exhibit excellent performance in general speech recognition. However, there remain several challenging scenarios that E2E models are not competent in, such as code-switching and named entity recognition (NER). Data augmentation is a common and effective practice for these two scenarios. However, the current data augmentation methods mainly rely on audio splicing and text-to-speech (TTS) models, which might result in discontinuous, unrealistic, and less diversified speech. To mitigate these potential issues, we propose a novel data augmentation method by applying the text-based speech editing model. The augmented speech from speech editing systems is more coherent and diversified, also more akin to real speech. The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems. △ Less

Submitted 14 June, 2023; originally announced June 2023.

Comments: Accepted by Interspeech 2023

arXiv:2306.07547 [pdf, other]

doi 10.1609/aaai.v38i16.29747

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu

Abstract: The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted… ▽ More The utilization of discrete speech tokens, divided into semantic tokens and acoustic tokens, has been proven superior to traditional acoustic feature mel-spectrograms in terms of naturalness and robustness for text-to-speech (TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow zero-shot speaker adaptation through auto-regressive (AR) continuation of acoustic tokens extracted from a short speech prompt. However, these AR models are restricted to generate speech only in a left-to-right direction, making them unsuitable for speech editing where both preceding and following contexts are provided. Furthermore, these models rely on acoustic tokens, which have audio quality limitations imposed by the performance of audio codec models. In this study, we propose a unified context-aware TTS framework called UniCATS, which is capable of both speech continuation and editing. UniCATS comprises two components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav. CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the input text, enabling it to incorporate the semantic context and maintain seamless concatenation with the surrounding context. Following that, CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into waveforms, taking into consideration the acoustic context. Our experimental results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms of speech resynthesis from semantic tokens. Moreover, we show that UniCATS achieves state-of-the-art performance in both speech continuation and editing. △ Less

Submitted 28 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: Accepted to AAAI 2024

arXiv:2306.01533 [pdf, other]

doi 10.21437/Interspeech.2023-1614

Enhance Temporal Relations in Audio Captioning with Sound Event Detection

Authors: Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

Abstract: Automated audio captioning aims at generating natural language descriptions for given audio clips, not only detecting and classifying sounds, but also summarizing the relationships between audio events. Recent research advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events hav… ▽ More Automated audio captioning aims at generating natural language descriptions for given audio clips, not only detecting and classifying sounds, but also summarizing the relationships between audio events. Recent research advances in audio captioning have introduced additional guidance to improve the accuracy of audio events in generated sentences. However, temporal relations between audio events have received little attention while revealing complex relations is a key component in summarizing audio content. Therefore, this paper aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events' timestamps. We investigate the best approach to integrate temporal information in a captioning model and propose a temporal tag system to transform the timestamps into comprehensible relations. Results evaluated by the proposed temporal metrics suggest that great improvement is achieved in terms of temporal relation generation. △ Less

Submitted 18 July, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: Interspeech 2023

arXiv:2305.01980 [pdf, other]

Diverse and Vivid Sound Generation from Text Descriptions

Authors: Guangwei Li, Xuenan Xu, Lingfeng Dai, Mengyue Wu, Kai Yu

Abstract: Previous audio generation mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language description as a clue to generate broad sounds. Unlike visual information, a text description is concise by its nature but has rich hidden meanings beneath, which poses a higher po… ▽ More Previous audio generation mainly focuses on specified sound classes such as speech or music, whose form and content are greatly restricted. In this paper, we go beyond specific audio generation by using natural language description as a clue to generate broad sounds. Unlike visual information, a text description is concise by its nature but has rich hidden meanings beneath, which poses a higher possibility and complexity on the audio to be generated. A Variation-Quantized GAN is used to train a codebook learning discrete representations of spectrograms. For a given text description, its pre-trained embedding is fed to a Transformer to sample codebook indices to decode a spectrogram to be further transformed into waveform by a melgan vocoder. The generated waveform has high quality and fidelity while excellently corresponding to the given text. Experiments show that our proposed method is capable of generating natural, vivid audios, achieving superb quantitative and qualitative results. △ Less

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2304.13121 [pdf, other]

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

Authors: Chenpeng Du, Yiwei Guo, Feiyu Shen, Kai Yu

Abstract: In this paper, we describe the systems developed by the SJTU X-LANCE team for LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours… ▽ More In this paper, we describe the systems developed by the SJTU X-LANCE team for LIMMITS 2023 Challenge, and we mainly focus on the winning system on naturalness for track 1. The aim of this challenge is to build a multi-speaker multi-lingual text-to-speech (TTS) system for Marathi, Hindi and Telugu. Each of the languages has a male and a female speaker in the given dataset. In track 1, only 5 hours data from each speaker can be selected to train the TTS model. Our system is based on the recently proposed VQTTS that utilizes VQ acoustic feature rather than mel-spectrogram. We introduce additional speaker embeddings and language embeddings to VQTTS for controlling the speaker and language information. In the cross-lingual evaluations where we need to synthesize speech in a cross-lingual speaker's voice, we provide a native speaker's embedding to the acoustic model and the target speaker's embedding to the vocoder. In the subjective MOS listening test on naturalness, our system achieves 4.77 which ranks first. △ Less

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023 Special Session for Grand Challenges

arXiv:2304.11750 [pdf, other]

DiffVoice: Text-to-Speech with Latent Diffusion

Authors: Zhijun Liu, Yiwei Guo, Kai Yu

Abstract: In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate… ▽ More In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse problem solving algorithms for diffusion models, DiffVoice achieves the state-of-the-art performance in text-based speech editing, and zero-shot adaptation. △ Less

Submitted 23 April, 2023; originally announced April 2023.

Comments: Accepted to ICASSP2023

arXiv:2303.12479

Distributed Two-tier DRL Framework for Cell-Free Network: Association, Beamforming and Power Allocation

Authors: Kaiwen Yu, Chonghao Zhao, Gang Wu, Geoffrey Ye Li

Abstract: Intelligent wireless networks have long been expected to have self-configuration and self-optimization capabilities to adapt to various environments and demands. In this paper, we develop a novel distributed hierarchical deep reinforcement learning (DHDRL) framework with two-tier control networks in different timescales to optimize the long-term spectrum efficiency (SE) of the downlink cell-free m… ▽ More Intelligent wireless networks have long been expected to have self-configuration and self-optimization capabilities to adapt to various environments and demands. In this paper, we develop a novel distributed hierarchical deep reinforcement learning (DHDRL) framework with two-tier control networks in different timescales to optimize the long-term spectrum efficiency (SE) of the downlink cell-free multiple-input single-output (MISO) network, consisting of multiple distributed access points (AP) and user terminals (UT). To realize the proposed two-tier control strategy, we decompose the optimization problem into two sub-problems, AP-UT association (AUA) as well as beamforming and power allocation (BPA), resulting in a Markov decision process (MDP) and Partially Observable MDP (POMDP). The proposed method consists of two neural networks. At the system level, a distributed high-level neural network is introduced to optimize wireless network structure on a large timescale. While at the link level, a distributed low-level neural network is proposed to mitigate inter-AP interference and improve the transmission performance on a small timescale. Numerical results show that our method is effective for high-dimensional problems, in terms of spectrum efficiency, signaling overhead as well as satisfaction probability, and generalize well to diverse multi-object problems. △ Less

Submitted 5 December, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: The paper has some updated

arXiv:2303.05322 [pdf, other]

Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation

Authors: Qi Chen, Ziyang Ma, Tao Liu, Xu Tan, Qu Lu, Xie Chen, Kai Yu

Abstract: Audio-driven talking face has attracted broad interest from academia and industry recently. However, data acquisition and labeling in audio-driven talking face are labor-intensive and costly. The lack of data resource results in poor synthesis effect. To alleviate this issue, we propose to use TTS (Text-To-Speech) for data augmentation to improve few-shot ability of the talking face system. The mi… ▽ More Audio-driven talking face has attracted broad interest from academia and industry recently. However, data acquisition and labeling in audio-driven talking face are labor-intensive and costly. The lack of data resource results in poor synthesis effect. To alleviate this issue, we propose to use TTS (Text-To-Speech) for data augmentation to improve few-shot ability of the talking face system. The misalignment problem brought by the TTS audio is solved with the introduction of soft-DTW, which is first adopted in the talking face task. Moreover, features extracted by HuBERT are explored to utilize underlying information of audio, and found to be superior over other features. The proposed method achieves 17%, 14%, 38% dominance on MSE score, DTW score and user study preference repectively over the baseline model, which shows the effectiveness of improving few-shot learning for talking face system with TTS augmentation. △ Less

Submitted 9 March, 2023; originally announced March 2023.

Comments: 4 pages. Accepted by ICASSP 2023

arXiv:2212.02164 [pdf, other]

Spectral Efficiency Analysis of Uplink-Downlink Decoupled Access in C-V2X Networks

Authors: Luofang Jiao, Kai Yu, Yunting Xu, Tianqi Zhang, Haibo Zhou, Xuemin, Shen

Abstract: The uplink (UL)/downlink (DL) decoupled access has been emerging as a novel access architecture to improve the performance gains in cellular networks. In this paper, we investigate the UL/DL decoupled access performance in cellular vehicle-to-everything (C-V2X). We propose a unified analytical framework for the UL/DL decoupled access in C-V2X from the perspective of spectral efficiency (SE). By mo… ▽ More The uplink (UL)/downlink (DL) decoupled access has been emerging as a novel access architecture to improve the performance gains in cellular networks. In this paper, we investigate the UL/DL decoupled access performance in cellular vehicle-to-everything (C-V2X). We propose a unified analytical framework for the UL/DL decoupled access in C-V2X from the perspective of spectral efficiency (SE). By modeling the UL/DL decoupled access C-V2X as a Cox process and leveraging the stochastic geometry, we obtain the joint association probability, the UL/DL distance distributions to serving base stations and the SE for the UL/DL decoupled access in C-V2X networks with different association cases. We conduct extensive Monte Carlo simulations to verify the accuracy of the proposed unified analytical framework, and the results show a better system average SE of UL/DL decoupled access in C-V2X. △ Less

Submitted 12 December, 2022; v1 submitted 5 December, 2022; originally announced December 2022.

Comments: 6pagaes,5 figures, globecom 2022

arXiv:2212.00330

Reliable Joint Segmentation of Retinal Edema Lesions in OCT Images

Authors: Meng Wang, Kai Yu, Chun-Mei Feng, Ke Zou, Yanyu Xu, Qingquan Meng, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract: Focusing on the complicated pathological features, such as blurred boundaries, severe scale differences between symptoms, background noise interference, etc., in the task of retinal edema lesions joint segmentation from OCT images and enabling the segmentation results more reliable. In this paper, we propose a novel reliable multi-scale wavelet-enhanced transformer network, which can provide accur… ▽ More Focusing on the complicated pathological features, such as blurred boundaries, severe scale differences between symptoms, background noise interference, etc., in the task of retinal edema lesions joint segmentation from OCT images and enabling the segmentation results more reliable. In this paper, we propose a novel reliable multi-scale wavelet-enhanced transformer network, which can provide accurate segmentation results with reliability assessment. Specifically, aiming at improving the model's ability to learn the complex pathological features of retinal edema lesions in OCT images, we develop a novel segmentation backbone that integrates a wavelet-enhanced feature extractor network and a multi-scale transformer module of our newly designed. Meanwhile, to make the segmentation results more reliable, a novel uncertainty segmentation head based on the subjective logical evidential theory is introduced to generate the final segmentation results with a corresponding overall uncertainty evaluation score map. We conduct comprehensive experiments on the public database of AI-Challenge 2018 for retinal edema lesions segmentation, and the results show that our proposed method achieves better segmentation accuracy with a high degree of reliability as compared to other state-of-the-art segmentation approaches. The code will be released on: https://github.com/LooKing9218/ReliableRESeg. △ Less

Submitted 1 January, 2024; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: Improving algorithm

arXiv:2211.09496 [pdf, other]

EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

Authors: Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Abstract: Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a p… ▽ More Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and \textit{Neutral} is set to $α$ and $1-α$ respectively. The $α$ here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process. △ Less

Submitted 16 February, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: Accepted to ICASSP2023

arXiv:2211.04304 [pdf, other]

BER: Balanced Error Rate For Speaker Diarization

Authors: Tao Liu, Kai Yu

Abstract: DER is the primary metric to evaluate diarization performance while facing a dilemma: the errors in short utterances or segments tend to be overwhelmed by longer ones. Short segments, e.g., `yes' or `no,' still have semantic information. Besides, DER overlooks errors in less-talked speakers. Although JER balances speaker errors, it still suffers from the same dilemma. Considering all those aspects… ▽ More DER is the primary metric to evaluate diarization performance while facing a dilemma: the errors in short utterances or segments tend to be overwhelmed by longer ones. Short segments, e.g., `yes' or `no,' still have semantic information. Besides, DER overlooks errors in less-talked speakers. Although JER balances speaker errors, it still suffers from the same dilemma. Considering all those aspects, duration error, segment error, and speaker-weighted error constituting a complete diarization evaluation, we propose a Balanced Error Rate (BER) to evaluate speaker diarization. First, we propose a segment-level error rate (SER) via connected sub-graphs and adaptive IoU threshold to get accurate segment matching. Second, to evaluate diarization in a unified way, we adopt a speaker-specific harmonic mean between duration and segment, followed by a speaker-weighted average. Third, we analyze our metric via the modularized system, EEND, and the multi-modal method on real datasets. SER and BER are publicly available at https://github.com/X-LANCE/BER. △ Less

Submitted 8 November, 2022; originally announced November 2022.

Comments: 5 pages, 2 figures

arXiv:2208.01221 [pdf]

Generative Adversarial Learning for Intelligent Trust Management in 6G Wireless Networks

Authors: Liu Yang, Yun Li, Simon X. Yang, Yinzhi Lu, Tan Guo, Keping Yu

Abstract: Emerging six generation (6G) is the integration of heterogeneous wireless networks, which can seamlessly support anywhere and anytime networking. But high Quality-of-Trust should be offered by 6G to meet mobile user expectations. Artificial intelligence (AI) is considered as one of the most important components in 6G. Then AI-based trust management is a promising paradigm to provide trusted and re… ▽ More Emerging six generation (6G) is the integration of heterogeneous wireless networks, which can seamlessly support anywhere and anytime networking. But high Quality-of-Trust should be offered by 6G to meet mobile user expectations. Artificial intelligence (AI) is considered as one of the most important components in 6G. Then AI-based trust management is a promising paradigm to provide trusted and reliable services. In this article, a generative adversarial learning-enabled trust management method is presented for 6G wireless networks. Some typical AI-based trust management schemes are first reviewed, and then a potential heterogeneous and intelligent 6G architecture is introduced. Next, the integration of AI and trust management is developed to optimize the intelligence and security. Finally, the presented AI-based trust management method is applied to secure clustering to achieve reliable and real-time communications. Simulation results have demonstrated its excellent performance in guaranteeing network security and service quality. △ Less

Submitted 1 August, 2022; originally announced August 2022.

arXiv:2207.09057 [pdf]

doi 10.1109/TII.2021.3128954

An Intelligent Trust Cloud Management Method for Secure Clustering in 5G enabled Internet of Medical Things

Authors: Liu Yang, Keping Yu, Simon X. Yang, Chinmay Chakraborty, Yinzhi Lu, Tan Guo

Abstract: 5G edge computing enabled Internet of Medical Things (IoMT) is an efficient technology to provide decentralized medical services while Device-to-device (D2D) communication is a promising paradigm for future 5G networks. To assure secure and reliable communication in 5G edge computing and D2D enabled IoMT systems, this paper presents an intelligent trust cloud management method. Firstly, an active… ▽ More 5G edge computing enabled Internet of Medical Things (IoMT) is an efficient technology to provide decentralized medical services while Device-to-device (D2D) communication is a promising paradigm for future 5G networks. To assure secure and reliable communication in 5G edge computing and D2D enabled IoMT systems, this paper presents an intelligent trust cloud management method. Firstly, an active training mechanism is proposed to construct the standard trust clouds. Secondly, individual trust clouds of the IoMT devices can be established through fuzzy trust inferring and recommending. Thirdly, a trust classification scheme is proposed to determine whether an IoMT device is malicious. Finally, a trust cloud update mechanism is presented to make the proposed trust management method adaptive and intelligent under an open wireless medium. Simulation results demonstrate that the proposed method can effectively address the trust uncertainty issue and improve the detection accuracy of malicious devices. △ Less

Submitted 19 July, 2022; originally announced July 2022.

arXiv:2207.08226 [pdf]

doi 10.1109/TII.2022.3186891

An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things

Authors: Yinzhi Lu, Liu Yang, Simon X. Yang, Qiaozhi Hua, Arun Kumar Sangaiah, Tan Guo, Keping Yu

Abstract: Edge enabled Industrial Internet of Things (IIoT) platform is of great significance to accelerate the development of smart industry. However, with the dramatic increase in real-time IIoT applications, it is a great challenge to support fast response time, low latency, and efficient bandwidth utilization. To address this issue, Time Sensitive Network (TSN) is recently researched to realize low late… ▽ More Edge enabled Industrial Internet of Things (IIoT) platform is of great significance to accelerate the development of smart industry. However, with the dramatic increase in real-time IIoT applications, it is a great challenge to support fast response time, low latency, and efficient bandwidth utilization. To address this issue, Time Sensitive Network (TSN) is recently researched to realize low latency communication via deterministic scheduling. To the best of our knowledge, the combinability of multiple flows, which can significantly affect the scheduling performance, has never been systematically analyzed before. In this article, we first analyze the combinability problem. Then a non-collision theory based deterministic scheduling (NDS) method is proposed to achieve ultra-low latency communication for the time-sensitive flows. Moreover, to improve bandwidth utilization, a dynamic queue scheduling (DQS) method is presented for the best-effort flows. Experiment results demonstrate that NDS/DQS can well support deterministic ultra-low latency services and guarantee efficient bandwidth utilization. △ Less

Submitted 17 July, 2022; originally announced July 2022.

arXiv:2207.02957 [pdf, other]

Context-aware Self-supervised Learning for Medical Images Using Graph Neural Network

Authors: Li Sun, Ke Yu, Kayhan Batmanghelich

Abstract: Although self-supervised learning enables us to bootstrap the training by exploiting unlabeled data, the generic self-supervised methods for natural images do not sufficiently incorporate the context. For medical images, a desirable method should be sensitive enough to detect deviation from normal-appearing tissue of each anatomical region; here, anatomy is the context. We introduce a novel approa… ▽ More Although self-supervised learning enables us to bootstrap the training by exploiting unlabeled data, the generic self-supervised methods for natural images do not sufficiently incorporate the context. For medical images, a desirable method should be sensitive enough to detect deviation from normal-appearing tissue of each anatomical region; here, anatomy is the context. We introduce a novel approach with two levels of self-supervised representation learning objectives: one on the regional anatomical level and another on the patient-level. We use graph neural networks to incorporate the relationship between different anatomical regions. The structure of the graph is informed by anatomical correspondences between each patient and an anatomical atlas. In addition, the graph representation has the advantage of handling any arbitrarily sized image in full resolution. Experiments on large-scale Computer Tomography (CT) datasets of lung images show that our approach compares favorably to baseline methods that do not account for the context. We use the learned embedding for staging lung tissue abnormalities related to COVID-19. △ Less

Submitted 6 July, 2022; originally announced July 2022.

Comments: Accepted by NeurIPS workshop 2020. arXiv admin note: substantial text overlap with arXiv:2012.06457

arXiv:2205.13160 [pdf, other]

Integration of Blockchain and Edge Computing in Internet of Things: A Survey

Authors: He Xue, Dajiang Chen, Ning Zhang, Hong-Ning Dai, Keping Yu

Abstract: As an important technology to ensure data security, consistency, traceability, etc., blockchain has been increasingly used in Internet of Things (IoT) applications. The integration of blockchain and edge computing can further improve the resource utilization in terms of network, computing, storage, and security. This paper aims to present a survey on the integration of blockchain and edge computin… ▽ More As an important technology to ensure data security, consistency, traceability, etc., blockchain has been increasingly used in Internet of Things (IoT) applications. The integration of blockchain and edge computing can further improve the resource utilization in terms of network, computing, storage, and security. This paper aims to present a survey on the integration of blockchain and edge computing. In particular, we first give an overview of blockchain and edge computing. We then present a general architecture of an integration of blockchain and edge computing system. We next study how to utilize blockchain to benefit edge computing, as well as how to use edge computing to benefit blockchain. We also discuss the issues brought by the integration of blockchain and edge computing system and solutions from perspectives of resource management, joint optimization, data management, computation offloading and security mechanism. Finally, we analyze and summarize the existing challenges posed by the integration of blockchain and edge computing system and the potential solutions in the future. △ Less

Submitted 26 May, 2022; originally announced May 2022.

arXiv:2205.05357 [pdf, other]

Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning

Authors: Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu

Abstract: Automated audio captioning (AAC), a task that mimics human perception as well as innovatively links audio processing and natural language processing, has overseen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events and the temporal relationships between sound events and describing these elements with a fluent sentence. Currently, an encode… ▽ More Automated audio captioning (AAC), a task that mimics human perception as well as innovatively links audio processing and natural language processing, has overseen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events and the temporal relationships between sound events and describing these elements with a fluent sentence. Currently, an encoder-decoder-based deep learning framework is the standard approach to tackle this problem. Plenty of works have proposed novel network architectures and training schemes, including extra guidance, reinforcement learning, audio-text self-supervised learning and diverse or controllable captioning. Effective data augmentation techniques, especially based on large language models are explored. Benchmark datasets and AAC-oriented evaluation metrics also accelerate the improvement of this field. This paper situates itself as a comprehensive survey covering the comparison between AAC and its related tasks, the existing deep learning techniques, datasets, and the evaluation metrics in AAC, with insights provided to guide potential future research directions. △ Less

Submitted 15 November, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Showing 1–50 of 84 results for author: Yu, K