-
Accompanied Singing Voice Synthesis with Fully Text-controlled Melody
Authors:
Ruiqi Li,
Zhiqing Hong,
Yongqi Wang,
Lichao Zhang,
Rongjie Huang,
Siqi Zheng,
Zhou Zhao
Abstract:
Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.
Submitted 2 July, 2024;
originally announced July 2024.
-
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
Authors:
Ruiqi Li,
Rongjie Huang,
Yongqi Wang,
Zhiqing Hong,
Zhou Zhao
Abstract:
The speech-to-singing voice conversion (STS) task suffers from data scarcity because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training model. We leverage spoken language model techniques to tackle the rhythm alignment problem and the in-context learning capability to achieve zero-shot conversion. We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity. SVPT also serves as an effective backbone for singing voice synthesis (SVS), offering insights into scaling up SVS models. Experimental results indicate that SVPT delivers notable improvements in both STS and SVS. Audio samples are available at https://speech2sing.github.io.
Submitted 4 June, 2024;
originally announced June 2024.
-
Robust Singing Voice Transcription Serves Synthesis
Authors:
Ruiqi Li,
Yu Zhang,
Yongqi Wang,
Zhiqing Hong,
Rongjie Huang,
Zhou Zhao
Abstract:
Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the capability for practical application. Audio samples are available at https://rosvot.github.io.
Submitted 3 June, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Detection of Peri-Pancreatic Edema using Deep Learning and Radiomics Techniques
Authors:
Ziliang Hong,
Debesh Jha,
Koushik Biswas,
Zheyuan Zhang,
Yury Velichko,
Cemal Yazici,
Temel Tirkes,
Amir Borhani,
Baris Turkbey,
Alpay Medetalibeyoglu,
Gorkem Durak,
Ulas Bagci
Abstract:
Peri-pancreatic edema is a pivotal indicator of disease progression and prognosis, which makes its accurate detection and assessment critical in pancreatitis diagnosis and management. This study introduces a novel CT dataset sourced from 255 patients with pancreatic diseases, featuring annotated pancreas segmentation masks and corresponding diagnostic labels for peri-pancreatic edema condition. With the novel dataset, we first evaluate the efficacy of the LinTransUNet model, a linear Transformer based segmentation algorithm, to segment the pancreas accurately from CT imaging data. Then, we use the segmented pancreas regions with two distinct machine learning classifiers to identify the existence of peri-pancreatic edema: deep learning-based models and a radiomics-based eXtreme Gradient Boosting (XGBoost) model. LinTransUNet achieved promising results, with a Dice coefficient of 80.85% and mIoU of 68.73%. Among the nine benchmarked classification models for peri-pancreatic edema detection, the Swin-Tiny transformer model demonstrated the highest recall of 98.85 ± 0.42 and precision of 98.38 ± 0.17. Comparatively, the radiomics-based XGBoost model achieved an accuracy of 79.61 ± 4.04 and recall of 91.05 ± 3.28, showcasing its potential as a supplementary diagnostic tool given its rapid processing speed and reduced training time. Our code is available at https://github.com/NUBagciLab/Peri-Pancreatic-Edema-Detection.
Submitted 25 April, 2024;
originally announced April 2024.
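The entry above reports segmentation quality as a Dice coefficient and a mean IoU. As a reading aid, here is a minimal sketch of how these two overlap metrics are commonly computed for a single pair of binary masks. The function name `dice_and_iou` and the flat 0/1 list representation are assumptions made for illustration; this is not the authors' evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks.

    Masks are flat sequences of 0/1 values (a hypothetical representation
    chosen to keep the sketch dependency-free).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a
        # convention assumed here, not something stated in the abstract.
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy example: two 6-pixel masks that overlap on 2 pixels.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Values averaged over many cases, such as those reported in the abstract, need not satisfy this identity exactly.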
-
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
Authors:
Zhiqing Hong,
Rongjie Huang,
Xize Cheng,
Yongqi Wang,
Ruiqi Li,
Fuming You,
Zhou Zhao,
Zhimeng Zhang
Abstract:
A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
Submitted 20 May, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
-
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Authors:
Yongqi Wang,
Ruofan Hu,
Rongjie Huang,
Zhiqing Hong,
Ruiqi Li,
Wenrui Liu,
Fuming You,
Tao Jin,
Zhou Zhao
Abstract:
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
Submitted 9 July, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Authors:
Yongqi Wang,
Jionghao Bai,
Rongjie Huang,
Ruiqi Li,
Zhiqing Hong,
Zhou Zhao
Abstract:
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speech with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/.
Submitted 14 September, 2023;
originally announced September 2023.
-
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Authors:
Rongjie Huang,
Mingze Li,
Dongchao Yang,
Jiatong Shi,
Xuankai Chang,
Zhenhui Ye,
Yuning Wu,
Zhiqing Hong,
Jiawei Huang,
Jinglin Liu,
Yi Ren,
Zhou Zhao,
Shinji Watanabe
Abstract:
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs in terms of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.
Submitted 25 April, 2023;
originally announced April 2023.
-
Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception from Monocular Video
Authors:
Ziyang Hong,
C. Patrick Yue
Abstract:
We present a novel real-time capable learning method that jointly perceives a 3D scene's geometric structure and semantic labels. Recent approaches to real-time 3D scene reconstruction mostly adopt a volumetric scheme, where a Truncated Signed Distance Function (TSDF) is directly regressed. However, these volumetric approaches tend to focus on the global coherence of their reconstructions, which leads to a lack of local geometric detail. To overcome this issue, we propose to leverage the latent geometric prior knowledge in 2D image features by explicit depth prediction and anchored feature generation, to refine the occupancy learning in the TSDF volume. Besides, we find that this cross-dimensional feature refinement methodology can also be adopted for the semantic segmentation task by utilizing semantic priors. Hence, we propose an end-to-end cross-dimensional refinement neural network (CDRNet) to extract both 3D mesh and 3D semantic labeling in real time. Experimental results show that this method achieves state-of-the-art 3D perception efficiency on multiple datasets, which indicates the great potential of our method for industrial applications.
Submitted 10 September, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Bayesian Optimization-Based Beam Alignment for MmWave MIMO Communication Systems
Authors:
Songjie Yang,
Baojuan Liu,
Zhiqin Hong,
Zhongpei Zhang
Abstract:
Due to the very narrow beams used in millimeter-wave (mmWave) communication, beam alignment (BA) is a critical issue. In this work, we investigate the issue of mmWave BA and present a novel beam alignment scheme on the basis of a machine learning strategy, Bayesian optimization (BO). In this context, we consider the beam alignment issue to be a black box function and then use BO to find the possible optimal beam pair. During the BA procedure, this strategy exploits information from the measured beam pairs to predict the best beam pair. In addition, we suggest a novel BO algorithm based on the gradient boosting regression tree model. The simulation results demonstrate the spectral efficiency performance of our proposed schemes for BA using three different surrogate models. They also demonstrate that the proposed schemes can achieve spectral efficiency with a small overhead when compared to the orthogonal matching pursuit (OMP) algorithm and the Thompson sampling-based multi-armed bandit (TS-MAB) method.
Submitted 28 July, 2022;
originally announced July 2022.
-
Low-complexity Sparse Array Synthesis Based on Off-grid Compressive Sensing
Authors:
Songjie Yang,
Baojuan Liu,
Zhiqin Hong,
Zhongpei Zhang
Abstract:
A novel sparse array synthesis method for non-uniform planar arrays is proposed, which belongs to the class of compressive sensing (CS)-based synthesis methods. In particular, we propose an off-grid refinement technique to simultaneously optimize the antenna element positions and excitations with low complexity, in response to the antenna position optimization problem that is difficult for standard CS. More importantly, we take into account the minimum inter-element spacing constraint to ensure a physically realizable solution. Specifically, the off-grid Orthogonal Matching Pursuit (OMP) algorithm is first proposed with low complexity, and then the off-grid Look Ahead Orthogonal Matching Pursuit (LAOMP) is designed with better synthesis performance but higher complexity. In addition, simulation results show that the proposed schemes have advantages in computational complexity and synthesis performance compared with the related method.
Submitted 28 July, 2022;
originally announced July 2022.
-
DT-SV: A Transformer-based Time-domain Approach for Speaker Verification
Authors:
Nan Zhang,
Jianzong Wang,
Zhenhou Hong,
Chendong Zhao,
Xiaoyang Qu,
Jing Xiao
Abstract:
Speaker verification (SV) aims to determine whether the speaker identity of a test utterance is the same as that of the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, utilizing the original Transformer in SV directly may waste frame-level information on output features, which could restrict the capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. Therein, the diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it could be integrated into the Transformer expediently. Besides, we also introduce a learnable mel-fbank energy feature extractor named time-domain feature extractor that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining the diffluence loss and the time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model can achieve better performance in comparison with other models.
Submitted 26 May, 2022;
originally announced May 2022.
-
Board-level Code-Modulated Embedded Test and Calibration of an X-band Phased-Array Transceiver
Authors:
Zhangjie Hong,
Simon Schönherr,
Vikas Chauhan,
Brian Floyd
Abstract:
We present methods for built-in test and calibration of phased arrays using code-modulated embedded test (CoMET). Our approach employs Cartesian modulation of test signals within each element using existing phase shifters, combining of these signals into an aggregate code-multiplexed response, downconversion and creation of code-modulated element-to-element "interference products" using a built-in power detector, demodulation of correlations from the digitized interference response, and extraction of amplitude and phase per element using an equation solver. Rotated-axis methodology is discussed for accurate extraction of phase near the original 0/90/180/270 degree axes. Our techniques are demonstrated at board level for both receive and transmit modes using an eight-element 8-16 GHz phased array constructed using ADAR1000 chips from ADI. At 6 GHz, CoMET-extracted gain and phase are accurate to within 0.2 dB and 3 degrees compared to network-analyzer measurements. We then employ CoMET in a calibration loop to determine optimum control settings at 6 GHz, outside the 8-16 GHz band for which the array was designed. We achieve seven-bit phase resolution with equalized gain. The root-mean-square gain and phase errors are improved from 0.8 dB and 8 degrees before calibration to 0.1 dB and 1.7 degrees after calibration.
Submitted 11 July, 2021;
originally announced July 2021.
-
EfficientTDNN: Efficient Architecture Search for Speaker Recognition
Authors:
Rui Wang,
Zhihua Wei,
Haoran Duan,
Shouling Ji,
Yang Long,
Zhen Hong
Abstract:
Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown their remarkable capability in learning speaker embedding. However, they also incur a large computational cost in storage size, processing, and memory. Discovering the specialized CNN that meets a specific constraint requires a substantial effort of human experts. Compared with hand-designed approaches, neural architecture search (NAS) appears as a practical technique in automating the manual architecture design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet introduces temporal convolution of different ranges of the receptive field and feature aggregation of various resolutions from different layers to TDNN. On top of it, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which surprisingly reduces computation while handling the vast number of devices with various resource requirements. Experimental results on the VoxCeleb dataset show the proposed EfficientTDNN enables approximately $10^{13}$ architectures concerning depth, kernel, and width. Considering different computation constraints, it achieves a 2.20% equal error rate (EER) with 204M multiply-accumulate operations (MACs), 1.41% EER with 571M MACs, and 0.94% EER with 1.45G MACs. Comprehensive investigations suggest that the trained supernet generalizes to subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.
Submitted 18 June, 2022; v1 submitted 24 March, 2021;
originally announced March 2021.
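The entry above reports results as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from verification scores by sweeping thresholds. The function name `equal_error_rate`, the "accept if score >= threshold" convention, and the toy scores are assumptions for illustration; this is not the authors' evaluation protocol.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from same-speaker and different-speaker trial scores.

    A trial is accepted when its score is >= the threshold (an assumed
    convention). The estimate is taken at the threshold where the false
    acceptance and false rejection rates are closest.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    eer = None
    # Every observed score is a candidate threshold.
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # Impostor trials wrongly accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # Target trials wrongly rejected.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy scores (hypothetical): one target trial scores below one impostor trial.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {equal_error_rate(targets, impostors):.2%}")
```

With finite score lists the two error rates may never be exactly equal, so this sketch averages them at the closest threshold. Interpolating along the error curve is another common choice.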
-
Virtual-to-Real: Learning to Control in Visual Semantic Segmentation
Authors:
Zhang-Wei Hong,
Chen Yu-Ming,
Shih-Yang Su,
Tzu-Yun Shann,
Yi-Hsiang Chang,
Hsuan-Kung Yang,
Brian Hsi-Lin Ho,
Chih-Chieh Tu,
Yueh-Chuan Chang,
Tsu-Ching Hsiao,
Hsin-Wei Hsiao,
Sih-Pin Lai,
Chun-Yi Lee
Abstract:
Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots, and thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of the models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation for relating these two modules. The perception module translates the perceived RGB image to semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent, which performs actions based on the translated image segmentation. Our architecture is evaluated in an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and demonstrates a faster learning curve than them. We also present a detailed analysis for a variety of variant configurations, and validate the transferability of our modular architecture.
Submitted 28 October, 2018; v1 submitted 1 February, 2018;
originally announced February 2018.