Search | arXiv e-print repository

Faces that Speak: Jointly Synthesising Talking Face and Speech from Text

Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung

Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.

Submitted 16 May, 2024; originally announced May 2024.

Comments: CVPR 2024

arXiv:2404.10643 [pdf, other]

A Calibrated and Automated Simulator for Innovations in 5G

Authors: Conrado Boeira, Antor Hasan, Khaleda Papry, Yue Ju, Zhongwen Zhu, Israat Haque

Abstract: The rise of 5G deployments has created the environment for many emerging technologies to flourish. Self-driving vehicles, Augmented and Virtual Reality, and remote operations are examples of applications that leverage 5G networks' support for extremely low latency, high bandwidth, and increased throughput. However, the complex architecture of 5G hinders innovation due to the lack of accessibility to testbeds or realistic simulators with adequate 5G functionalities. Also, configuring and managing simulators are complex and time consuming. Finally, the lack of adequate representative data hinders the data-driven designs in 5G campaigns. Thus, we calibrated a system-level open-source simulator, Simu5G, following 3GPP guidelines to enable faster innovation in the 5G domain. Furthermore, we developed an API for automatic simulator configuration without knowing the underlying architectural details. Finally, we demonstrate the usage of the calibrated and automated simulator by developing an ML-based anomaly detection in a 5G Radio Access Network (RAN).

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2401.07532 [pdf, other]

Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

Authors: Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng

Abstract: Variational Autoencoders (VAEs) constitute a crucial component of neural symbolic music generation, among which some works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still encounter issues with overly long feature sequences and generated results lack contextual coherence, thus the challenge of modeling long multi-track symbolic music still remains unaddressed. To this end, we propose Multi-view MidiVAE, as one of the pioneers in VAE methods that effectively model and generate long multi-track symbolic music. The Multi-view MidiVAE utilizes the two-dimensional (2-D) representation, OctupleMIDI, to capture relationships among notes while reducing the feature sequences length. Moreover, we focus on instrumental characteristics and harmony as well as global and local information about the musical composition by employing a hybrid variational encoding-decoding strategy to integrate both Track- and Bar-view MidiVAE features. Objective and subjective experimental results on the CocoChorales dataset demonstrate that, compared to the baseline, Multi-view MidiVAE exhibits significant improvements in terms of modeling long multi-track symbolic music.

Submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted by ICASSP 2024

arXiv:2309.07293 [pdf]

GAN-based Algorithm for Efficient Image Inpainting

Authors: Zhengyang Han, Zehao Jiang, Yuan Ju

Abstract: Global pandemic due to the spread of COVID-19 has post challenges in a new dimension on facial recognition, where people start to wear masks. Under such condition, the authors consider utilizing machine learning in image inpainting to tackle the problem, by complete the possible face that is originally covered in mask. In particular, autoencoder has great potential on retaining important, general features of the image as well as the generative power of the generative adversarial network (GAN). The authors implement a combination of the two models, context encoders and explain how it combines the power of the two models and train the model with 50,000 images of influencers faces and yields a solid result that still contains space for improvements. Furthermore, the authors discuss some shortcomings with the model, their possible improvements, as well as some area of study for future investigation for applicative perspective, as well as directions to further enhance and refine the model.

Submitted 13 September, 2023; originally announced September 2023.

Comments: 6 pages, 3 figures

MSC Class: 68U10

Journal ref: The 3rd International Conference on Artificial Intelligence and Computer Engineering(ICAICE 2022)

arXiv:2306.16250 [pdf, other]

MC-SpEx: Towards Effective Speaker Extraction with Multi-Scale Interfusion and Conditional Speaker Modulation

Authors: Jun Chen, Wei Rao, Zilin Wang, Jiuxin Lin, Yukai Ju, Shulin He, Yannan Wang, Zhiyong Wu

Abstract: The previous SpEx+ has yielded outstanding performance in speaker extraction and attracted much attention. However, it still encounters inadequate utilization of multi-scale information and speaker embedding. To this end, this paper proposes a new effective speaker extraction system with multi-scale interfusion and conditional speaker modulation (ConSM), which is called MC-SpEx. First of all, we design the weight-share multi-scale fusers (ScaleFusers) for efficiently leveraging multi-scale information as well as ensuring consistency of the model's feature space. Then, to consider different scale information while generating masks, the multi-scale interactive mask generator (ScaleInterMG) is presented. Moreover, we introduce ConSM module to fully exploit speaker embedding in the speech extractor. Experimental results on the Libri2Mix dataset demonstrate the effectiveness of our improvements and the state-of-the-art performance of our proposed MC-SpEx.

Submitted 28 June, 2023; originally announced June 2023.

Comments: Accepted by InterSpeech 2023

arXiv:2303.07704 [pdf, other]

TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

Authors: Yukai Ju, Jun Chen, Shimin Zhang, Shulin He, Wei Rao, Weixin Zhu, Yannan Wang, Tao Yu, Shidong Shang

Abstract: This paper introduces the Unbeatable Team's submission to the ICASSP 2023 Deep Noise Suppression (DNS) Challenge. We expand our previous work, TEA-PSE, to its upgraded version -- TEA-PSE 3.0. Specifically, TEA-PSE 3.0 incorporates a residual LSTM after squeezed temporal convolution network (S-TCN) to enhance sequence modeling capabilities. Additionally, the local-global representation (LGR) structure is introduced to boost speaker information extraction, and multi-STFT resolution loss is used to effectively capture the time-frequency characteristics of the speech signals. Moreover, retraining methods are employed based on the freeze training strategy to fine-tune the system. According to the official results, TEA-PSE 3.0 ranks 1st in both ICASSP 2023 DNS-Challenge track 1 and track 2.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.06404 [pdf, other]

Multi-Task Sub-Band Network For Deep Residual Echo Suppression

Authors: Jiayao Sun, Dawei Luo, Zhaoxia Li, Jindong Li, Yukai Ju, Yang Li

Abstract: This paper introduces the SWANT team entry to the ICASSP 2023 AEC Challenge. We submit a system that cascades a linear filter with a neural post-filter. Particularly, we adopt sub-band processing to handle full-band signals and shape the network with multi-task learning, where dual signal voice activity detection (DSVAD) and echo estimation are adopted as auxiliary tasks. Moreover, we particularly improve the time frequency convolution module (TFCM) to increase the receptive field using small convolution kernels. Finally, our system has ranked 4th in ICASSP 2023 AEC Challenge Non-personalized track.

Submitted 11 March, 2023; originally announced March 2023.

arXiv:2302.14370 [pdf, other]

CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis

Authors: Ji-Hoon Kim, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim

Abstract: While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap is mainly rooted from the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech which improves the quality of cross-lingual speech by effectively disentangling speaker and language information in the level of acoustic feature space. Specifically, CrossSpeech decomposes the speech generation pipeline into the speaker-independent generator (SIG) and speaker-dependent generator (SDG). The SIG produces the speaker-independent acoustic representation which is not biased to specific speaker distributions. On the other hand, the SDG models speaker-dependent speech variation that characterizes speaker attributes. By handling each information separately, CrossSpeech can obtain disentangled speaker and language representations. From the experiments, we verify that CrossSpeech achieves significant improvements in cross-lingual TTS, especially in terms of speaker similarity to the target speaker.

Submitted 12 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted to ICASSP 2023

arXiv:2212.05439 [pdf, other]

doi 10.1016/j.buildenv.2022.109950

Personalized local heating neutralizing individual, spatial and temporal thermo-physiological variances in extreme cold environments

Authors: Yi Ju, Xinyuan Ju, Hui Zhang, Bin Cao, Bin Liu, Yingxin Zhu

Abstract: In this paper, we investigate the feasibility, robustness and optimization of introducing personal comfort systems (PCS), apparatuses that promises in energy saving and comfort improvement, into a broader range of environments. We report a series of laboratory experiments systematically examining the effect of personalized heating in neutralizing individual, spatial and temporal variations of thermal demands. The experiments were conducted in an artificial climate chamber at -15 degC in order to simulate extreme cold environments. We developed a heating garment with 20 pieces of 20 * 20 cm2 heating cloth (grouped into 9 regions) comprehensively covering human body. Surface temperatures of the garment can be controlled independently, quickly (within 20 seconds), precisely (within 1 degC) and easily (through a tablet) up to 45 degC. Participants were instructed to adjust surface temperatures of each segment to their preferences, with their physiological, psychological and adjustment data collected. We found that active heating could significantly and stably improve thermal satisfaction. The overall TSV and TCV were improved 1.50 and 1.53 during the self-adjustment phase. Preferred heating surface temperatures for different segments varied widely. Further, even for the same segment, individual differences among participants were considerable. Such variances were observed through local heating powers, while unnoticeable among thermal perception votes. In other words, all these various differences could be neutralized given the flexibility in personalized adjustments. Our research reaffirms the paradigm of "adaptive thermal comfort" and will promote innovations on human-centric design for more efficient PCSs.

Submitted 27 December, 2022; v1 submitted 11 December, 2022; originally announced December 2022.

Journal ref: Building and Environment, 109950 (2022)

arXiv:2212.03391 [pdf, other]

doi 10.1109/TSG.2023.3286434

Robo-Chargers: Optimal Operation and Planning of a Robotic Charging System to Alleviate Overstay

Authors: Yi Ju, Teng Zeng, Zaid Allybokus, Scott Moura

Abstract: Charging infrastructure availability is a major concern for plug-in electric vehicle users. Nowadays, the limited public chargers are commonly occupied by vehicles which have already been fully charged. Such phenomenon, known as overstay, hinders other vehicles' accessibility to charging resources. In this paper, we analyze a charging facility innovation to tackle the challenge of overstay, leveraging the idea of Robo-chargers - automated chargers that can rotate in a charging station and proactively plug or unplug plug-in electric vehicles. We formalize an operation model for stations incorporating Fixed-chargers and Robo-chargers. Optimal scheduling can be solved with the recognition of the combinatorial nature of vehicle-charger assignments, charging dynamics, and customer waiting behaviors. Then, with operation model nested, we develop a planning model to guide economical investment on both types of chargers so that the total cost of ownership is minimized. In the planning phase, it further considers charging demand variances and service capacity requirements. In this paper, we provide systematic techno-economical methods to evaluate if introducing Robo-chargers is beneficial given a specific application scenario. Comprehensive sensitivity analysis based on real-world data highlights the advantages of Robo-chargers, especially in a scenario where overstay is severe. Validations also suggest the tractability of operation model and robustness of planning results for real-time application under reasonable model mismatches, uncertainties and disturbances.

Submitted 18 June, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Journal ref: IEEE Transactions on Smart Grid

arXiv:2210.15853 [pdf, other]

Speech Enhancement with Intelligent Neural Homomorphic Synthesis

Authors: Shulin He, Wei Rao, Jinjiang Liu, Jun Chen, Yukai Ju, Xueliang Zhang, Yannan Wang, Shidong Shang

Abstract: Most neural network speech enhancement models ignore speech production mathematical models by directly mapping Fourier transform spectrums or waveforms. In this work, we propose a neural source filter network for speech enhancement. Specifically, we use homomorphic signal processing and cepstral analysis to obtain noisy speech's excitation and vocal tract. Unlike traditional signal processing, we use an attentive recurrent network (ARN) model predicted ratio mask to replace the liftering separation function. Then two convolutional attentive recurrent network (CARN) networks are used to predict the excitation and vocal tract of clean speech, respectively. The system's output is synthesized from the estimated excitation and vocal. Experiments prove that our proposed method performs better, with SI-SNR improving by 1.363dB compared to FullSubNet.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023
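
The abstract above reports its gain in SI-SNR (scale-invariant signal-to-noise ratio). As a rough illustration of how that metric is commonly computed, the sketch below uses the usual zero-mean projection definition. The function name `si_snr`, the `eps` guard, and the synthetic signals are assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Remove the DC offset so the measure ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; this makes the score
    # insensitive to the overall gain of the estimate.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)

# Hypothetical data: a random "clean" signal plus noise at two levels.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = si_snr(clean + 0.1 * noise, clean)
heavy = si_snr(clean + 1.0 * noise, clean)
print(f"mild noise: {mild:.2f} dB, heavy noise: {heavy:.2f} dB")
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is why the metric is often preferred over plain SNR when comparing enhancement systems with different output gains.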

arXiv:2210.15849 [pdf, ps, other]

Hierarchical speaker representation for target speaker extraction

Authors: Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang, Xueliang Zhang

Abstract: Target speaker extraction aims to isolate a specific speaker's voice from a composite of multiple sound sources, guided by an enrollment utterance or called anchor. Current methods predominantly derive speaker embeddings from the anchor and integrate them into the separation network to separate the voice of the target speaker. However, the representation of the speaker embedding is too simplistic, often being merely a 1*1024 vector. This dense information makes it difficult for the separation network to harness effectively. To address this limitation, we introduce a pioneering methodology called Hierarchical Representation (HR) that seamlessly fuses anchor data across granular and overarching 5 layers of the separation network, enhancing the precision of target extraction. HR amplifies the efficacy of anchors to improve target speaker isolation. On the Libri-2talker dataset, HR substantially outperforms state-of-the-art time-frequency domain techniques. Further demonstrating HR's capabilities, we achieved first place in the prestigious ICASSP 2023 Deep Noise Suppression Challenge. The proposed HR methodology shows great promise for advancing target speaker extraction through enhanced anchor utilization.

Submitted 4 January, 2024; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted to ICASSP 2024

arXiv:2210.03027 [pdf, other]

AnimeTAB: A new guitar tablature dataset of anime and game music

Authors: Yuecheng Zhou, Yaolong Ju, Lingyun Xie

Abstract: While guitar tablature has become a popular topic in MIR research, there exists no such a guitar tablature dataset that focuses on the soundtracks of anime and video games, which have a surprisingly broad and growing audience among the youths. In this paper, we present AnimeTAB, a fingerstyle guitar tablature dataset in MusicXML format, which provides more high-quality guitar tablature for both researchers and guitar players. AnimeTAB contains 412 full tracks and 547 clips, the latter are annotated with musical structures (intro, verse, chorus, and bridge). An accompanying analysis toolkit, TABprocessor, is included to further facilitate its use. This includes functions for melody and bassline extraction, key detection, and chord labeling, which are implemented using rule-based algorithms. We evaluated each of these functions against a manually annotated ground truth. Finally, as an example, we performed a music and technique analysis of AnimeTAB using TABprocessor. Our data and code have been made publicly available for composers, performers, and music information retrieval (MIR) researchers alike.

Submitted 6 October, 2022; originally announced October 2022.

arXiv:2209.12565 [pdf, other]

An Efficient Implementation for Spatial-Temporal Gaussian Process Regression and Its Applications

Authors: Junpeng Zhang, Yue Ju, Biqiang Mu, Renxin Zhong, Tianshi Chen

Abstract: Spatial-temporal Gaussian process regression is a popular method for spatial-temporal data modeling. Its state-of-art implementation is based on the state-space model realization of the spatial-temporal Gaussian process and its corresponding Kalman filter and smoother, and has computational complexity $\mathcal{O}(NM^3)$, where $N$ and $M$ are the number of time instants and spatial input locations, respectively, and thus can only be applied to data with large $N$ but relatively small $M$. In this paper, our primary goal is to show that by exploring the Kronecker structure of the state-space model realization of the spatial-temporal Gaussian process, it is possible to further reduce the computational complexity to $\mathcal{O}(M^3+NM^2)$ and thus the proposed implementation can be applied to data with large $N$ and moderately large $M$. The proposed implementation is illustrated over applications in weather data prediction and spatially-distributed system identification. Our secondary goal is to design a kernel for both the Colorado precipitation data and the GHCN temperature data, such that while having more efficient implementation, better prediction performance can also be achieved than the state-of-art result.

Submitted 26 September, 2022; originally announced September 2022.
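
The abstract above attributes its complexity reduction to Kronecker structure. The sketch below is not the paper's Kalman-filter implementation; it only illustrates the generic identity that makes Kronecker structure cheap to exploit, namely that a product with `A ⊗ B` can be evaluated from the small factors without ever forming the large matrix. The function name `kron_matvec` and the matrix sizes are hypothetical.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without building the Kronecker product.

    With NumPy's row-major reshape, (A kron B) @ vec(X) equals vec(A @ X @ B.T),
    where X has shape (A.shape[1], B.shape[1]).
    """
    X = x.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

# Hypothetical sizes: a small "temporal" factor and a small "spatial" factor.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((4, 4))
x = rng.standard_normal(6 * 4)

fast = kron_matvec(A, B, x)      # works on the 6x6 and 4x4 factors only
slow = np.kron(A, B) @ x         # forms the full 24x24 matrix
print(np.allclose(fast, slow))
```

The dense route needs memory and time proportional to the square of the product of the factor sizes, while the factored route only ever touches the two small matrices, which is the kind of saving that structured implementations rely on.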

arXiv:2209.12231 [pdf, other]

Asymptotic Theory for Regularized System Identification Part I: Empirical Bayes Hyper-parameter Estimator

Authors: Yue Ju, Biqiang Mu, Lennart Ljung, Tianshi Chen

Abstract: Regularized system identification is the major advance in system identification in the last decade. Although many promising results have been achieved, it is far from complete and there are still many key problems to be solved. One of them is the asymptotic theory, which is about convergence properties of the model estimators as the sample size goes to infinity. The existing related results for regularized system identification are about the almost sure convergence of various hyper-parameter estimators. A common problem of those results is that they do not contain information on the factors that affect the convergence properties of those hyper-parameter estimators, e.g., the regression matrix. In this paper, we tackle problems of this kind for the regularized finite impulse response model estimation with the empirical Bayes (EB) hyper-parameter estimator and filtered white noise input. In order to expose and find those factors, we study the convergence in distribution of the EB hyper-parameter estimator, and the asymptotic distribution of its corresponding model estimator. For illustration, we run Monte Carlo simulations to show the efficacy of our obtained theoretical results.

Submitted 4 April, 2023; v1 submitted 25 September, 2022; originally announced September 2022.

arXiv:2205.15195 [pdf, other]

Personalized Acoustic Echo Cancellation for Full-duplex Communications

Authors: Shimin Zhang, Ziteng Wang, Yukai Ju, Yihui Fu, Yueyue Na, Qiang Fu, Lei Xie

Abstract: Deep neural networks (DNNs) have shown promising results for acoustic echo cancellation (AEC). But the DNN-based AEC models let through all near-end speakers including the interfering speech. In light of recent studies on personalized speech enhancement, we investigate the feasibility of personalized acoustic echo cancellation (PAEC) in this paper for full-duplex communications, where background noise and interfering speakers may coexist with acoustic echoes. Specifically, we first propose a novel backbone neural network termed as gated temporal convolutional neural network (GTCNN) that outperforms state-of-the-art AEC models in performance. Speaker embeddings like d-vectors are further adopted as auxiliary information to guide the GTCNN to focus on the target speaker. A special case in PAEC is that speech snippets of both parties on the call are enrolled. Experimental results show that auxiliary information from either the near-end speaker or the far-end speaker can improve the DNN-based AEC performance. Nevertheless, there is still much room for improvement in the utilization of the finite-dimensional speaker embeddings.

Submitted 29 June, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: submitted to INTERSPEECH 22

arXiv:2112.10319 [pdf, ps, other]

Tutorial on Asymptotic Properties of Regularized Least Squares Estimator for Finite Impulse Response Model

Authors: Yue Ju, Tianshi Chen, Biqiang Mu, Lennart Ljung

Abstract: In this paper, we give a tutorial on asymptotic properties of the Least Square (LS) and Regularized Least Squares (RLS) estimators for the finite impulse response model with filtered white noise inputs. We provide three perspectives: the almost sure convergence, the convergence in distribution and the boundedness in probability. On one hand, these properties deepen our understanding of the LS and RLS estimators. On the other hand, we can use them as tools to investigate asymptotic properties of other estimators, such as various hyper-parameter estimators.

Submitted 30 December, 2021; v1 submitted 19 December, 2021; originally announced December 2021.
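
For readers unfamiliar with the two estimators named in the abstract above, the following is a minimal sketch of LS and RLS estimation for a finite impulse response model driven by filtered white noise. Everything concrete here is an assumption for illustration: the AR(1) input filter, the true impulse response, the noise level, and the choice of a TC (tuned/correlated) kernel as the regularization matrix `P` are not taken from the paper.

```python
import numpy as np

def fir_regression_matrix(u, n):
    """Rows are [u(t-1), ..., u(t-n)] for t = n, ..., len(u) - 1."""
    N = len(u)
    return np.column_stack([u[n - k: N - k] for k in range(1, n + 1)])

def tc_kernel(n, c=1.0, lam=0.8):
    """TC kernel P[i, j] = c * lam**max(i, j) for i, j = 1..n (assumed choice)."""
    idx = np.arange(1, n + 1)
    return c * lam ** np.maximum.outer(idx, idx)

def rls_estimate(Phi, Y, P, sigma2):
    """RLS estimate (Phi'Phi + sigma2 * inv(P))^{-1} Phi'Y.

    Rewritten as (P Phi'Phi + sigma2 I)^{-1} P Phi'Y to avoid inverting P.
    """
    n = Phi.shape[1]
    return np.linalg.solve(P @ Phi.T @ Phi + sigma2 * np.eye(n), P @ Phi.T @ Y)

rng = np.random.default_rng(0)
N, n, sigma = 5000, 10, 0.1
g_true = 0.8 ** np.arange(1, n + 1)       # hypothetical impulse response

# Filtered white noise input: white noise passed through a first-order filter.
e = rng.standard_normal(N)
u = np.zeros(N)
for t in range(1, N):
    u[t] = 0.5 * u[t - 1] + e[t]

Phi = fir_regression_matrix(u, n)
Y = Phi @ g_true + sigma * rng.standard_normal(N - n)

theta_ls = np.linalg.lstsq(Phi, Y, rcond=None)[0]
theta_rls = rls_estimate(Phi, Y, tc_kernel(n), sigma**2)
print("max LS error: ", np.max(np.abs(theta_ls - g_true)))
print("max RLS error:", np.max(np.abs(theta_rls - g_true)))
```

With this many samples both estimates land close to the true coefficients, and setting `sigma2` to zero in `rls_estimate` recovers the LS solution, which gives a small numerical feel for the large-sample behaviour that the tutorial studies analytically.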

arXiv:2110.07840 [pdf, other]

ESPnet2-TTS: Extending the Edge of TTS Research

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

Submitted 14 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

arXiv:2003.13435 [pdf, other]

Supplementary Material for CDC Submission No. 1461

Authors: Yue Ju, Tianshi Chen, Biqiang Mu, Lennart Ljung

Abstract: In this paper, we focus on how the condition number of the regression matrix influences the comparison between two hyper-parameter estimation methods: the empirical Bayes (EB) estimator and Stein's unbiased estimator with respect to the mean square error (MSE) related to output prediction (SUREy). We first show that the greatest power of the condition number of the regression matrix in the upper bound on the convergence rate of the SUREy cost function is always one larger than that in the corresponding upper bound for the EB cost function. Meanwhile, the EB and SUREy hyper-parameter estimators are both proved to be asymptotically normally distributed under suitable conditions. In addition, a ridge regression case is further investigated to show that when the condition number of the regression matrix goes to infinity, the asymptotic variance of the SUREy estimator tends to be larger than that of the EB estimator.

Submitted 21 April, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

arXiv:1906.11330 [pdf, ps, other]

Sparsity-Assisted Signal Denoising and Pattern Recognition in Time-Series Data

Authors: G. V. Prateek, Yo-El Ju, Arye Nehorai

Abstract: We address the problem of signal denoising and pattern recognition in processing batch-mode time-series data by combining linear time-invariant filters, orthogonal multiresolution representations, and sparsity-based methods. We propose a novel approach to designing higher-order zero-phase low-pass, high-pass, and band-pass infinite impulse response filters as matrices, using spectral transformation of the state-space representation of digital filters. We also propose a proximal gradient-based technique to factorize a special class of zero-phase high-pass and band-pass digital filters so that the factorization product preserves the zero-phase property of the filter and also incorporates a sparse-derivative component of the input in the signal model. To demonstrate applications of our novel filter designs, we validate and propose new signal models to simultaneously denoise and identify patterns of interest. We begin by using our proposed filter design to test an existing signal model that simultaneously combines linear time invariant (LTI) filters and sparsity-based methods. We develop a new signal model called sparsity-assisted signal denoising (SASD) by combining our proposed filter designs with the existing signal model. Thereafter, we propose and derive a new signal model called sparsity-assisted pattern recognition (SAPR). In SAPR, we combine LTI band-pass filters and sparsity-based methods with orthogonal multiresolution representations, such as wavelets, to detect specific patterns in the input signal. Finally, we combine the signal denoising and pattern recognition tasks, and derive a new signal model called the sparsity-assisted signal denoising and pattern recognition (SASDPR). We illustrate the capabilities of the SAPR and SASDPR frameworks using sleep-electroencephalography data to detect K-complexes and sleep spindles, respectively.

Submitted 26 June, 2019; originally announced June 2019.

Comments: 22 pages, 16 figures, submitted to IEEE Transactions on Signal Processing

Showing 1–20 of 20 results for author: Ju, Y