- research-article, November 2024
IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5051–5064, https://doi.org/10.1109/TASLP.2024.3507560
Extracting direct-path spatial feature is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates direct-path inter-channel phase difference (DP-IPD) of sound sources from ...
- research-article, November 2024
Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5106–5116, https://doi.org/10.1109/TASLP.2024.3507568
In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance ...
- research-article, November 2024
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5092–5105, https://doi.org/10.1109/TASLP.2024.3507566
Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. ...
- research-article, November 2024
An Interpretable Deep Mutual Information Curriculum Metric for a Robust and Generalized Speech Emotion Recognition System
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5117–5130, https://doi.org/10.1109/TASLP.2024.3507562
It is difficult to achieve robust and well-generalized models for tasks involving subjective concepts such as emotion. It is inevitable to deal with noisy labels, given the ambiguous nature of human perception. Methodologies relying on semi-...
- research-article, November 2024
Online Neural Speaker Diarization With Target Speaker Tracking
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5078–5091, https://doi.org/10.1109/TASLP.2024.3507559
This paper proposes an online target speaker voice activity detection (TS-VAD) system for speaker diarization tasks that does not rely on prior knowledge from clustering-based diarization systems to obtain target speaker embeddings. By adapting ...
- research-article, November 2024
CLAPSep: Leveraging Contrastive Pre-Trained Model for Multi-Modal Query-Conditioned Target Sound Extraction
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4945–4960, https://doi.org/10.1109/TASLP.2024.3497586
Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user ...
- research-article, November 2024
Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5024–5039, https://doi.org/10.1109/TASLP.2024.3496317
The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of ...
- research-article, November 2024
$\mathcal{P}$owMix: A Versatile Regularizer for Multimodal Sentiment Analysis
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 5010–5023, https://doi.org/10.1109/TASLP.2024.3496316
Multimodal sentiment analysis (MSA) leverages heterogeneous data sources to interpret the complex nature of human sentiments. Despite significant progress in multimodal architecture design, the field lacks comprehensive regularization methods. This paper ...
- research-article, November 2024
Enhancing Robustness of Speech Watermarking Using a Transformer-Based Framework Exploiting Acoustic Features
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4822–4837, https://doi.org/10.1109/TASLP.2024.3486206
Digital watermarking serves as an effective approach for safeguarding speech signal copyrights, achieved by the incorporation of ownership information into the original signal and its subsequent extraction from the watermarked signal. While traditional ...
- research-article, November 2024
FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4905–4918, https://doi.org/10.1109/TASLP.2024.3492796
In recent years, the field of audio deepfake detection has witnessed significant advancements. Nonetheless, the majority of solutions have concentrated on high-quality audio, largely overlooking the challenge of low-quality compressed audio in real-world ...
- research-article, November 2024
TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4999–5009, https://doi.org/10.1109/TASLP.2024.3492803
We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, ...
- research-article, November 2024
FlowHash: Accelerating Audio Search With Balanced Hashing via Normalizing Flow
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4961–4970, https://doi.org/10.1109/TASLP.2024.3486227
Nearest neighbor search on context representation vectors is a formidable task due to challenges posed by high dimensionality, scalability issues, and potential noise within query vectors. Our novel approach leverages normalizing flow within a self-...
- research-article, October 2024
MRC-PASCL: A Few-Shot Machine Reading Comprehension Approach via Post-Training and Answer Span-Oriented Contrastive Learning
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4838–4849, https://doi.org/10.1109/TASLP.2024.3490373
The rapid development of pre-trained language models (PLMs) has significantly enhanced the performance of machine reading comprehension (MRC). Nevertheless, the traditional fine-tuning approaches necessitate extensive labeled data. MRC remains a ...
- research-article, October 2024
DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4850–4866, https://doi.org/10.1109/TASLP.2024.3488564
In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations, ...
- research-article, October 2024
WEDA: Exploring Copyright Protection for Large Language Model Downstream Alignment
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4755–4767, https://doi.org/10.1109/TASLP.2024.3487419
Large Language Models (LLMs) have shown incomparable representation and generalization capabilities, which have led to significant advancements in Natural Language Processing (NLP). Before deployment, the pre-trained LLMs often need to be tailored to ...
- research-article, October 2024
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4931–4944, https://doi.org/10.1109/TASLP.2024.3487410
Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages ...
- research-article, October 2024
Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4676–4690, https://doi.org/10.1109/TASLP.2024.3485551
Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed ...
- research-article, October 2024
EchoScan: Scanning Complex Room Geometries via Acoustic Echoes
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4768–4782, https://doi.org/10.1109/TASLP.2024.3485516
Accurate estimation of indoor space geometries is vital for constructing precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This ...
- research-article, October 2024
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4741–4754, https://doi.org/10.1109/TASLP.2024.3485500
Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider ...
- research-article, October 2024
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), Volume 32, Pages 4700–4712, https://doi.org/10.1109/TASLP.2024.3485485
Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of generation tasks. Text-to-Audio (TTA), a burgeoning generation application designed to generate audio from natural language prompts, is ...