Search | arXiv e-print repository

arXiv:2409.16637 [pdf, ps, other]

Deep-Learning Recognition of Scanning Transmission Electron Microscopy: Quantifying and Mitigating the Influence of Gaussian Noises

Authors: Hanlei Zhang, Jincheng Bai, Xiabo Chen, Can Li, Chuanjian Zhong, Jiye Fang, Guangwen Zhou

Abstract: Scanning transmission electron microscopy (STEM) is a powerful tool to reveal the morphologies and structures of materials, thereby attracting intensive interests from the scientific and industrial communities. The outstanding spatial (atomic level) and temporal (ms level) resolutions of the STEM techniques generate fruitful amounts of high-definition data, thereby enabling the high-volume and high-speed analysis of materials. On the other hand, processing of the big dataset generated by STEM is time-consuming and beyond the capability of human-based manual work, which urgently calls for computer-based automation. In this work, we present a deep-learning mask region-based neural network (Mask R-CNN) for the recognition of nanoparticles imaged by STEM, as well as generating the associated dimensional analysis. The Mask R-CNN model was tested on simulated STEM-HAADF results with different Gaussian noises, particle shapes and particle sizes, and the results indicated that Gaussian noise has determining influence on the accuracy of recognition. By applying Gaussian and Non-Local Means filters on the noise-containing STEM-HAADF results, the influences of noises are largely mitigated, and recognition accuracy is significantly improved. This filtering-recognition approach was further applied to experimental STEM-HAADF results, which yields satisfying accuracy compared with the traditional threshold methods. The deep-learning-based method developed in this work has great potentials in analysis of the complicated structures and large data generated by STEM-HAADF.

Submitted 25 September, 2024; originally announced September 2024.

arXiv:2409.13292 [pdf, other]

Exploring Text-Queried Sound Event Detection with Audio Source Separation

Authors: Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

Abstract: In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks corresponding to different events from the input audio. Then, multiple target SED branches are employed to detect individual events. AudioSep is a state-of-the-art LASS model, but has limitations in extracting dynamic audio information because of its pure convolutional structure for separation. To address this, we integrate a dual-path recurrent neural network block into the model. We refer to this structure as AudioSep-DP, which achieves the first place in DCASE 2024 Task 9 on language-queried audio source separation (objective single model track). Experimental results show that TQ-SED can significantly improve the SED performance, with an improvement of 7.22% on F1 score over the conventional framework. Additionally, we setup comprehensive experiments to explore the impact of model complexity. The source code and pre-trained model are released at https://github.com/apple-yinhan/TQ-SED.

Submitted 20 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP2025

arXiv:2409.11964 [pdf, other]

Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction

Authors: Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan

Abstract: In this technical report, we describe the SNTL-NTU team's submission for Task 1 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection and classification of acoustic scenes and events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model to the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracy of (62.21, 59.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over the three systems.

Submitted 18 September, 2024; originally announced September 2024.

Comments: 5 pages, 3 figures

arXiv:2409.11700 [pdf, other]

Real-Time Sound Event Localization and Detection: Deployment Challenges on Edge Devices

Authors: Jun Wei Yeow, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan

Abstract: Sound event localization and detection (SELD) is critical for various real-world applications, including smart monitoring and Internet of Things (IoT) systems. Although deep neural networks (DNNs) represent the state-of-the-art approach for SELD, their significant computational complexity and model sizes present challenges for deployment on resource-constrained edge devices, especially under real-time conditions. Despite the growing need for real-time SELD, research in this area remains limited. In this paper, we investigate the unique challenges of deploying SELD systems for real-world, real-time applications by performing extensive experiments on a commercially available Raspberry Pi 3 edge device. Our findings reveal two critical, often overlooked considerations: the high computational cost of feature extraction and the performance degradation associated with low-latency, real-time inference. This paper provides valuable insights and considerations for future work toward developing more efficient and robust real-time SELD systems.

Submitted 18 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP'25. Code is available at https://github.com/itsjunwei/Realtime-SELD-Edge

arXiv:2409.10980 [pdf]

PSFHS Challenge Report: Pubic Symphysis and Fetal Head Segmentation from Intrapartum Ultrasound Images

Authors: Jieyun Bai, Zihao Zhou, Zhanhong Ou, Gregor Koehler, Raphael Stock, Klaus Maier-Hein, Marawan Elbatel, Robert Martí, Xiaomeng Li, Yaoyang Qiu, Panjie Gou, Gongping Chen, Lei Zhao, Jianxun Zhang, Yu Dai, Fangyijie Wang, Guénolé Silvestre, Kathleen Curran, Hongkun Sun, Jing Xu, Pengzhou Cai, Lu Jiang, Libin Lan, Dong Ni, Mei Zhong, et al. (4 additional authors not shown)

Abstract: Segmentation of the fetal and maternal structures, particularly intrapartum ultrasound imaging as advocated by the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) for monitoring labor progression, is a crucial first step for quantitative diagnosis and clinical decision-making. This requires specialized analysis by obstetrics professionals, in a task that i) is highly time- and cost-consuming and ii) often yields inconsistent results. The utility of automatic segmentation algorithms for biometry has been proven, though existing results remain suboptimal. To push forward advancements in this area, the Grand Challenge on Pubic Symphysis-Fetal Head Segmentation (PSFHS) was held alongside the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). This challenge aimed to enhance the development of automatic segmentation algorithms at an international scale, providing the largest dataset to date with 5,101 intrapartum ultrasound images collected from two ultrasound machines across three hospitals from two institutions. The scientific community's enthusiastic participation led to the selection of the top 8 out of 179 entries from 193 registrants in the initial phase to proceed to the competition's second stage. These algorithms have elevated the state-of-the-art in automatic PSFHS from intrapartum ultrasound images. A thorough analysis of the results pinpointed ongoing challenges in the field and outlined recommendations for future work. The top solutions and the complete dataset remain publicly available, fostering further advancements in automatic segmentation and biometry for intrapartum ultrasound imaging.

Submitted 17 September, 2024; originally announced September 2024.

arXiv:2409.09754 [pdf, other]

Towards Single-Lens Controllable Depth-of-Field Imaging via All-in-Focus Aberration Correction and Monocular Depth Estimation

Authors: Xiaolong Qian, Qi Jiang, Yao Gao, Shaohua Gao, Zhonghua Yi, Lei Sun, Kai Wei, Haifeng Li, Kailun Yang, Kaiwei Wang, Jian Bai

Abstract: Controllable Depth-of-Field (DoF) imaging commonly produces amazing visual effects based on heavy and expensive high-end lenses. However, confronted with the increasing demand for mobile scenarios, it is desirable to achieve a lightweight solution with Minimalist Optical Systems (MOS). This work centers around two major limitations of MOS, i.e., the severe optical aberrations and uncontrollable DoF, for achieving single-lens controllable DoF imaging via computational methods. A Depth-aware Controllable DoF Imaging (DCDI) framework is proposed equipped with All-in-Focus (AiF) aberration correction and monocular depth estimation, where the recovered image and corresponding depth map are utilized to produce imaging results under diverse DoFs of any high-end lens via patch-wise convolution. To address the depth-varying optical degradation, we introduce a Depth-aware Degradation-adaptive Training (DA2T) scheme. At the dataset level, a Depth-aware Aberration MOS (DAMOS) dataset is established based on the simulation of Point Spread Functions (PSFs) under different object distances. Additionally, we design two plug-and-play depth-aware mechanisms to embed depth information into the aberration image recovery for better tackling depth-aware degradation. Furthermore, we propose a storage-efficient Omni-Lens-Field model to represent the 4D PSF library of various lenses. With the predicted depth map, recovered image, and depth-aware PSF map inferred by Omni-Lens-Field, single-lens controllable DoF imaging is achieved. Comprehensive experimental results demonstrate that the proposed framework enhances the recovery performance, and attains impressive single-lens controllable DoF imaging results, providing a seminal baseline for this field. The source code and the established dataset will be publicly available at https://github.com/XiaolongQian/DCDI.

Submitted 15 September, 2024; originally announced September 2024.

Comments: The source code and the established dataset will be publicly available at https://github.com/XiaolongQian/DCDI

arXiv:2409.06456 [pdf, other]

Attention-Based Beamformer For Multi-Channel Speech Enhancement

Authors: Jinglin Bai, Hao Li, Xueliang Zhang, Fei Chen

Abstract: Minimum Variance Distortionless Response (MVDR) is a classical adaptive beamformer that theoretically ensures the distortionless transmission of signals in the target direction, which makes it popular in real applications. Its noise reduction performance actually depends on the accuracy of the noise and speech spatial covariance matrices (SCMs) estimation. Time-frequency masks are often used to compute these SCMs. However, most mask-based beamforming methods typically assume that the sources are stationary, ignoring the case of moving sources, which leads to performance degradation. In this paper, we propose an attention-based mechanism to calculate the speech and noise SCMs and then apply MVDR to obtain the enhanced speech. To fully incorporate spatial information, the inplace convolution operator and frequency-independent LSTM are applied to facilitate SCMs estimation. The model is optimized in an end-to-end manner. Experiments demonstrate that the proposed method outperforms baselines with reduced computation and fewer parameters under various conditions.

Submitted 13 September, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

arXiv:2409.06245 [pdf, other]

A Two-Stage Band-Split Mamba-2 Network For Music Separation

Authors: Jinglin Bai, Yuan Fang, Jiajie Wang, Xueliang Zhang

Abstract: Music source separation (MSS) aims to separate mixed music into its distinct tracks, such as vocals, bass, drums, and more. MSS is considered to be a challenging audio separation task due to the complexity of music signals. Although the RNN and Transformer architecture are not perfect, they are commonly used to model the music sequence for MSS. Recently, Mamba-2 has already demonstrated high efficiency in various sequential modeling tasks, but its superiority has not been investigated in MSS. This paper applies Mamba-2 with a two-stage strategy, which introduces residual mapping based on the mask method, effectively compensating for the details absent in the mask and further improving separation performance. Experiments confirm the superiority of bidirectional Mamba-2 and the effectiveness of the two-stage network in MSS. The source code is publicly accessible at https://github.com/baijinglin/TS-BSmamba2.

Submitted 13 September, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

arXiv:2409.05809 [pdf, other]

A Flexible Framework for Universal Computational Aberration Correction via Automatic Lens Library Generation and Domain Adaptation

Authors: Qi Jiang, Yao Gao, Shaohua Gao, Zhonghua Yi, Lei Sun, Hao Shi, Kailun Yang, Kaiwei Wang, Jian Bai

Abstract: Emerging universal Computational Aberration Correction (CAC) paradigms provide an inspiring solution to light-weight and high-quality imaging without repeated data preparation and model training to accommodate new lens designs. However, the training databases in these approaches, i.e., the lens libraries (LensLibs), suffer from their limited coverage of real-world aberration behaviors. In this work, we set up an OmniLens framework for universal CAC, considering both the generalization ability and flexibility. OmniLens extends the idea of universal CAC to a broader concept, where a base model is trained for three cases, including zero-shot CAC with the pre-trained model, few-shot CAC with a little lens-specific data for fine-tuning, and domain adaptive CAC using domain adaptation for lens-descriptions-unknown lens. In terms of OmniLens's data foundation, we first propose an Evolution-based Automatic Optical Design (EAOD) pipeline to construct LensLib automatically, coined AODLib, whose diversity is enriched by an evolution framework, with comprehensive constraints and a hybrid optimization strategy for achieving realistic aberration behaviors. For network design, we introduce the guidance of high-quality codebook priors to facilitate zero-shot CAC and few-shot CAC, which enhances the model's generalization ability, while also boosting its convergence in a few-shot case. Furthermore, based on the statistical observation of dark channel priors in optical degradation, we design an unsupervised regularization term to adapt the base model to the target descriptions-unknown lens using its aberration images without ground truth. We validate OmniLens on 4 manually designed low-end lenses with various structures and aberration behaviors. Remarkably, the base model trained on AODLib exhibits strong generalization capabilities, achieving 97% of the lens-specific performance in a zero-shot setting.

Submitted 9 September, 2024; originally announced September 2024.

arXiv:2409.05784 [pdf, other]

Vector Quantized Diffusion Model Based Speech Bandwidth Extension

Authors: Yuan Fang, Jinglin Bai, Jiajie Wang, Xueliang Zhang

Abstract: Recent advancements in neural audio codec (NAC) unlock new potential in audio signal processing. Studies have increasingly explored leveraging the latent features of NAC for various speech signal processing tasks. This paper introduces the first approach to speech bandwidth extension (BWE) that utilizes the discrete features obtained from NAC. By restoring high-frequency details within highly compressed discrete tokens, this approach enhances speech intelligibility and naturalness. Based on Vector Quantized Diffusion, the proposed framework combines the strengths of advanced NAC, diffusion models, and Mamba-2 to reconstruct high-frequency speech components. Extensive experiments demonstrate that this method exhibits superior performance across both log-spectral distance and ViSQOL, significantly improving speech quality.

Submitted 14 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

Comments: 4 pages

arXiv:2407.09021 [pdf, other]

Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge

Authors: Jun Wei Yeow, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan

Abstract: This technical report details our systems submitted for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). We address only the audio-only SELD with SDE (SELDDE) task in this report. We propose to improve the existing ResNet-Conformer architectures with Squeeze-and-Excitation blocks in order to introduce additional forms of channel- and spatial-wise attention. In order to improve SELD performance, we also utilize the Spatial Cue-Augmented Log-Spectrogram (SALSA) features over the commonly used log-mel spectra features for polyphonic SELD. We complement the existing Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset with the audio channel swapping technique and synthesize additional data using the SpatialScaper generator. We also perform distance scaling in order to prevent large distance errors from contributing more towards the loss function. Finally, we evaluate our approach on the evaluation subset of the STARSS23 dataset.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Technical report for DCASE 2024 Challenge Task 3

arXiv:2407.03654 [pdf, other]

Mixstyle based Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data, while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on DCASE 2024 Challenge Task 4 dataset, measuring polyphonic SED score (PSDS) on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.

Submitted 29 August, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Submitted to ICASSP 2025

arXiv:2407.00291 [pdf, other]

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we consider training loss of our model specific to each datasets for their corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Technical report for DCASE 2024 Challenge Task 4

arXiv:2406.09695 [pdf, other]

Machine Learning-based Near-field Emitter Location Sensing via Grouped Hybrid Analog and Digital XL-MIMO Receive Array

Authors: Yifan Li, Feng Shu, Kang Wei, Jiatong Bai, Cunhua Pan, Yongpeng Wu, Yaoliang Song, Jiangzhou Wang

Abstract: As a green MIMO structure, the partially-connected hybrid analog and digital (PC-HAD) structure has been widely used in the far-field (FF) scenario for it can significantly reduce the hardware cost and complexity of large-scale or extremely large-scale MIMO (XL-MIMO) array. Recently, near-field (NF) emitter localization including direction-of-arrival (DOA) and range estimations has drawn a lot of attention, but is rarely explored via PC-HAD structure. In this paper, we first analyze the impact of PC-HAD structure on the NF emitter localization and observe that the phase ambiguity (PA) problem caused by PC-HAD structure can be removed inherently with low-latency in the NF scenario. To obtain the exact NF DOA estimation results, we propose a grouped PC-HAD structure, which is capable of dividing the NF DOA estimation problem into multiple FF DOA estimation problems via partitioning the large-scale PC-HAD array into small-scale groups. An angle calibration method is developed to address the inconsistency among these FF DOA estimation problems. Then, to eliminate PA and improve the NF emitter localization performance, we develop three machine learning (ML)-based methods, i.e., two low-complexity data-driven clustering-based methods and one model-driven regression method, namely RegNet. Furthermore, the Cramer-Rao lower bound (CRLB) of NF emitter localization for the proposed grouped PC-HAD structure is derived and reveals that localization performance will decrease with the increasing of the number of groups. The simulation results show that the proposed methods can achieve CRLB at different SNR regions, the RegNet has great performance advantages at low SNR regions and the clustering-based methods have much lower computation complexity.

Submitted 3 October, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07880 [pdf, other]

A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

Authors: Jun Bai, Di Wu, Tristan Shelley, Peter Schubel, David Twine, John Russell, Xuesen Zeng, Ji Zhang

Abstract: Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavours in addressing contemporary challenges associated with MD. Although conventional non-destructive testing methods such as ultrasonic and X-ray approaches have mitigated issues related to low efficiency in manual inspections, they struggle to meet the diverse requirements of high precision, real-time speed, automation, and intelligence. In recent years, propelled by the swift advancement of machine learning (ML) technologies, particularly exemplified by deep learning, ML has swiftly emerged as the core technology and a prominent research direction for material defect detection (MDD). Through a comprehensive review of the latest literature, we systematically survey the ML techniques applied in MDD into five categories: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning, and generative learning. We provide a detailed analysis of the main principles and techniques used, together with the advantages and potential challenges associated with these techniques. Furthermore, the survey focuses on the techniques for defect detection in composite materials, which are important types of materials enjoying increasingly wide application in various industries such as aerospace, automotive, construction, and renewable energy. Finally, the survey explores potential future directions in MDD utilizing ML technologies. This comprehensive survey not only consolidates existing literature on ML-based MDD technologies but also serves as a foundational reference for future researchers and industrial practitioners, providing valuable insights and guidance in developing advanced and efficient MDD systems.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.05696 [pdf, other]

Two Power Allocation and Beamforming Strategies for Active IRS-aided Wireless Network via Machine Learning

Authors: Qiankun Cheng, Jiatong Bai, Baihua Shi, Wei Gao, Feng Shu

Abstract: This paper models an active intelligent reflecting surface (IRS)-assisted wireless communication network, which has the ability to adjust power between BS and IRS. We aim to maximize the signal-to-noise ratio of the user by jointly designing the power allocation (PA) factor, the active IRS phase shift matrix, and the beamforming vector of the BS, subject to a total power constraint. To tackle this non-convex problem, we alternately optimize these variables. Firstly, the PA factor is designed via a polynomial regression method. Next, the BS beamforming vector and IRS phase shift matrix are obtained by Dinkelbach's transform and successive convex approximation methods. To reduce the high computational complexity of the above proposed algorithm, we maximize the achievable rate (AR) and use a closed-form fractional programming method to transform the original problem into an equivalent form. Then, we address this problem by iteratively optimizing the auxiliary variables and the BS and IRS beamforming. Simulation results show that the proposed algorithms can effectively improve the AR performance compared to fixed PA strategies, aided by passive IRS, and without IRS.

Submitted 9 June, 2024; originally announced June 2024.
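
Illustrative note on the entry above: the abstract names Dinkelbach's transform as one building block. The following is a minimal sketch of that transform on a scalar toy problem, not the authors' algorithm. The objective functions, the search grid, and the names dinkelbach, grid, x_opt and ratio are all assumptions chosen for illustration. The idea is to maximise a ratio f(x)/g(x), with g(x) > 0, by repeatedly solving the parametric problem max f(x) - lam * g(x) and updating lam to the ratio at the maximiser.

```python
def dinkelbach(f, g, candidates, tol=1e-9, max_iter=50):
    """Maximise f(x) / g(x) over a finite candidate set, assuming g(x) > 0."""
    lam = 0.0
    x = candidates[0]
    for _ in range(max_iter):
        # Parametric subproblem: maximise f(x) - lam * g(x) for the current lam.
        x = max(candidates, key=lambda c: f(c) - lam * g(c))
        # When the subproblem optimum is (numerically) zero, lam is the best ratio.
        if f(x) - lam * g(x) < tol:
            break
        lam = f(x) / g(x)
    return x, lam


# Toy ratio (x + 1) / (x^2 + 1) on a hypothetical grid over [0, 2].
grid = [i / 1000 for i in range(2001)]
x_opt, ratio = dinkelbach(lambda x: x + 1, lambda x: x * x + 1, grid)
print(x_opt, ratio)
```

For this toy ratio the analytic maximiser is x = sqrt(2) - 1, so the grid search should settle close to 0.414. In a beamforming setting the subproblem would be a vector optimisation rather than a grid search.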

arXiv:2405.09556 [pdf, other]

Co-learning-aided Multi-modal-deep-learning Framework of Passive DOA Estimators for a Heterogeneous Hybrid Massive MIMO Receiver

Authors: Jiatong Bai, Feng Shu, Qinghe Zheng, Bo Xu, Baihua Shi, Yiwen Chen, Weibin Zhang, Xianpeng Wang

Abstract: Due to their excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays have been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But they face two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by a hybrid analog-digital (HAD) structure. But HAD suffers from phase ambiguity, which leads to low efficiency or high latency. Does there exist a MIMO structure that offers low cost, low complexity and high time efficiency at the same time? To satisfy the three properties, a novel heterogeneous hybrid MIMO receiver structure integrating FD and heterogeneous HAD ($\rm{H}^2$AD-FD) is proposed and a corresponding multi-modal (MD)-learning framework is developed. The framework includes three major stages: 1) generate the candidate sets via root multiple signal classification (Root-MUSIC) or deep learning (DL); 2) infer the class of true solutions from candidate sets using machine learning (ML) methods; 3) fuse the two-part true solutions to achieve a better DOA estimation. The above process forms two methods named MD-Root-MUSIC and MDDL. To improve DOA estimation accuracy and reduce the clustering complexity, a co-learning-aided MD framework is proposed to form two enhanced methods named CoMDDL and CoMD-RootMUSIC. Moreover, the Cramer-Rao lower bound (CRLB) for the proposed $\rm{H}^2$AD-FD structure is also derived. Experimental results demonstrate that our proposed four methods could approach the CRLB for signal-to-noise ratio (SNR) > 0 dB and the proposed CoMDDL and MDDL perform better than CoMD-RootMUSIC and MD-RootMUSIC, particularly in the extremely low SNR region.

Submitted 12 June, 2024; v1 submitted 27 April, 2024; originally announced May 2024.
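
Illustrative note on the entry above: stage 2 of the framework infers which candidates are the true solutions. The sketch below is one simple way such a selection could work, assuming that each candidate set contains exactly one true angle and that the true angles from different sets lie close together while the ambiguous ones do not. It is a plain minimum-spread search, not the ML method of the paper. The function name pick_true_class and the candidate values are hypothetical.

```python
from itertools import product


def pick_true_class(candidate_sets):
    """Pick one candidate per set so that the chosen angles are as tight as possible.

    Returns the chosen angles and their mean (the fused estimate).
    """
    # Exhaustive search over one-candidate-per-set combinations; fine for small sets.
    best = min(product(*candidate_sets), key=lambda combo: max(combo) - min(combo))
    return list(best), sum(best) / len(best)


# Hypothetical candidate DOAs in degrees; only the values near 30 agree across sets.
candidate_sets = [
    [-41.8, 30.1, 75.0],
    [-12.0, 29.8, 52.3],
    [29.9, 64.2],
]
chosen, estimate = pick_true_class(candidate_sets)
print(chosen, estimate)
```

The exhaustive search grows exponentially with the number of sets, which is one reason a learned or clustering-based selection may be preferable in practice.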

arXiv:2405.09425 [pdf, other]

doi 10.1109/IEEECONF59524.2023.10476865

Robust Covariance-Based Activity Detection for Massive Access

Authors: Jianan Bai, Erik G. Larsson

Abstract: The wireless channel is undergoing continuous changes, and the block-fading assumption, despite its popularity in theoretical contexts, never holds true in practical scenarios. This discrepancy is particularly critical for user activity detection in grant-free random access, where joint processing across multiple resource blocks is usually undesirable. In this paper, we propose employing a low-dimensional approximation of the channel to capture variations over time and frequency and robustify activity detection algorithms. This approximation entails projecting channel fading vectors onto their principal directions to minimize the approximation order. Through numerical examples, we demonstrate a substantial performance improvement achieved by the resulting activity detection algorithm.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 5 pages, 11 figures. Asilomar SSC 2023 Conference

arXiv:2405.05565 [pdf, other]

doi 10.1109/TGRS.2024.3406711

Array SAR 3D Sparse Imaging Based on Regularization by Denoising Under Few Observed Data

Authors: Yangyang Wang, Xu Zhan, Jing Gao, Jinjie Yao, Shunjun Wei, JianSheng Bai

Abstract: Array synthetic aperture radar (SAR) three-dimensional (3D) imaging can obtain 3D information of the target region, which is widely used in environmental monitoring and scattering information measurement. In recent years, with the development of compressed sensing (CS) theory, sparse signal processing is used in array SAR 3D imaging. Compared with matched filter (MF), sparse SAR imaging can effectively improve image quality. However, sparse imaging based on handcrafted regularization functions suffers from target information loss in few observed SAR data. Therefore, in this article, a general 3D sparse imaging framework based on Regularization by Denoising (RED) and proximal gradient descent type method for array SAR is presented. Firstly, we construct explicit prior terms via state-of-the-art denoising operators instead of regularization functions, which can improve the accuracy of sparse reconstruction and preserve the structure information of the target. Then, different proximal gradient descent type methods are presented, including a generalized alternating projection (GAP) and an alternating direction method of multiplier (ADMM), which is suitable for high-dimensional data processing. Additionally, the proposed method has robust convergence, which can achieve sparse reconstruction of 3D SAR in few observed SAR data. Extensive simulations and real data experiments are conducted to analyze the performance of the proposed method. The experimental results show that the proposed method has superior sparse reconstruction performance.

Submitted 26 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.02942 [pdf, other]

Design, analysis, and manufacturing of a glass-plastic hybrid minimalist aspheric panoramic annular lens

Authors: Shaohua Gao, Qi Jiang, Yiqi Liao, Yi Qiu, Wanglei Ying, Kailun Yang, Kaiwei Wang, Benhao Zhang, Jian Bai

Abstract: We propose a high-performance glass-plastic hybrid minimalist aspheric panoramic annular lens (ASPAL) to solve several major limitations of the traditional panoramic annular lens (PAL), such as large size, high weight, and complex system. The field of view (FoV) of the ASPAL is 360°x(35°~110°) and the imaging quality is close to the diffraction limit. This large FoV ASPAL is composed of only 4 lenses. Moreover, we establish a physical structure model of PAL using the ray tracing method and study the influence of its physical parameters on compactness ratio. In addition, for the evaluation of local tolerances of annular surfaces, we propose a tolerance analysis method suitable for ASPAL. This analytical method can effectively analyze surface irregularities on annular surfaces and provide clear guidance on manufacturing tolerances for ASPAL. Benefiting from high-precision glass molding and injection molding aspheric lens manufacturing techniques, we finally manufactured 20 ASPALs in small batches. The weight of an ASPAL prototype is only 8.5 g. Our framework provides promising insights for the application of panoramic systems in space and weight-constrained environmental sensing scenarios such as intelligent security, micro-UAVs, and micro-robots.

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted to Optics & Laser Technology

arXiv:2405.01074 [pdf, other]

Stability Analysis of Interacting Wireless Repeaters

Authors: Erik G. Larsson, Jianan Bai

Abstract: We consider a wireless network with multiple single-antenna repeaters that amplify and instantaneously re-transmit the signals they receive to improve the channel rank and system coverage. Due to the positive feedback formed by inter-repeater interference, stability could become a critical issue. We investigate the problem of determining the maximum amplification gain that the repeaters can use without breaking the system stability. Specifically, we obtain a bound by using the Gershgorin disc theorem, which reveals that the maximum amplification gain is restricted by the sum of channel amplitude gains. We show by case studies the usefulness of the so-obtained bound and provide insights on how the repeaters should be deployed.

Submitted 7 July, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: Accepted to SPAWC 2024. 5 pages, 7 figures
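
Illustrative note on the entry above: the abstract states that a Gershgorin-disc bound ties the maximum gain to the sum of channel amplitude gains. The sketch below shows how such a bound can arise under a simplified model that is an assumption here, not taken from the paper: all repeaters use the same real gain, the feedback loop is x <- gain * H x, and H[i][j] is a hypothetical complex channel from repeater j to repeater i with a zero diagonal. Every eigenvalue of H then lies in a disc centred at 0 whose radius is an absolute row sum, so keeping the gain below 1 / (largest row sum) is sufficient for the loop to decay.

```python
def gershgorin_gain_bound(H):
    """Sufficient (conservative) gain limit from the Gershgorin disc theorem."""
    # With a zero diagonal, the spectral radius is at most the largest absolute row sum.
    max_row_sum = max(sum(abs(h) for h in row) for row in H)
    return float("inf") if max_row_sum == 0 else 1.0 / max_row_sum


def loop_signal_norm(H, gain, steps=200):
    """Circulate a unit signal through the loop x <- gain * H x and return its peak size."""
    n = len(H)
    x = [1.0 + 0j] * n
    for _ in range(steps):
        x = [gain * sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]
    return max(abs(v) for v in x)


# Hypothetical inter-repeater channel coefficients (zero self-coupling).
H = [
    [0, 0.02 + 0.01j, 0.005],
    [0.015j, 0, 0.01 - 0.02j],
    [0.01, 0.03, 0],
]
bound = gershgorin_gain_bound(H)
print(bound, loop_signal_norm(H, 0.9 * bound))
```

Because the bound is only sufficient, gains somewhat above it may still be stable. The exact limit is set by the spectral radius of H.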

arXiv:2404.00136 [pdf, other]

Tunable X-band opto-electronic synthesizer with ultralow phase noise

Authors: Igor Kudelin, Pedram Shirmohammadi, William Groman, Samin Hanifi, Megan L. Kelleher, Dahyeon Lee, Takuma Nakamura, Charles A. McLemore, Alexander Lind, Dylan Meyer, Junwu Bai, Joe C. Campbell, Steven M. Bowers, Franklyn Quinlan, Scott A. Diddams

Abstract: Modern communication, navigation, and radar systems rely on low noise and frequency-agile microwave sources. In this application space, photonic systems provide an attractive alternative to conventional microwave synthesis by leveraging high spectral purity lasers and optical frequency combs to generate microwaves with exceedingly low phase noise. However, these photonic techniques suffer from a lack of frequency tunability, and also have substantial size, weight, and power requirements that largely limit their use to laboratory settings. In this work, we address these shortcomings with a hybrid opto-electronic approach that combines simplified optical frequency division with direct digital synthesis to produce tunable low-phase-noise microwaves across the entire X-band. This results in exceptional phase noise at 10 GHz of -156 dBc/Hz at 10 kHz offset and fractional frequency instability of 1x10^-13 at 0.1 s. Spot tuning away from 10 GHz by 500 MHz, 1 GHz, and 2 GHz, yields phase noise at 10 kHz offset of -150 dBc/Hz, -146 dBc/Hz, and -140 dBc/Hz, respectively. The synthesizer architecture is fully compatible with integrated photonic implementations that will enable a versatile microwave source in a chip-scale package. Together, these advances illustrate an impactful and practical synthesis technique that shares the combined benefits of low timing noise provided by photonics and the frequency agility of established digital synthesis.

Submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.20130 [pdf, other]

Sound event localization and classification using WASN in Outdoor Environment

Authors: Dongzhe Zhang, Jianfeng Chen, Jisheng Bai, Mou Wang

Abstract: Deep learning-based sound event localization and classification is an emerging research area within wireless acoustic sensor networks. However, current methods for sound event localization and classification typically rely on a single microphone array, making them susceptible to signal attenuation and environmental noise, which limits their monitoring range. Moreover, methods using multiple microphone arrays often focus solely on source localization, neglecting the aspect of sound event classification. In this paper, we propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of sound source. We introduce a Soundmap feature to capture spatial information across multiple frequency bands. We also use the Gammatone filter to generate acoustic features more suitable for outdoor environments. Furthermore, we integrate attention mechanisms to learn channel-wise relationships and temporal dependencies within the acoustic features. To evaluate our proposed method, we conduct experiments using simulated datasets with different levels of noise and size of monitoring areas, as well as different arrays and source positions. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods in both sound event classification and sound source localization tasks. And we provide further analysis to explain the reasons for the observed errors.

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.18707 [pdf, other]

Connections between Reachability and Time Optimality

Authors: Juho Bae, Ji Hoon Bai, Byung-Yoon Lee, Jun-Yong Lee, Chang-Hun Lee

Abstract: This paper presents the concept of an equivalence relation between the set of optimal control problems. By leveraging this concept, we show that the boundary of the reachability set can be constructed by the solutions of time optimal problems. Alongside, a more generalized equivalence theorem is presented together. The findings facilitate the use of solution structures from a certain class of optimal control problems to address problems in corresponding equivalent classes. As a byproduct, we state and prove the construction methods of the reachability sets of three-dimensional curves with prescribed curvature bound. The findings are twofold: Firstly, we prove that any boundary point of the reachability set, with the terminal direction taken into account, can be accessed via curves of H, CSC, CCC, or their respective subsegments, where H denotes a helicoidal arc, C a circular arc with maximum curvature, and S a straight segment. Secondly, we show that any boundary point of the reachability set, without considering the terminal direction, can be accessed by curves of CC, CS, or their respective subsegments. These findings extend the developments presented in literature regarding planar curves, or Dubins car dynamics, into spatial curves in $\mathbb{R}^3$. For higher dimensions, we confirm that the problem of identifying the reachability set of curvature bounded paths subsumes the well-known Markov-Dubins problem. These advancements in understanding the reachability of curvature bounded paths in $\mathbb{R}^3$ hold significant practical implications, particularly in the contexts of mission planning problems and time optimal guidance.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Submitted to Automatica

arXiv:2402.02694 [pdf, other]

Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift

Authors: Jisheng Bai, Mou Wang, Haohe Liu, Han Yin, Yafei Jia, Siwei Huang, Yutong Du, Dongzhe Zhang, Dongyuan Shi, Woon-Seng Gan, Mark D. Plumbley, Susanto Rahardja, Bin Xiang, Jianfeng Chen

Abstract: Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although this task, in recent years, has achieved substantial progress in device generalization, the challenge of domain shift between different geographical regions, involving discrepancies such as time, space, culture, and language, remains insufficiently explored at present. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study the possible ways to utilize these unlabelled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.

Submitted 28 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.14304 [pdf, other]

Constraint-Aware Mesh Refinement Method by Reachability Set Envelope of Curvature Bounded Paths

Authors: Juho Bae, Ji Hoon Bai, Byung-Yoon Lee, Jun-Yong Lee

Abstract: This paper presents an enhanced direct-method-based approach for the real-time solution of optimal control problems to handle path constraints, such as obstacles. The principal contributions of this work are twofold: first, the existing methods for constructing reachability sets in the literature are extended to derive the envelope of these sets, which determines the region swept by all feasible trajectories between adjacent sample points. Second, we propose a novel method to guarantee constraint violation-free between discrete states in two dimensions through mesh refinement approach. To illustrate the effectiveness of the proposed methodology, numerical simulations are conducted on real-time path planning for fixed-wing unmanned aerial vehicles.

Submitted 4 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: Preprint submitted to Automatica

arXiv:2401.08992 [pdf, other]

Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

Authors: Junwen Bai, Bo Li, Qiujia Li, Tara N. Sainath, Trevor Strohman

Abstract: The end-to-end ASR model is often desired in the streaming multilingual scenario since it is easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. Meanwhile, the heterogeneous nature and imbalanced data abundance of different languages may cause performance degradation, leading to asynchronous peak performance for different languages during training, especially on tail ones. Sometimes even the data itself may become unavailable as a result of the enhanced privacy protection. Existing work tend to significantly increase the model size or learn language-specific decoders to accommodate each language separately. In this study, we explore simple yet effective Language-Dependent Adapter (LDA) finetuning under a cascaded Conformer transducer framework enhanced by teacher pseudo-labeling for tail languages in the streaming multilingual ASR. The adapter only accounts for 0.4% of the full model per language. It is plugged into the frozen foundation model and is the only trainable module during the finetuning process with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. The model performance is validated on a challenging multilingual dictation dataset, which includes 39 tail languages across Latin, Greek, Arabic, etc. Our proposed method brings 12.2% word error rate reduction on average and up to 37.5% on a single locale. Furthermore, we show that our parameter-efficient LDA can match the quality of the full model finetuning, thus greatly alleviating the asynchronous peak performance issue.

Submitted 17 January, 2024; originally announced January 2024.

Comments: Accepted to ICASSP 2024

arXiv:2401.08678 [pdf, other]

Sub-band and Full-band Interactive U-Net with DPRNN for Demixing Cross-talk Stereo Music

Authors: Han Yin, Mou Wang, Jisheng Bai, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen

Abstract: This paper presents a detailed description of our proposed methods for the ICASSP 2024 Cadenza Challenge. Experimental results show that the proposed system can achieve better performance than official baselines.

Submitted 10 January, 2024; originally announced January 2024.

Comments: Submitted to ICASSP 2024

arXiv:2312.16772 [pdf, other]

Unsupversied feature correlation model to predict breast abnormal variation maps in longitudinal mammograms

Authors: Jun Bai, Annie Jin, Madison Adams, Clifford Yang, Sheida Nabavi

Abstract: Breast cancer continues to be a significant cause of mortality among women globally. Timely identification and precise diagnosis of breast abnormalities are critical for enhancing patient prognosis. In this study, we focus on improving the early detection and accurate diagnosis of breast abnormalities, which is crucial for improving patient outcomes and reducing the mortality rate of breast cancer. To address the limitations of traditional screening methods, a novel unsupervised feature correlation network was developed to predict maps indicating breast abnormal variations using longitudinal 2D mammograms. The proposed model utilizes the reconstruction process of current year and prior year mammograms to extract tissue from different areas and analyze the differences between them to identify abnormal variations that may indicate the presence of cancer. The model is equipped with a feature correlation module, an attention suppression gate, and a breast abnormality detection module that work together to improve the accuracy of the prediction. The proposed model not only provides breast abnormal variation maps, but also distinguishes between normal and cancer mammograms, making it more advanced compared to the state-of-the-art baseline models. The results of the study show that the proposed model outperforms the baseline models in terms of Accuracy, Sensitivity, Specificity, Dice score, and cancer detection rate.

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2311.14068 [pdf, other]

Interactive Dual-Conformer with Scene-Inspired Mask for Soft Sound Event Detection

Authors: Han Yin, Jisheng Bai, Mou Wang, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen

Abstract: Traditional binary hard labels for sound event detection (SED) lack details about the complexity and variability of sound event distributions. Recently, a novel annotation workflow was proposed to generate fine-grained non-binary soft labels, resulting in a new real-life dataset named MAESTRO Real for SED. In this paper, we first propose an interactive dual-conformer (IDC) module, in which a cross-interaction mechanism is applied to effectively exploit the information from soft labels. In addition, a novel scene-inspired mask (SIM) based on soft labels is incorporated for more precise SED predictions. The SIM is initially generated through a statistical approach, referred to as SIM-V1. However, the fixed artificial mask may mismatch the SED model, resulting in limited effectiveness. Therefore, we further propose SIM-V2, which employs a word embedding model for adaptive SIM estimation. Experimental results show that the proposed IDC module can effectively utilize the information from soft labels, and the integration of SIM-V1 can further improve the accuracy. In addition, the impact of different word embedding dimensions on SIM-V2 is explored, and the results show that an appropriate dimension can enable SIM-V2 to achieve better performance than SIM-V1. In DCASE 2023 Challenge Task4B, the proposed system achieved the top ranking performance on the evaluation dataset of MAESTRO Real.

Submitted 7 December, 2023; v1 submitted 23 November, 2023; originally announced November 2023.

Comments: to be improved (unfinished)

arXiv:2311.12371 [pdf, other]

AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning

Authors: Jisheng Bai, Han Yin, Mou Wang, Dongyuan Shi, Woon-Seng Gan, Jianfeng Chen, Susanto Rahardja

Abstract: Previous studies in automated audio captioning have faced difficulties in accurately capturing the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, a large language models (LLMs)-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-semantic audio Transformer by incorporating contrastive learning between hybrid acoustic representations. We then leverage LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Finally, we evaluate the AudioLog system on two datasets with both scene and event annotations. Experiments show that the proposed system achieves exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analysis of the prompts to LLMs demonstrates that AudioLog can effectively summarize long audio sequences. To the best of our knowledge, this approach is the first attempt to leverage LLMs for summarizing long audio sequences.

Submitted 4 January, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2309.07566 [pdf, other]

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Authors: Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao

Abstract: Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and speaker similarity. Audio samples are available at http://stylelm.github.io/ .

Submitted 19 July, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: accepted by ACL SRW 2024

arXiv:2308.08100 [pdf, other]

A New Heterogeneous Hybrid Massive MIMO Receiver with An Intrinsic Ability of Removing Phase Ambiguity of DOA Estimation via Machine Learning

Authors: Feng Shu, Baihua Shi, Yiwen Chen, Jiatong Bai, Yifan Li, Tingting Liu, Zhu Han, Xiaohu You

Abstract: Massive multiple input multiple output (MIMO) antenna arrays incur a huge amount of circuit cost and computational complexity. To satisfy the needs of high precision and low cost in future green wireless communication, the conventional hybrid analog and digital MIMO receive structure emerges as a natural choice. However, it suffers from phase ambiguity in direction of arrival (DOA) estimation and requires at least two time-slots to complete a single DOA measurement, with the first time-slot generating the set of candidate solutions and the remaining ones finding the true direction by received beamforming over this set. This leads to low time-efficiency. To address this problem, a new heterogeneous sub-connected hybrid analog and digital (HAD) MIMO structure is proposed with an intrinsic ability of removing phase ambiguity, and a corresponding new framework is developed to implement rapid high-precision DOA estimation using only a single time-slot. This framework consists of two steps: 1) form a set of candidate solutions using existing methods like MUSIC; 2) find the class of the true solutions and compute the class mean. To infer the set of true solutions, we propose two new clustering methods: weight global minimum distance (WGMD) and weight local minimum distance (WLMD). We also enhance two classic clustering methods: accelerating local weighted k-means (ALW-K-means) and improved density. Additionally, the corresponding closed-form expression of the Cramer-Rao lower bound (CRLB) is derived. Simulation results show that the proposed frameworks using the above four clustering methods can approach the CRLB in almost all SNR regions except at extremely low SNR. The four clustering methods rank in decreasing order of accuracy as follows: WGMD, improved DBSCAN, ALW-K-means and WLMD.

Submitted 28 May, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.05305 [pdf, other]

From CNN to Transformer: A Review of Medical Image Segmentation Models

Authors: Wenjian Yao, Jiajun Bai, Wei Liao, Yuheng Chen, Mengjuan Liu, Yao Xie

Abstract: Medical image segmentation is an important step in medical image analysis, especially as a crucial prerequisite for efficient disease diagnosis and treatment. The use of deep learning for image segmentation has become a prevalent trend. The most widely adopted approach currently is U-Net and its variants. Additionally, with the remarkable success of pre-trained models in natural language processing tasks, transformer-based models like TransUNet have achieved desirable performance on multiple medical image segmentation datasets. In this paper, we conduct a survey of the four most representative medical image segmentation models in recent years. We theoretically analyze the characteristics of these models and quantitatively evaluate their performance on two benchmark datasets (i.e., Tuberculosis Chest X-rays and ovarian tumors). Finally, we discuss the main challenges and future trends in medical image segmentation. Our work can assist researchers in the related field to quickly establish medical segmentation models tailored to specific regions.

Submitted 9 August, 2023; originally announced August 2023.

Comments: 18 pages, 8 figures

arXiv:2307.08239 [pdf, other]

Dynamic Kernel Convolution Network with Scene-dedicate Training for Sound Event Localization and Detection

Authors: Siwei Huang, Jianfeng Chen, Jisheng Bai, Yafei Jia, Dongzhe Zhang

Abstract: DNN-based methods have shown high performance in sound event localization and detection (SELD). However, in real spatial sound scenes, reverberation and the imbalanced presence of various sound events increase the complexity of the SELD task. In this paper, we propose an effective SELD system for real spatial scenes. In our approach, a dynamic kernel convolution module is introduced after the convolution blocks to adaptively model the channel-wise features with different receptive fields. Secondly, we incorporate the SELDnet and EINv2 frameworks into the proposed SELD system with multi-track ACCDOA. Moreover, two scene-dedicated strategies are introduced into the training stage to improve the generalization of the system in realistic spatial sound scenes. Finally, we apply data augmentation methods to extend the dataset using channel rotation and spatial data synthesis. Four joint metrics are used to evaluate the performance of the SELD system on the Sony-TAu Realistic Spatial Soundscapes 2022 dataset. Experimental results show that the proposed systems outperform the fixed-kernel convolution SELD systems. In addition, the proposed system achieved an SELD score of 0.348 in the DCASE SELD task and surpassed the SOTA methods.

Submitted 17 July, 2023; originally announced July 2023.

Comments: 11 pages, 6 figures

arXiv:2307.01124 [pdf]

Cross-modality Attention Adapter: A Glioma Segmentation Fine-tuning Method for SAM Using Multimodal Brain MR Images

Authors: Xiaoyu Shi, Shurong Chai, Yinhao Li, Jingliang Cheng, Jie Bai, Guohua Zhao, Yen-Wei Chen

Abstract: According to the 2021 World Health Organization (WHO) Classification scheme for gliomas, glioma segmentation is a very important basis for diagnosis and genotype prediction. In general, 3D multimodal brain MRI is an effective diagnostic tool. In the past decade, there has been an increase in the use of machine learning, particularly deep learning, for medical image processing. Thanks to the development of foundation models, models pre-trained with large-scale datasets have achieved better results on a variety of tasks. However, for medical images with small dataset sizes, deep learning methods struggle to achieve better results on real-world image datasets. In this paper, we propose a cross-modality attention adapter based on multimodal fusion to fine-tune the foundation model to accomplish the task of glioma segmentation in multimodal MRI brain images with better results. The effectiveness of the proposed method is validated via our private glioma data set from the First Affiliated Hospital of Zhengzhou University (FHZU) in Zhengzhou, China. Our proposed method is superior to current state-of-the-art methods with a Dice of 88.38% and Hausdorff distance of 10.64, thereby exhibiting a 4% increase in Dice to segment the glioma region for glioma treatment.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.10065 [pdf, other]

Taming Diffusion Models for Music-driven Conducting Motion Generation

Authors: Zhuoran Zhao, Jinbin Bai, Delong Chen, Debang Wang, Yubo Pan

Abstract: Generating the motion of orchestral conductors from a given piece of symphony music is a challenging task since it requires a model to learn semantic music features and capture the underlying distribution of real conducting motion. Prior works have applied Generative Adversarial Networks (GAN) to this task, but the promising diffusion model, which recently showed its advantages in terms of both training stability and output quality, has not been exploited in this context. This paper presents Diffusion-Conductor, a novel DDIM-based approach for music-driven conducting motion generation, which integrates the diffusion model into a two-stage learning framework. We further propose a random masking strategy to improve the feature robustness, and use a pair of geometric loss functions to impose additional regularizations and increase motion diversity. We also design several novel metrics, including Frechet Gesture Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive evaluation of the generated motion. Experimental results demonstrate the advantages of our model.

Submitted 13 November, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: Accepted by AAAI 2023 Summer Symposium with Best Paper Award

arXiv:2306.04987 [pdf, other]

Convolutional Recurrent Neural Network with Attention for 3D Speech Enhancement

Authors: Han Yin, Jisheng Bai, Mou Wang, Siwei Huang, Yafei Jia, Jianfeng Chen

Abstract: 3D speech enhancement can effectively improve the auditory experience and plays a crucial role in augmented reality technology. However, traditional convolutional-based speech enhancement methods have limitations in extracting dynamic voice information. In this paper, we incorporate a dual-path recurrent neural network block into the U-Net to iteratively extract dynamic audio information in both the time and frequency domains. An attention mechanism is also proposed to fuse the original signal, reference signal, and generated masks. Moreover, we introduce a loss function to simultaneously optimize the network in the time-frequency and time domains. Experimental results show that our system outperforms the state-of-the-art systems on the dataset of the ICASSP L3DAS23 challenge.

Submitted 19 November, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Published on IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC 2023)

arXiv:2306.01303 [pdf, ps, other]

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Authors: Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Jinfeng Bai

Abstract: Multilingual self-supervised speech representation models have greatly enhanced the speech recognition performance for low-resource languages, and the compression of these huge models has also become a crucial prerequisite for their industrial application. In this paper, we propose DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data. We also design a layer-jumping initialization method to fully leverage the teacher's pre-trained weights. Experiments on 2 kinds of teacher models and 15 low-resource languages show that our method can reduce the parameters by 50% while maintaining cross-lingual representation ability. Our method is proven to be generalizable to various languages/teacher models and has the potential to improve the cross-lingual performance of the English pre-trained models.

Submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2302.12186 [pdf, other]

RSFDM-Net: Real-time Spatial and Frequency Domains Modulation Network for Underwater Image Enhancement

Authors: Jingxia Jiang, Jinbin Bai, Yun Liu, Junjie Yin, Sixiang Chen, Tian Ye, Erkang Chen

Abstract: Underwater images typically experience mixed degradations of brightness and structure caused by the absorption and scattering of light by suspended particles. To address this issue, we propose a Real-time Spatial and Frequency Domains Modulation Network (RSFDM-Net) for the efficient enhancement of colors and details in underwater images. Specifically, our proposed conditional network is designed with an Adaptive Fourier Gating Mechanism (AFGM) and a Multiscale Convolutional Attention Module (MCAM) to generate vectors carrying low-frequency background information and high-frequency detail features, which effectively promote the network to model global background information and local texture details. To more precisely correct the color cast and low saturation of the image, we introduce a Three-branch Feature Extraction (TFE) block in the primary net that processes images pixel by pixel to integrate the color information extended by the same channel (R, G, or B). This block consists of three small branches, each of which has its own weights. Extensive experiments demonstrate that our network significantly outperforms state-of-the-art methods in both visual quality and quantitative metrics.

Submitted 23 February, 2023; originally announced February 2023.

arXiv:2302.01496 [pdf, ps, other]

Efficient Domain Adaptation for Speech Foundation Models

Authors: Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, Francoise Beaufays

Abstract: Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose the joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on additional 300M supervised in-domain data.

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2212.12844 [pdf, other]

Weakly-Supervised Deep Learning Model for Prostate Cancer Diagnosis and Gleason Grading of Histopathology Images

Authors: Mohammad Mahdi Behzadi, Mohammad Madani, Hanzhang Wang, Jun Bai, Ankit Bhardwaj, Anna Tarakanova, Harold Yamase, Ga Hie Nam, Sheida Nabavi

Abstract: Prostate cancer is the most common cancer in men worldwide and the second leading cause of cancer death in the United States. One of the prognostic features in prostate cancer is the Gleason grading of histopathology images. The Gleason grade is assigned by pathologists based on tumor architecture on Hematoxylin and Eosin (H&E) stained whole slide images (WSI). This process is time-consuming and has known interobserver variability. In the past few years, deep learning algorithms have been used to analyze histopathology images, delivering promising results for grading prostate cancer. However, most of the algorithms rely on fully annotated datasets, which are expensive to generate. In this work, we propose a novel weakly-supervised algorithm to classify prostate cancer grades. The proposed algorithm consists of three steps: (1) extracting discriminative areas in a histopathology image by employing the Multiple Instance Learning (MIL) algorithm based on Transformers, (2) representing the image by constructing a graph using the discriminative patches, and (3) classifying the image into its Gleason grades by developing a Graph Convolutional Neural Network (GCN) based on the gated attention mechanism. We evaluated our algorithm using publicly available datasets, including TCGAPRAD, PANDA, and Gleason 2019 challenge datasets. We also cross validated the algorithm on an independent dataset. Results show that the proposed model achieved state-of-the-art performance in the Gleason grading task in terms of accuracy, F1 score, and Cohen's kappa. The code is available at https://github.com/NabaviLab/Prostate-Cancer.

Submitted 24 December, 2022; originally announced December 2022.

arXiv:2211.04445 [pdf, other]

Physics-Constrained Backdoor Attacks on Power System Fault Localization

Authors: Jianing Bai, Ren Wang, Zuyi Li

Abstract: The advances in deep learning (DL) techniques have the potential to deliver transformative technological breakthroughs to numerous complex tasks in modern power systems that suffer from increasing uncertainty and nonlinearity. However, the vulnerability of DL has yet to be thoroughly explored in power system tasks under various physical constraints. This work, for the first time, proposes a novel physics-constrained backdoor poisoning attack, which embeds the undetectable attack signal into the learned model and only performs the attack when it encounters the corresponding signal. The paper illustrates the proposed attack on the real-time fault line localization application. Furthermore, the simulation results on the 68-bus power system demonstrate that DL-based fault line localization methods are not robust to our proposed attack, indicating that backdoor poisoning attacks pose real threats to DL implementations in power systems. The proposed attack pipeline can be easily generalized to other power system tasks.

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2211.01087 [pdf, other]

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Authors: Kun Song, Yongmao Zhang, Yi Lei, Jian Cong, Hanzhao Li, Lei Xie, Gang He, Jinfeng Bai

Abstract: Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages in generating raw waveform conditioned on mel-spectrogram with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis by applying the time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch problem caused by the ground-truth spectrograms in the training phase and the predicted spectrograms in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the predicted mel-spectrogram from the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision to the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve the harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experiments show that DSPGAN significantly outperforms the compared approaches and it can generate high-fidelity speech for various TTS models trained using diverse data.

Submitted 28 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted to ICASSP 2023

arXiv:2210.06091 [pdf]

Summary on the ISCSLP 2022 Chinese-English Code-Switching ASR Challenge

Authors: Shuhao Deng, Chengfei Li, Jinfeng Bai, Qingqing Zhang, Wei-Qiang Zhang, Runyan Yang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Abstract: Code-switching automatic speech recognition has become one of the most challenging and most valuable scenarios of automatic speech recognition, due to the code-switching phenomenon between multiple languages and its frequent occurrence in daily life. The ISCSLP 2022 Chinese-English Code-Switching Automatic Speech Recognition (CSASR) Challenge aims to promote the development of code-switching automatic speech recognition. The ISCSLP 2022 CSASR challenge provided participants with two training sets, the TAL_CSASR corpus and the MagicData-RAMC corpus, together with a development set and a test set, which are used for CSASR model training and evaluation. Along with the challenge, we also provide the baseline system performance for reference. More than 40 teams participated in this challenge, and the winning team achieved a 16.70% Mixture Error Rate (MER) on the test set, a 9.8% absolute MER improvement over the baseline system. In this paper, we describe the datasets, the associated baseline system and the requirements, and summarize the CSASR challenge results and the major techniques and tricks used in the submitted systems.

Submitted 13 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: accepted by ISCSLP 2022

arXiv:2208.03421 [pdf, other]

doi 10.1016/j.dsp.2023.103939

SSDPT: Self-Supervised Dual-Path Transformer for Anomalous Sound Detection in Machine Condition Monitoring

Authors: Jisheng Bai, Jianfeng Chen, Mou Wang, Muhammad Saad Ayub, Qingli Yan

Abstract: Anomalous sound detection for machine condition monitoring has great potential in the development of Industry 4.0. However, these anomalous sounds of machines are usually unavailable in normal conditions. Therefore, the models employed have to learn acoustic representations with normal sounds for training, and detect anomalous sounds while testing. In this article, we propose a self-supervised dual-path Transformer (SSDPT) network to detect anomalous sounds in machine monitoring. The SSDPT network splits the acoustic features into segments and employs several DPT blocks for time and frequency modeling. DPT blocks use attention modules to alternately model the interactive information about the frequency and temporal components of the segmented acoustic features. To address the problem of lack of anomalous sound, we adopt a self-supervised learning approach to train the network with normal sound. Specifically, this approach randomly masks and reconstructs the acoustic features, and jointly classifies machine identity information to improve the performance of anomalous sound detection. We evaluated our method on the DCASE2021 task2 dataset. The experimental results show that the SSDPT network achieves a significant increase in the harmonic mean AUC score, in comparison to present state-of-the-art methods of anomalous sound detection.

Submitted 5 August, 2022; originally announced August 2022.

arXiv:2208.03070 [pdf, ps, other]

doi 10.1109/LWC.2022.3197053

Activity Detection in Distributed MIMO: Distributed AMP via Likelihood Ratio Fusion

Authors: Jianan Bai, Erik G. Larsson

Abstract: We develop a new algorithm for activity detection for grant-free multiple access in distributed multiple-input multiple-output (MIMO). The algorithm is a distributed version of the approximate message passing (AMP) based on a soft combination of likelihood ratios computed independently at multiple access points. The underpinning theoretical basis of our algorithm is a new observation that we made about the state evolution in the AMP. Specifically, with a minimum mean-square error denoiser, the state maintains a block-diagonal structure whenever the covariance matrices of the signals have such a structure. We show by numerical examples that the algorithm outperforms competing schemes from the literature.

Submitted 22 September, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

Comments: 5 pages, 2 figures. This paper has been accepted for publication in IEEE Wireless Communications Letters. Code available at https://github.com/jiananbai/distributed-AMP

arXiv:2206.13135 [pdf]

TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline

Authors: Chengfei Li, Shuhao Deng, Yaoping Wang, Guangjing Wang, Yaguang Gong, Changbin Chen, Jinfeng Bai

Abstract: This paper introduces a new corpus for Mandarin-English code-switching speech recognition, the TALCS corpus, suitable for training and evaluating code-switching speech recognition systems. The TALCS corpus is derived from real online one-to-one English teaching scenes in TAL education group and contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, the TALCS corpus is the largest well-labeled Mandarin-English code-switching open source automatic speech recognition (ASR) dataset in the world. In this paper, we introduce the recording procedure in detail, including audio capturing devices and corpus environments. The TALCS corpus is freely available for download under a permissive license. Using the TALCS corpus, we conduct ASR experiments in two popular speech recognition toolkits, ESPnet and Wenet, to build a baseline system. The Mixture Error Rate (MER) performance of the two speech recognition toolkits is compared on the TALCS corpus. The experimental results imply that the quality of the audio recordings and transcriptions is promising and that the baseline system is workable.

Submitted 27 June, 2022; originally announced June 2022.

Comments: accepted by INTERSPEECH 2022

arXiv:2205.05570 [pdf, other]

Review on Panoramic Imaging and Its Applications in Scene Understanding

Authors: Shaohua Gao, Kailun Yang, Hao Shi, Kaiwei Wang, Jian Bai

Abstract: With the rapid development of high-speed communication and artificial intelligence technologies, human perception of real-world scenes is no longer limited to the use of small Field of View (FoV) and low-dimensional scene detection devices. Panoramic imaging emerges as the next generation of innovative intelligent instruments for environmental perception and measurement. However, while satisfying the need for large-FoV photographic imaging, panoramic imaging instruments are expected to have high resolution, no blind area, miniaturization, and multidimensional intelligent perception, and can be combined with artificial intelligence methods towards the next generation of intelligent instruments, enabling deeper understanding and more holistic perception of 360-degree real-world surrounding environments. Fortunately, recent advances in freeform surfaces, thin-plate optics, and metasurfaces provide innovative approaches to address human perception of the environment, offering promising ideas beyond conventional optical imaging. In this review, we begin with introducing the basic principles of panoramic imaging systems, and then describe the architectures, features, and functions of various panoramic imaging systems. Afterwards, we discuss in detail the broad application prospects and great design potential of freeform surfaces, thin-plate optics, and metasurfaces in panoramic imaging. We then provide a detailed analysis on how these techniques can help enhance the performance of panoramic imaging systems. We further offer a detailed analysis of applications of panoramic imaging in scene understanding for autonomous driving and robotics, spanning panoramic semantic image segmentation, panoramic depth estimation, panoramic visual localization, and so on. Finally, we cast a perspective on future potential and research directions for panoramic imaging instruments.

Submitted 14 October, 2022; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: Accepted to IEEE Transactions on Instrumentation and Measurement. 34 pages, 15 figures, 420 references

arXiv:2203.08350 [pdf, other]

doi 10.1109/TCDS.2022.3222350

A Squeeze-and-Excitation and Transformer based Cross-task System for Environmental Sound Recognition

Authors: Jisheng Bai, Jianfeng Chen, Mou Wang, Muhammad Saad Ayub

Abstract: Environmental sound recognition (ESR) is an emerging research topic in audio pattern recognition. Many tasks are presented to resort to computational models for ESR in real-life applications. However, current models are usually designed for individual tasks, and are not robust and applicable to other tasks. Cross-task models, which promote unified knowledge modeling across various tasks, have not been thoroughly investigated. In this article, we propose a cross-task model for three different tasks of ESR: 1) acoustic scene classification; 2) urban sound tagging; and 3) anomalous sound detection. An architecture named SE-Trans is presented that uses attention mechanism-based Squeeze-and-Excitation and Transformer encoder modules to learn the channelwise relationship and temporal dependencies of the acoustic features. FMix is employed as the data augmentation method that improves the performance of ESR. Evaluations for the three tasks are conducted on the recent databases of detection and classification of acoustic scenes and event challenges. The experimental results show that the proposed cross-task model achieves state-of-the-art performance on all tasks. Further analysis demonstrates that the proposed cross-task model can effectively utilize acoustic knowledge across different ESR tasks.

Submitted 21 November, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

Showing 1–50 of 61 results for author: Bai, J