Search | arXiv e-print repository

EM++: A parameter learning framework for stochastic switching systems

Authors: Renzi Wang, Alexander Bodard, Mathijs Schuurmans, Panagiotis Patrinos

Abstract: This paper proposes a general switching dynamical system model, and a custom majorization-minimization-based algorithm EM++ for identifying its parameters. For certain families of distributions, such as Gaussian distributions, this algorithm reduces to the well-known expectation-maximization method. We prove global convergence of the algorithm under suitable assumptions, thus addressing an important open issue in the switching system identification literature. The effectiveness of both the proposed model and algorithm is validated through extensive numerical experiments.

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2407.11223 [pdf, other]

DD_RoTIR: Dual-Domain Image Registration via Image Translation and Hierarchical Feature-matching

Authors: Ruixiong Wang, Stephen Cross, Alin Achim

Abstract: Microscopy images obtained from multiple camera lenses or sensors in biological experiments provide a comprehensive understanding of objects from diverse perspectives. However, using multiple microscope setups increases the risk of misalignment of identical target features across different modalities, making multimodal image registration crucial. In this work, we build upon previous successes in biological image translation (XAcGAN) and mono-modal image registration (RoTIR) to develop a deep learning model, Dual-Domain RoTIR (DD_RoTIR), specifically designed to address these challenges. While GAN-based translation models are often considered inadequate for multimodal image registration, we enhance registration accuracy by employing a feature-matching algorithm based on Transformers and rotation equivariant networks. Additionally, hierarchical feature matching is utilized to tackle the complexities of multimodal image registration. Our results demonstrate that the DD_RoTIR model exhibits strong applicability and robustness across multiple microscopy image datasets.

Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: 30 pages including supporting information; 15 figures for main context, 5 figures for supporting information; 5 tables; 5 equations in main, 12 in supporting information

arXiv:2407.11031 [pdf, other]

Purification Of Contaminated Convolutional Neural Networks Via Robust Recovery: An Approach with Theoretical Guarantee in One-Hidden-Layer Case

Authors: Hanxiao Lu, Zeyu Huang, Ren Wang

Abstract: Convolutional neural networks (CNNs), one of the key architectures of deep learning models, have achieved superior performance on many machine learning tasks such as image classification, video recognition, and power systems. Despite their success, CNNs can be easily contaminated by natural noises and artificially injected noises such as backdoor attacks. In this paper, we propose a robust recovery method to remove the noise from the potentially contaminated CNNs and provide an exact recovery guarantee on one-hidden-layer non-overlapping CNNs with the rectified linear unit (ReLU) activation function. Our theoretical results show that both CNNs' weights and biases can be exactly recovered under the overparameterization setting with some mild assumptions. The experimental results demonstrate the correctness of the proofs and the effectiveness of the method in both the synthetic environment and the practical neural network setting. Our results also indicate that the proposed method can be extended to multiple-layer CNNs and potentially serve as a defense strategy against backdoor attacks.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.09251 [pdf, other]

Deep Adversarial Defense Against Multilevel-Lp Attacks

Authors: Ren Wang, Yuxuan Li, Alfred Hero

Abstract: Deep learning models have shown considerable vulnerability to adversarial attacks, particularly as attacker strategies become more sophisticated. While traditional adversarial training (AT) techniques offer some resilience, they often focus on defending against a single type of attack, e.g., the $\ell_\infty$-norm attack, which can fail for other types. This paper introduces a computationally efficient multilevel $\ell_p$ defense, called the Efficient Robust Mode Connectivity (EMRC) method, which aims to enhance a deep learning model's resilience against multiple $\ell_p$-norm attacks. Similar to analytical continuation approaches used in continuous optimization, the method blends two $p$-specific adversarially optimal models, the $\ell_1$- and $\ell_\infty$-norm AT solutions, to provide good adversarial robustness for a range of $p$. We present experiments demonstrating that our approach performs better on various attacks as compared to AT-$\ell_\infty$, E-AT, and MSD, for datasets/architectures including: CIFAR-10, CIFAR-100 / PreResNet110, WideResNet, ViT-Base.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08093 [pdf, other]

MemWarp: Discontinuity-Preserving Cardiac Registration with Memorized Anatomical Filters

Authors: Hang Zhang, Xiang Chen, Renjiu Hu, Dongdong Liu, Gaolei Li, Rongguang Wang

Abstract: Many existing learning-based deformable image registration methods impose constraints on deformation fields to ensure they are globally smooth and continuous. However, this assumption does not hold in cardiac image registration, where different anatomical regions exhibit asymmetric motions during respiration and movements due to sliding organs within the chest. Consequently, such global constraints fail to accommodate local discontinuities across organ boundaries, potentially resulting in erroneous and unrealistic displacement fields. In this paper, we address this issue with MemWarp, a learning framework that leverages a memory network to store prototypical information tailored to different anatomical regions. MemWarp is different from earlier approaches in two main aspects: firstly, by decoupling feature extraction from similarity matching in moving and fixed images, it facilitates more effective utilization of feature maps; secondly, despite its capability to preserve discontinuities, it eliminates the need for segmentation masks during model inference. In experiments on a publicly available cardiac dataset, our method achieves considerable improvements in registration accuracy and produces realistic deformations, outperforming state-of-the-art methods with a remarkable 7.1\% Dice score improvement over the runner-up semi-supervised method. Source code will be available at https://github.com/tinymilky/Mem-Warp.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 11 pages, 2 figures, 2 tables

arXiv:2407.03896 [pdf, other]

Specification-guided temporal logic control for stochastic systems: a multi-layered approach

Authors: Birgit C. van Huijgevoort, Ruohan Wang, Sadegh Soudjani, Sofie Haesaert

Abstract: Designing controllers to satisfy temporal requirements has proven to be challenging for dynamical systems that are affected by uncertainty. This is mainly due to the states evolving in a continuous uncountable space, the stochastic evolution of the states, and infinite-horizon temporal requirements on the system evolution, all of which make closed-form solutions generally inaccessible. A promising approach for designing provably correct controllers on such systems is to utilize the concept of abstraction, which is based on building simplified abstract models that can be used to approximate optimal controllers with provable closeness guarantees. The available abstraction-based methods are further divided into discretization-based approaches that build a finite abstract model by discretizing the continuous space of the system, and discretization-free approaches that work directly on the continuous state space without the need for building a finite space. To reduce the conservatism in the sub-optimality of the designed controller originating from the abstraction step, this paper develops an approach that naturally has the flexibility to combine different abstraction techniques from the aforementioned classes and to combine the same abstraction technique with different parameters. First, we develop a multi-layered discretization-based approach with variable precision by combining abstraction layers with different precision parameters. Then, we exploit the advantages of both classes of abstraction-based methods by extending this multi-layered approach guided by the specification to combinations of layers with respectively discretization-based and discretization-free abstractions. We achieve an efficient implementation that is less conservative and improves the computation time and memory usage. We illustrate the benefits of the proposed multi-layered approach on several case studies.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.19677 [pdf, other]

End-to-End Uplink Performance Analysis of Satellite-Based IoT Networks: A Stochastic Geometry Approach

Authors: Jiusi Zhou, Ruibo Wang, Basem Shihada, Mohamed-Slim Alouini

Abstract: With the deployment of satellite constellations, Internet-of-Things (IoT) devices in remote areas have gained access to low-cost network connectivity. In this paper, we investigate the performance of IoT devices connecting in up-link through low Earth orbit (LEO) satellites to geosynchronous equatorial orbit (GEO) links. We model the dynamic LEO satellite constellation using the stochastic geometry method and provide an analysis of end-to-end availability with low-complexity and coverage performance estimates for the mentioned link. Based on the analytical expressions derived in this research, we make a sound investigation on the impact of constellation configuration, transmission power, and the relative positions of IoT devices and GEO satellites on end-to-end performance.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.11265 [pdf, ps, other]

Balancing Performance and Cost for Two-Hop Cooperative Communications: Stackelberg Game and Distributed Multi-Agent Reinforcement Learning

Authors: Yuanzhe Geng, Erwu Liu, Wei Ni, Rui Wang, Yan Liu, Hao Xu, Chen Cai, Abbas Jamalipour

Abstract: This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays form an alliance in an attempt to maximize the benefit of relaying while the source aims to increase the channel capacity cost-effectively. To this end, we establish the trade problem as a Stackelberg game, and prove the existence of its equilibrium. Another important aspect is that we use multi-agent reinforcement learning (MARL) to approach the equilibrium in a situation where the instantaneous channel state information (CSI) is unavailable, and the source and relays do not have knowledge of each other's goal. A multi-agent deep deterministic policy gradient-based framework is designed, where the relay alliance and the source act as agents. Experiments demonstrate that the proposed method can obtain an acceptable performance that is close to the game-theoretic equilibrium for all players under time-invariant environments, which considerably outperforms its potential alternatives and is only about 2.9% away from the optimal solution.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.08200 [pdf, other]

Asynchronous Voice Anonymization Using Adversarial Perturbation On Speaker Embedding

Authors: Rui Wang, Liping Chen, Kong AiK Lee, Zhen-Hua Ling

Abstract: Voice anonymization has been developed as a technique for preserving privacy by replacing the speaker's voice in a speech signal with that of a pseudo-speaker, thereby obscuring the original voice attributes from machine recognition and human perception. In this paper, we focus on altering the voice attributes against machine recognition while retaining human perception. We refer to this as asynchronous voice anonymization. To this end, a speech generation framework incorporating a speaker disentanglement mechanism is employed to generate the anonymized speech. The speaker attributes are altered through adversarial perturbation applied on the speaker embedding, while human perception is preserved by controlling the intensity of perturbation. Experiments conducted on the LibriSpeech dataset showed that the speaker attributes were obscured with their human perception preserved for 60.71% of the processed utterances.

Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: accepted by Interspeech 2024

arXiv:2406.07061 [pdf, other]

Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments

Authors: Gan Gao, Andrew H. Song, Fiona Wang, David Brenes, Rui Wang, Sarah S. L. Chow, Kevin W. Bishop, Lawrence D. True, Faisal Mahmood, Jonathan T. C. Liu

Abstract: Accurate patient diagnoses based on human tissue biopsies are hindered by current clinical practice, where pathologists assess only a limited number of thin 2D tissue slices sectioned from 3D volumetric tissue. Recent advances in non-destructive 3D pathology, such as open-top light-sheet microscopy, enable comprehensive imaging of spatially heterogeneous tissue morphologies, offering the feasibility to improve diagnostic determinations. A potential early route towards clinical adoption for 3D pathology is to rely on pathologists for final diagnosis based on viewing familiar 2D H&E-like image sections from the 3D datasets. However, manual examination of the massive 3D pathology datasets is infeasible. To address this, we present CARP3D, a deep learning triage approach that automatically identifies the highest-risk 2D slices within 3D volumetric biopsy, enabling time-efficient review by pathologists. For a given slice in the biopsy, we estimate its risk by performing attention-based aggregation of 2D patches within each slice, followed by pooling of the neighboring slices to compute a context-aware 2.5D risk score. For prostate cancer risk stratification, CARP3D achieves an area under the curve (AUC) of 90.4% for triaging slices, outperforming methods relying on independent analysis of 2D sections (AUC=81.3%). These results suggest that integrating additional depth context enhances the model's discriminative capabilities. In conclusion, CARP3D has the potential to improve pathologist diagnosis via accurate triage of high-risk slices within large-volume 3D pathology datasets.

Submitted 11 June, 2024; originally announced June 2024.

Comments: CVPR CVMI 2024

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6955-6965

arXiv:2406.04737 [pdf, other]

doi 10.1109/TWC.2024.3425473

Fast-Fading Channel and Power Optimization of the Magnetic Inductive Cellular Network

Authors: Honglei Ma, Erwu Liu, Zhijun Fang, Rui Wang, Yongbin Gao, Wenjun Yu, Dongming Zhang

Abstract: The cellular network of magnetic induction (MI) communication holds promise in long-distance underground environments. In the traditional MI communication, there is no fast-fading channel since the MI channel is treated as a quasi-static channel. However, for the vehicle (mobile) MI (VMI) communication, the unpredictable antenna vibration brings remarkable fast-fading. As such fast-fading cannot be modeled by the central limit theorem, it differs radically from other wireless fast-fading channels. Unfortunately, few studies focus on this phenomenon. In this paper, using a novel space modeling based on the electromagnetic field theorem, we propose a 3-dimensional model of the VMI antenna vibration. By proposing "conjugate pseudo-piecewise functions" and boundary $p(x)$ distribution, we derive the cumulative distribution function (CDF), probability density function (PDF) and the expectation of the VMI fast-fading channel. We also theoretically analyze the effects of the VMI fast-fading on the network throughput, including the VMI outage probability which can be ignored in the traditional MI channel study. We draw several intriguing conclusions different from those in wireless fast-fading studies. For instance, the fast-fading brings more uniformly distributed channel coefficients. Finally, we propose the power control algorithm using the non-cooperative game and multiagent Q-learning methods to optimize the throughput of the cellular VMI network. Simulations validate the derivation and the proposed algorithm.

Submitted 7 July, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: This work has been accepted by the IEEE TWC for publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2406.01419 [pdf, other]

High-performance magnetostatic wave resonators through deep anisotropic etching of GGG substrates

Authors: Sudhanshu Tiwari, Anuj Ashok, Connor Devitt, Sunil A. Bhave, Renyuan Wang

Abstract: Microscale resonators are fundamental and necessary building blocks for modern radio communication filters for mobile devices. The resonator's Q factor ($Q$) determines the insertion loss while coupling ($k_t^2$) governs the fractional bandwidth. The product $k_t^2 \times Q$ is widely recognized as the definitive figure of merit for microresonators. Magnetostatic wave resonators based on Yttrium Iron Garnet (YIG) are a promising technology platform for future communication filters. They have shown considerably better performance in terms of $Q$ when compared to the commercially successful acoustic resonators in the $>$7 GHz range. However, the coupling coefficients of these resonators have been limited to $<$3 \%, primarily due to the restricted design space imposed by microfabrication challenges related to the patterning of gadolinium gallium garnet (GGG), the substrate material used for growing single crystal YIG. This paper reports novel resonator designs enabled by breakthrough bulk micromachining technology for anisotropic etching of GGG, leading to coupling $>$8 \% in the 6-20 GHz frequency range. We use the same technology platform to show resonant enhancement of effective coupling, reaching up to 23 \% at 10.5 GHz. The frequency of resonant coupling can be tuned by design during the fabrication process. The resonant coupling results in an unprecedented $k_t^2 \times Q$ figure of merit of 191 at 10.5 GHz and 222 at 14.7 GHz. The technology platform presented in this paper supports both tunable filter architecture and switched filter banks that are currently being used in consumer mobile devices.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.17366 [pdf, other]

EM-GANSim: Real-time and Accurate EM Simulation Using Conditional GANs for 3D Indoor Scenes

Authors: Ruichen Wang, Dinesh Manocha

Abstract: We present a novel machine-learning (ML) approach (EM-GANSim) for real-time electromagnetic (EM) propagation that is used for wireless communication simulation in 3D indoor environments. Our approach uses a modified conditional Generative Adversarial Network (GAN) that incorporates encoded geometry and transmitter location while adhering to the electromagnetic propagation theory. The overall physically-inspired learning is able to predict the power distribution in 3D scenes, which is represented using heatmaps. Our overall accuracy is comparable to ray tracing-based EM simulation, as evidenced by lower mean squared error values. Furthermore, our GAN-based method drastically reduces the computation time, achieving a 5X speedup on complex benchmarks. In practice, it can compute the signal strength in a few milliseconds on any location in 3D indoor environments. We also present a large dataset of 3D models and EM ray tracing-simulated heatmaps. To the best of our knowledge, EM-GANSim is the first real-time algorithm for EM simulation in complex 3D indoor environments. We plan to release the code and the dataset.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 10 pages, 8 figures, 5 tables

arXiv:2405.15927 [pdf]

Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum"

Authors: MHD Anas Alsakkal, Runze Wang, Jayawan Wijekoon, Huajin Tang

Abstract: Spike-based encoders represent information as sequences of spikes or pulses, which are transmitted between neurons. A prevailing consensus suggests that spike-based approaches demonstrate exceptional capabilities in capturing the temporal dynamics of neural activity and have the potential to provide energy-efficient solutions for low-power applications. The Spiketrum encoder efficiently compresses input data using spike trains or code sets (for non-spiking applications) and is adaptable to both hardware and software implementations, with lossless signal reconstruction capability. The paper proposes and assesses Spiketrum's hardware, evaluating its output under varying spike rates and its classification performance with popular spiking and non-spiking classifiers, and also assessing the quality of information compression and hardware resource utilization. The paper extensively benchmarks both Spiketrum hardware and its software counterpart against state-of-the-art, biologically-plausible encoders. The evaluations encompass benchmarking criteria, including classification accuracy, training speed, and sparsity when using encoder outputs in pattern recognition and classification with both spiking and non-spiking classifiers. Additionally, they consider encoded output entropy and hardware resource utilization and power consumption of the hardware version of the encoders. Results demonstrate Spiketrum's superiority in most benchmarking criteria, making it a promising choice for various applications. It efficiently utilizes hardware resources with low power consumption, achieving high classification accuracy. This work also emphasizes the potential of encoders in spike-based processing to improve the efficiency and performance of neural computing systems.

Submitted 31 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: To be published in "IEEE/ACM Transactions on Audio, Speech, and Language Processing"

arXiv:2405.15863 [pdf, other]

Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Authors: Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang

Abstract: In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering a novel approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, which often constitutes only a fraction of available datasets. Within open-source datasets, the prevalence of issues like mislabeling, weak labeling, unlabeled data, and low-quality music waveform significantly hampers the development of music generation models. To overcome these challenges, we introduce a novel quality-aware masked diffusion transformer (QA-MDT) approach that enables generative models to discern the quality of input music waveform during training. Building on the unique properties of musical signals, we have adapted and implemented an MDT model for the TTM task, while further unveiling its distinct capacity for quality control. Moreover, we address the issue of low-quality captions with a caption refinement data processing approach. Our demo page is available at https://qa-mdt.github.io/. Code is available at https://github.com/ivcylc/qa-mdt

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.13339 [pdf, other]

Floor-Plan-aided Indoor Localization: Zero-Shot Learning Framework, Data Sets, and Prototype

Authors: Haiyao Yu, Changyang She, Yunkai Hu, Geng Wang, Rui Wang, Branka Vucetic, Yonghui Li

Abstract: Machine learning has been considered a promising approach for indoor localization. Nevertheless, the sample efficiency, scalability, and generalization ability remain open issues of implementing learning-based algorithms in practical systems. In this paper, we establish a zero-shot learning framework that does not need real-world measurements in a new communication environment. Specifically, a graph neural network that is scalable to the number of access points (APs) and mobile devices (MDs) is used for obtaining coarse locations of MDs. Based on the coarse locations, the floor-plan image between an MD and an AP is exploited to improve localization accuracy in a floor-plan-aided deep neural network. To further improve the generalization ability, we develop a synthetic data generator that provides synthetic data samples in different scenarios, where real-world samples are not available. We implement the framework in a prototype that estimates the locations of MDs. Experimental results show that our zero-shot learning method can reduce localization errors by around $30$\% to $55$\% compared with three baselines from the existing literature.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.11432 [pdf, other]

On Robust Reinforcement Learning with Lipschitz-Bounded Policy Networks

Authors: Nicholas H. Barbara, Ruigang Wang, Ian R. Manchester

Abstract: This paper presents a study of robust policy networks in deep reinforcement learning. We investigate the benefits of policy parameterizations that naturally satisfy constraints on their Lipschitz bound, analyzing their empirical performance and robustness on two representative problems: pendulum swing-up and Atari Pong. We illustrate that policy networks with small Lipschitz bounds are significantly more robust to disturbances, random noise, and targeted adversarial attacks than unconstrained policies composed of vanilla multi-layer perceptrons or convolutional neural networks. Moreover, we find that choosing a policy parameterization with a non-conservative Lipschitz bound and an expressive, nonlinear layer architecture gives the user much finer control over the performance-robustness trade-off than existing state-of-the-art methods based on spectral normalization.

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.11115 [pdf]

Ptychographic non-line-of-sight imaging for depth-resolved visualization of hidden objects

Authors: Pengming Song, Qianhao Zhao, Ruihai Wang, Ninghe Liu, Yingqi Qiang, Tianbo Wang, Xincheng Zhang, Yi Zhang, Liangcai Cao, Guoan Zheng

Abstract: Non-line-of-sight (NLOS) imaging enables the visualization of objects hidden from direct view, with applications in surveillance, remote sensing, and light detection and ranging. Here, we introduce an NLOS imaging technique termed ptychographic NLOS (pNLOS), which leverages coded ptychography for depth-resolved imaging of obscured objects. Our approach involves scanning a laser spot on a wall to illuminate the hidden objects in an obscured region. The reflected wavefields from these objects then travel back to the wall and are modulated by the wall's complex-valued profile, and the resulting diffraction patterns are captured by a camera. By modulating the object wavefields, the wall surface serves the role of the coded layer as in coded ptychography. As we scan the laser spot to different positions, the reflected object wavefields on the wall translate accordingly, with the shifts varying for objects at different depths. This translational diversity enables the acquisition of a set of modulated diffraction patterns referred to as a ptychogram. By processing the ptychogram, we recover both the objects at different depths and the modulation profile of the wall surface. Experimental results demonstrate high-resolution, high-fidelity imaging of hidden objects, showcasing the potential of pNLOS for depth-aware vision beyond the direct line of sight.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.06186 [pdf, other]

Sensing-Assisted Adaptive Channel Contention for Mobile Delay-Sensitive Communications

Authors: Bojie Lv, Qianren Li, Rui Wang

Abstract: This paper proposes an adaptive channel contention mechanism to optimize the queuing performance of a distributed millimeter wave (mmWave) uplink system with the capability of environment and mobility sensing. The mobile agents determine their back-off timer parameters according to their local knowledge of the uplink queue lengths, channel quality, and future channel statistics, where the channel prediction relies on the environment and mobility sensing. The optimization of queuing performance with this adaptive channel contention mechanism is formulated as a decentralized multi-agent Markov decision process (MDP). Although the channel contention actions are determined locally at the mobile agents, the optimization of local channel contention policies of all mobile agents is conducted in a centralized manner according to the system statistics before the scheduling. In the solution, the local policies are approximated by analytical models, and the optimization of their parameters becomes a stochastic optimization problem along an adaptive Markov chain. An unbiased gradient estimation is proposed so that the local policies can be optimized efficiently via the stochastic gradient descent method. It is demonstrated by simulation that the proposed gradient estimation is significantly more efficient in optimization than the existing methods, e.g., simultaneous perturbation stochastic approximation (SPSA).

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.19646 [pdf]

A Fully Screen-Printed Vanadium-Dioxide Switches Based Wideband Reconfigurable Intelligent Surface for 5G Bands

Authors: Yiming Yang, Mohammad Vaseem, Ruiqi Wang, Behrooz Makki, Atif Shamim

Abstract: Reconfigurable Intelligent Surface (RIS) is attracting more and more research interest because of its ability to reprogram the radio environment. Designing and implementing the RIS, however, is challenging because of limitations of printed circuit board (PCB) technology related to manufacturing of large sizes as well as the cost of switches. Thus, a low-cost manufacturing process suitable for large size and volume of devices, such as screen printing, is necessary. In this paper, for the first time, a fully screen-printed reconfigurable intelligent surface (RIS) with vanadium dioxide (VO2) switches for 5G and beyond communications is proposed. A VO2 ink has been prepared and batches of switches have been printed and integrated with the resonator elements. These switches are a fraction of the cost of commercial switches. Furthermore, the printing of these switches directly on metal patterns negates the need for any minute soldering of the switches. To avoid the complications of multilayer printing and to realize the RIS without vias, the resonators and the biasing lines are realized on a single layer. However, this introduces the challenge of interference between the biasing lines and the resonators, which is tackled in this work by designing the bias lines as part of the resonator. By adjusting the unit cell periodicity and the dimension of the H-shaped resonator, we achieve a 220 to 170° phase shift from 23.5 GHz to 29.5 GHz covering both n257 and n258 bands. Inside the wide bandwidth, the maximum ON reflection magnitude is 74%, and the maximum OFF magnitude is 94%. The RIS array comprises 20x20 unit cells (4.54x4.54λ^2 at 29.5 GHz). Each column of unit cells is serially connected to a current biasing circuit. To validate the array's performance, we conduct full-wave simulations as well as near-field and far-field measurements.

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.15830 [pdf, other]

SNR Maximization and Localization for UAV-IRS-Assisted Near-Field Systems

Authors: Hanfu Zhang, Yidan Mei, Erwu Liu, Rui Wang

Abstract: This letter introduces a novel unmanned aerial vehicle (UAV)-intelligent reflecting surface (IRS) structure into near-field localization systems to enhance the design flexibility of IRS, thereby obtaining additional performance gains. Specifically, a UAV-IRS is utilized to improve the harsh wireless environment and provide localization possibilities. To improve the localization accuracy, a joint optimization problem considering UAV position and UAV-IRS passive beamforming is formulated to maximize the receiving signal-to-noise ratio (SNR). An alternative optimization algorithm is proposed to solve the complex non-convex problem leveraging the projected gradient ascent (PGA) algorithm and the principle of minimizing the phase difference of the receiving signals. Closed-form expressions for UAV-IRS phase shift are derived to reduce the algorithm complexity. In the simulations, the proposed algorithm is compared with three different schemes and outperforms the others in both receiving SNR and localization accuracy.

Submitted 24 April, 2024; originally announced April 2024.

Comments: 5 pages, 3 figures

arXiv:2404.15761 [pdf, other]

Rechargeable UAV Trajectory Optimization for Real-Time Persistent Data Collection of Large-Scale Sensor Networks

Authors: Rui Wang, Deshi Li, Qingqing Wu, Kaitao Meng, Boning Feng, Lele Cong

Abstract: Unmanned aerial vehicles (UAVs) have received plenty of attention due to their high flexibility and enhanced communication ability; nonetheless, the limited onboard energy restricts UAVs' application to persistent data collection missions in large areas. In this paper, we propose a rechargeable UAV-assisted periodic data collection scheme, where a UAV is dispatched to periodically collect data from sensor nodes (SNs) in the mission area and charged by a wireless charging platform. Specifically, the periodic data collection completion time is minimized by optimizing the UAV trajectory to reach the optimal balance among the collection time, flight time, and recharging time. The formulated problem is non-convex and difficult to solve directly. To tackle this problem, we divide the main problem into two sub-problems and address them by leveraging successive convex approximation (SCA), bisection search, and heuristic methods. Then, we propose a periodic trajectory optimization algorithm to iteratively solve the two sub-problems to minimize the completion time. Furthermore, to deal with the dynamics of SNs, we propose a low-complexity trajectory adjustment strategy, where the trajectory can be maintained or adjusted locally when the SNs change, which significantly mitigates the computation cost of re-optimization. The simulation results show the superiority and robustness of the proposed scheme, and the completion time is on average 39% and 33% lower than the two benchmarks, respectively.

Submitted 6 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: 13 pages, 17 figures, submitted to IEEE for possible publication

arXiv:2404.15584 [pdf]

Research on OPF control of three-phase four-wire low-voltage distribution network considering uncertainty

Authors: Rui Wang, Xiaoqing Bai, Shengquan Huang, Shoupu Wei

Abstract: As power systems become more complex and uncertain, low-voltage distribution networks face numerous challenges, including three-phase imbalances caused by asymmetrical loads and distributed energy resources. To address these issues, we propose a robust stochastic optimization (RSO) based optimal power flow (OPF) control method for three-phase, four-wire low-voltage distribution networks that considers uncertainty. Using historical data and deep learning classification methods, the proposed method simulates optimal system behaviour without requiring communication infrastructure. The simulation results verify that the proposed method effectively controls the voltage and current amplitude while minimizing the operational cost and three-phase imbalance within acceptable limits. The proposed method shows promise for managing uncertainties and optimizing performance in low-voltage distribution networks.

Submitted 23 April, 2024; originally announced April 2024.

Comments: systems optimization, robust optimization, local control

arXiv:2404.12554 [pdf, other]

Learning Stable and Passive Neural Differential Equations

Authors: Jing Cheng, Ruigang Wang, Ian R. Manchester

Abstract: In this paper, we introduce a novel class of neural differential equations, which are intrinsically Lyapunov stable, exponentially stable or passive. We take a recently proposed Polyak Lojasiewicz network (PLNet) as a Lyapunov function and then parameterize the vector field as the descent directions of the Lyapunov function. The resulting models have the same structure as the general Hamiltonian dynamics, where the Hamiltonian is lower- and upper-bounded by quadratic functions. Moreover, it is also positive definite w.r.t. either a known or learnable equilibrium. We illustrate the effectiveness of the proposed model on a damped double pendulum system.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.12077 [pdf, other]

TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches

Authors: Rong Wang, Kun Sun

Abstract: This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and challenges of multi-task learning versus single-task models. The motivation for this research is twofold: firstly, to empirically assess the advantages and drawbacks of multi-task learning over single-task models in the context of speaker profiling; secondly, to emphasize the undiminished significance of skillful feature engineering for speaker recognition tasks. The findings reveal challenges in accent classification, and multi-task learning is found advantageous for tasks of similar complexity. Non-sequential features are favored for speaker recognition, but sequential ones can serve as starting points for complex models. The study underscores the necessity of meticulous experimentation and parameter tuning for deep learning models.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.09969 [pdf, other]

Reconstructing classes of 3D FRI signals from sampled tomographic projections at unknown angles

Authors: Renke Wang, Francien G. Bossema, Thierry Blu, Pier Luigi Dragotti

Abstract: Traditional sampling schemes often assume that the sampling locations are known. Motivated by the recent bioimaging technique known as cryogenic electron microscopy (cryoEM), we consider the problem of reconstructing an unknown 3D structure from samples of its 2D tomographic projections at unknown angles. We focus on 3D convex bilevel polyhedra and 3D point sources and show that the exact estimation of these 3D structures and of the projection angles can be achieved up to an orthogonal transformation. Moreover, we are able to show that the minimum number of projections needed to achieve perfect reconstruction is independent of the complexity of the signal model. By using the divergence theorem, we are able to retrieve the projected vertices of the polyhedron from the sampled tomographic projections, and then we show how to retrieve the 3D object and the projection angles from this information. The proof of our theorem is constructive and leads to a robust reconstruction algorithm, which we validate under various conditions. Finally, we apply aspects of the proposed framework to calibration of X-ray computed tomography (CT) data.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.02461 [pdf, other]

On the Efficiency and Robustness of Vibration-based Foundation Models for IoT Sensing: A Case Study

Authors: Tomoyoshi Kimura, Jinyang Li, Tianshi Wang, Denizhan Kara, Yizhuo Chen, Yigong Hu, Ruijie Wang, Maggie Wigness, Shengzhong Liu, Mani Srivastava, Suhas Diggavi, Tarek Abdelzaher

Abstract: This paper demonstrates the potential of vibration-based Foundation Models (FMs), pre-trained with unlabeled sensing data, to improve the robustness of run-time inference in (a class of) IoT applications. A case study is presented featuring a vehicle classification application using acoustic and seismic sensing. The work is motivated by the success of foundation models in the areas of natural language processing and computer vision, leading to generalizations of the FM concept to other domains as well, where significant amounts of unlabeled data exist that can be used for self-supervised pre-training. One such domain is IoT applications. Foundation models for selected sensing modalities in the IoT domain can be pre-trained in an environment-agnostic fashion using available unlabeled sensor data and then fine-tuned to the deployment at hand using a small amount of labeled data. The paper shows that the pre-training/fine-tuning approach improves the robustness of downstream inference and facilitates adaptation to different environmental conditions. More specifically, we present a case study in a real-world setting to evaluate a simple (vibration-based) FM-like model, called FOCAL, demonstrating its superior robustness and adaptation, compared to conventional supervised deep neural networks (DNNs). We also demonstrate its superior convergence over supervised solutions. Our findings highlight the advantages of vibration-based FMs (and FM-inspired self-supervised models in general) in terms of inference robustness, runtime efficiency, and model adaptation (via fine-tuning) in resource-limited IoT settings.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.02159 [pdf, other]

Fairness-aware Age-of-Information Minimization in WPT-Assisted Short-Packet THz Communications for mURLLC

Authors: Yao Zhu, Xiaopeng Yuan, Yulin Hu, Bo Ai, Ruikang Wang, Bin Han, Anke Schmeink

Abstract: The technological landscape is swiftly advancing towards large-scale systems, creating significant opportunities, particularly in the domain of Terahertz (THz) communications. Networks designed for massive connectivity, comprising numerous Internet of Things (IoT) devices, are at the forefront of this advancement. In this paper, we consider Wireless Power Transfer (WPT)-enabled networks that support these IoT devices with massive Ultra-Reliable and Low-Latency Communication (mURLLC) services. The focus of such networks is information freshness, with the Age-of-Information (AoI) serving as the pivotal performance metric. In particular, we aim to minimize the maximum AoI among IoT devices by optimizing the scheduling policy. Our analytical findings establish the convexity property of the problem, which can be solved efficiently. Furthermore, we introduce the concept of AoI-oriented cluster capacity, examining the relationship between the number of supported devices and the AoI performance in the network. Numerical simulations validate the advantage of our proposed approach in enhancing AoI performance, indicating its potential to guide the design of future THz communication systems for IoT applications requiring mURLLC services.

Submitted 15 February, 2024; originally announced April 2024.

arXiv:2403.17275 [pdf]

200Gb/s VCSEL transmission using 60m OM4 MMF and KP4 FEC for AI computing clusters

Authors: Tom Wettlin, Youxi Lin, Nebojsa Stojanovic, Stefano Calabrò, Ruoxu Wang, Lewei Zhang, Maxim Kuschnerov

Abstract: We show a beyond 200Gb/s VCSEL transmission experiment. Results are based on 35GHz VCSEL and advanced DSP. We show an AIR of 245Gb/s PAM-6 back-to-back, and 200Gb/s PAM-4 over 60m OM4 fiber assuming KP4-FEC.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.11091 [pdf, other]

Multitask frame-level learning for few-shot sound event detection

Authors: Liang Zou, Genwei Yan, Ruoyu Wang, Jun Du, Meng Lei, Tian Gao, Xin Fang

Abstract: This paper focuses on few-shot Sound Event Detection (SED), which aims to automatically recognize and classify sound events with limited samples. However, prevailing methods in few-shot SED predominantly rely on segment-level predictions, which often fail to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome these limitations, these strategies commonly face difficulties with prediction truncation caused by background noise. To alleviate this issue, we introduce an innovative multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, securing the 1st rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events Challenge 2023.

Submitted 17 March, 2024; originally announced March 2024.

Comments: 6 pages, 4 figures, conference

arXiv:2403.09302 [pdf, other]

StainFuser: Controlling Diffusion for Faster Neural Style Transfer in Multi-Gigapixel Histology Images

Authors: Robert Jewsbury, Ruoyu Wang, Abhir Bhalerao, Nasir Rajpoot, Quoc Dang Vu

Abstract: Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M, the largest stain normalization dataset to date, of over 2 million histology images with neural style transfer for high-quality transformations. Trained on this data, StainFuser outperforms current state-of-the-art deep learning and handcrafted methods in terms of the quality of normalized images and in terms of downstream model performance on the CoNIC dataset.

Submitted 12 July, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.04245 [pdf, other]

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Authors: Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma, Haotian Wang, Chin-Hui Lee

Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR

Submitted 7 March, 2024; originally announced March 2024.

Comments: the paper is accepted by CVPR2024

arXiv:2403.02942 [pdf, other]

Tensor Decomposition-based Time Varying Channel Estimation for mmWave MIMO-OFDM Systems

Authors: Ruizhe Wang, Hong Ren, Cunhua Pan, Gui Zhou, Jiangzhou Wang

Abstract: In this paper, we consider the time-varying channel estimation in millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems with hybrid beamforming architectures. Different from the existing contributions that considered single-carrier mmWave systems with high mobility, the wideband orthogonal frequency division multiplexing (OFDM) system is considered in this work. To solve the channel estimation problem under channel double selectivity, we propose a pilot transmission scheme based on 5G OFDM, and the received signals are formed as a fourth-order tensor, which fits the low-rank CANDECOMP/PARAFAC (CP) model. By further exploring the Vandermonde structure of the factor matrix, a tensor-subspace decomposition based channel estimation method is proposed to solve the CP decomposition, where the uniqueness condition is analyzed. Based on the decomposed factor matrices, the channel parameters, including angles of arrival/departure, delays, channel gains and Doppler shifts, are estimated, and the Cramér-Rao bound (CRB) results are derived as performance metrics. Simulation results demonstrate the superior performance of the proposed method over other benchmarks. Furthermore, the channel estimation methods are tested based on the channel parameters generated by Wireless InSites, and simulation results show the effectiveness of the proposed method in practical scenarios.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.00897 [pdf, other]

VisRec: A Semi-Supervised Approach to Radio Interferometric Data Reconstruction

Authors: Ruoqi Wang, Haitao Wang, Qiong Luo, Feng Wang, Hejun Wu

Abstract: Radio telescopes produce visibility data about celestial objects, but these data are sparse and noisy. As a result, images created on raw visibility data are of low quality. Recent studies have used deep learning models to reconstruct visibility data to get cleaner images. However, these methods rely on a substantial amount of labeled training data, which requires significant labeling effort from radio astronomers. Addressing this challenge, we propose VisRec, a model-agnostic semi-supervised learning approach to the reconstruction of visibility data. Specifically, VisRec consists of both a supervised learning module and an unsupervised learning module. In the supervised learning module, we introduce a set of data augmentation functions to produce diverse training examples. In comparison, the unsupervised learning module in VisRec augments unlabeled data and uses reconstructions from non-augmented visibility data as pseudo-labels for training. This hybrid approach allows VisRec to effectively leverage both labeled and unlabeled data. This way, VisRec performs well even when labeled data is scarce. Our evaluation results show that VisRec outperforms all baseline methods in reconstruction quality, robustness against common observation perturbation, and generalizability to different telescope configurations.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.12820 [pdf, other]

ASCEND: Accurate yet Efficient End-to-End Stochastic Computing Acceleration of Vision Transformer

Authors: Tong Xie, Yixuan Hu, Renjie Wei, Meng Li, Yuan Wang, Runsheng Wang, Ru Huang

Abstract: Stochastic computing (SC) has emerged as a promising computing paradigm for neural acceleration. However, how to accelerate the state-of-the-art Vision Transformer (ViT) with SC remains unclear. Unlike convolutional neural networks, ViTs introduce notable compatibility and efficiency challenges because of their nonlinear functions, e.g., softmax and Gaussian Error Linear Units (GELU). In this paper, for the first time, a ViT accelerator based on end-to-end SC, dubbed ASCEND, is proposed. ASCEND co-designs the SC circuits and ViT networks to enable accurate yet efficient acceleration. To overcome the compatibility challenges, ASCEND proposes a novel deterministic SC block for GELU and leverages an SC-friendly iterative approximate algorithm to design an accurate and efficient softmax circuit. To improve inference efficiency, ASCEND develops a two-stage training pipeline to produce accurate low-precision ViTs. With extensive experiments, we show the proposed GELU and softmax blocks achieve 56.3% and 22.6% error reduction compared to existing SC designs, respectively and reduce the area-delay product (ADP) by 5.29x and 12.6x, respectively. Moreover, compared to the baseline low-precision ViTs, ASCEND also achieves significant accuracy improvements on CIFAR10 and CIFAR100.

Submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted in DATE 2024

arXiv:2402.11186 [pdf, other]

Low-Dose CT Reconstruction Using Dataset-free Learning

Authors: Feng Wang, Renfang Wang, Hong Qiu

Abstract: Low-dose computed tomography (LDCT) is an ideal alternative to reduce radiation risk in clinical applications. Although supervised-deep-learning-based reconstruction methods have demonstrated superior performance compared to conventional model-driven reconstruction algorithms, they require collecting massive pairs of low-dose and normal-dose CT images for neural network training, which limits their practical application in LDCT imaging. In this paper, we propose an unsupervised and training data-free learning reconstruction method for LDCT imaging that avoids the requirement for training data. The proposed method is a post-processing technique that aims to enhance the initial low-quality reconstruction results, and it reconstructs the high-quality images by neural network training that minimizes the $\ell_1$-norm distance between the CT measurements and their corresponding simulated sinogram data, as well as the total variation (TV) value of the reconstructed image. Moreover, the proposed method does not require setting the weights for the data fidelity term and the penalty term. Experimental results on the AAPM challenge data and LoDoPab-CT data demonstrate that the proposed method is able to effectively suppress the noise and preserve the tiny structures. Also, these results demonstrate the rapid convergence and low computational cost of the proposed method. The source code is available at https://github.com/linfengyu77/IRLDCT.

Submitted 22 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.04097 [pdf, other]

Analysis of Deep Image Prior and Exploiting Self-Guidance for Image Reconstruction

Authors: Shijun Liang, Evan Bell, Qing Qu, Rongrong Wang, Saiprasad Ravishankar

Abstract: The ability of deep image prior (DIP) to recover high-quality images from incomplete or corrupted measurements has made it popular in inverse problems in image restoration and medical imaging including magnetic resonance imaging (MRI). However, conventional DIP suffers from severe overfitting and spectral bias effects. In this work, we first provide an analysis of how DIP recovers information from undersampled imaging measurements by analyzing the training dynamics of the underlying networks in the kernel regime for different architectures. This study sheds light on important underlying properties for DIP-based recovery. Current research suggests that incorporating a reference image as network input can enhance DIP's performance in image reconstruction compared to using random inputs. However, obtaining suitable reference images requires supervision, and raises practical difficulties. In an attempt to overcome this obstacle, we further introduce a self-driven reconstruction process that concurrently optimizes both the network weights and the input while eliminating the need for training data. Our method incorporates a novel denoiser regularization term which enables robust and stable joint estimation of both the network input and reconstructed image. We demonstrate that our self-guided method surpasses both the original DIP and modern supervised methods in terms of MR image reconstruction performance and outperforms previous DIP-based schemes for image inpainting.

Submitted 7 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.15984 [pdf]

Choroidal thinning assessment through facial video analysis

Authors: Qinghua He, Yi Zhang, Mengxi Shen, Giovanni Gregori, Philip J. Rosenfeld, Ruikang K. Wang

Abstract: Different features of skin are associated with various medical conditions and provide opportunities to evaluate and monitor body health. This study created a strategy to assess choroidal thinning through the video analysis of facial skin. Videos capturing the entire facial skin were collected from 48 participants with age-related macular degeneration (AMD) and 12 healthy individuals. These facial videos were analyzed using video-based trans-angiosomes imaging photoplethysmography (TaiPPG) to generate facial imaging biomarkers that were correlated with choroidal thickness (CT) measurements. The CT of all patients was determined using swept-source optical coherence tomography (SS-OCT). The results revealed the relationship between relative blood pulsation amplitude (BPA) in three typical facial angiosomes (cheek, side-forehead and mid-forehead) and the average macular CT (r = 0.48, p < 0.001; r = -0.56, p < 0.001; r = -0.40, p < 0.01). When considering a diagnostic threshold of 200 μm, the newly developed facial video analysis tool effectively distinguished between cases of choroidal thinning and normal cases, yielding areas under the curve of 0.75, 0.79 and 0.69. These findings shed light on the connection between choroidal blood flow and facial skin hemodynamics, which suggests the potential for predicting vascular diseases through widely accessible skin imaging data.

Submitted 29 January, 2024; originally announced January 2024.

Comments: 8 pages, 4 figures

arXiv:2401.11270 [pdf, other]

RoTIR: Rotation-Equivariant Network and Transformers for Fish Scale Image Registration

Authors: Ruixiong Wang, Alin Achim, Renata Raele-Rolfe, Qiao Tong, Dylan Bergen, Chrissy Hammond, Stephen Cross

Abstract: Image registration is an essential process for aligning features of interest from multiple images. With the recent development of deep learning techniques, image registration approaches have advanced to a new level. In this work, we present 'Rotation-Equivariant network and Transformers for Image Registration' (RoTIR), a deep-learning-based method for the alignment of fish scale images captured by light microscopy. This approach overcomes the challenge of arbitrary rotation and translation detection, as well as the absence of ground truth data. We employ feature-matching approaches based on Transformers and general E(2)-equivariant steerable CNNs for model creation. Besides, an artificial training dataset is employed for semi-supervised learning. Results show RoTIR successfully achieves the goal of fish scale image registration.

Submitted 27 July, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: 7 pages, 4 figures, 2 tables

arXiv:2401.09833 [pdf, other]

Slicer Networks

Authors: Hang Zhang, Xiang Chen, Rongguang Wang, Renjiu Hu, Dongdong Liu, Gaolei Li

Abstract: In medical imaging, scans often reveal objects with varied contrasts but consistent internal intensities or textures. This characteristic enables the use of low-frequency approximations for tasks such as segmentation and deformation field estimation. Yet, integrating this concept into neural network architectures for medical image analysis remains underexplored. In this paper, we propose the Slicer Network, a novel architecture designed to leverage these traits. Comprising an encoder utilizing models like vision transformers for feature extraction and a slicer employing a learnable bilateral grid, the Slicer Network strategically refines and upsamples feature maps via a splatting-blurring-slicing process. This introduces an edge-preserving low-frequency approximation for the network outcome, effectively enlarging the effective receptive field. The enhancement not only reduces computational complexity but also boosts overall performance. Experiments across different medical imaging applications, including unsupervised and keypoints-based image registration and lesion segmentation, have verified the Slicer Network's improved accuracy and efficiency.

Submitted 18 January, 2024; originally announced January 2024.

Comments: 8 figures and 3 tables

arXiv:2401.09517 [pdf]

Dimensional Neuroimaging Endophenotypes: Neurobiological Representations of Disease Heterogeneity Through Machine Learning

Authors: Junhao Wen, Mathilde Antoniades, Zhijian Yang, Gyujoon Hwang, Ioanna Skampardoni, Rongguang Wang, Christos Davatzikos

Abstract: Machine learning has been increasingly used to obtain individualized neuroimaging signatures for disease diagnosis, prognosis, and response to treatment in neuropsychiatric and neurodegenerative disorders. Therefore, it has contributed to a better understanding of disease heterogeneity by identifying disease subtypes that present significant differences in various brain phenotypic measures. In this review, we first present a systematic literature overview of studies using machine learning and multimodal MRI to unravel disease heterogeneity in various neuropsychiatric and neurodegenerative disorders, including Alzheimer disease, schizophrenia, major depressive disorder, autism spectrum disorder, multiple sclerosis, as well as their potential in transdiagnostic settings. Subsequently, we summarize relevant machine learning methodologies and discuss an emerging paradigm which we call dimensional neuroimaging endophenotype (DNE). DNE dissects the neurobiological heterogeneity of neuropsychiatric and neurodegenerative disorders into a low dimensional yet informative, quantitative brain phenotypic representation, serving as a robust intermediate phenotype (i.e., endophenotype) largely reflecting underlying genetics and etiology. Finally, we discuss the potential clinical implications of the current findings and envision future research avenues.

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.08154 [pdf, ps, other]

TLIC: Learned Image Compression with ROI-Weighted Distortion and Bit Allocation

Authors: Wei Jiang, Yongqi Zhai, Hangyu Li, Ronggang Wang

Abstract: This short paper describes our method for the image compression track. To achieve better perceptual quality, we use an adversarial loss to generate realistic textures and a region-of-interest (ROI) mask to guide the bit allocation for different regions. Our team name is TLIC.

Submitted 23 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: 2nd Place in the Image Compression Track, CLIC 2024, DCC 2024

arXiv:2401.07754 [pdf, ps, other]

Passive Beamforming For Practical RIS-Assisted Communication Systems With Non-Ideal Hardware

Authors: Yiming Liu, Rui Wang, Zhu Han

Abstract: Reconfigurable intelligent surface (RIS) technology is a promising solution to improve the performance of existing wireless communications. To achieve its cost-effectiveness advantage, there inevitably exist certain hardware impairments in the system. Therefore, it is more reasonable to design passive beamforming in this scenario. Some existing research has considered such problems under transceiver impairments. However, their performance still leaves room for improvement, possibly due to their algorithms not properly handling the fractional structure of the objective function. To address this, the passive beamforming is redesigned in this correspondence paper, taking into account both transceiver impairments and the practical phase-shift model. We tackle the fractional structure of the problem by employing the quadratic transform. The remaining sub-problems are addressed utilizing the penalty-based method and the difference-of-convex programming. Since we provide closed-form solutions for all sub-problems, our algorithm is highly efficient. The simulation results demonstrate the superiority of our proposed algorithm.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2401.07446 [pdf, other]

Quantized RIS-aided mmWave Massive MIMO Channel Estimation with Uniform Planar Arrays

Authors: Ruizhe Wang, Hong Ren, Cunhua Pan, Shi Jin, Petar Popovski, Jiangzhou Wang

Abstract: In this paper, we investigate a cascaded channel estimation method for a millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) system aided by a reconfigurable intelligent surface (RIS) with the BS equipped with low-resolution analog-to-digital converters (ADCs), where the BS and the RIS are both equipped with a uniform planar array (UPA). Due to the sparse property of the mmWave channel, the channel estimation can be solved as a compressed sensing (CS) problem. However, the low-resolution quantization causes severe information loss of signals, and traditional CS algorithms are unable to work well. To recover the signal and the sparse angular domain channel from quantization, we introduce Bayesian inference and an efficient vector approximate message passing (VAMP) algorithm to solve the quantized-output CS problem. To further improve the efficiency of the VAMP algorithm, a Fast Fourier Transform (FFT) based fast computation method is derived. Simulation results demonstrate the effectiveness and the accuracy of the proposed cascaded channel estimation method for the RIS-aided mmWave massive MIMO system with few-bit ADCs. Furthermore, the proposed channel estimation method can reach an acceptable performance gap between the low-resolution ADCs and the infinite-resolution ADCs for the low signal-to-noise ratio (SNR), which implies the applicability of few-bit ADCs in practice.

Submitted 14 January, 2024; originally announced January 2024.

arXiv:2401.00246 [pdf, other]

Boosting Large Language Model for Speech Synthesis: An Empirical Study

Authors: Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei

Abstract: Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision. Nevertheless, most of the previous work focuses on prompting LLMs with perception abilities like auditory comprehension, and the effective approach for augmenting LLMs with speech synthesis capabilities remains ambiguous. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining pre-trained LLM LLaMA/OPT and text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models, including directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using LLMs as a powerful text encoder. Experimental results show that, using LoRA method to fine-tune LLMs directly to boost the speech synthesis capability does not work well, and superposed LLMs and VALL-E can improve the quality of generated speech both in speaker similarity and word error rate (WER). Among these three methods, coupled methods leveraging LLMs as the text encoder can achieve the best performance, making it outperform original speech synthesis models with a consistently better speaker similarity and a significant (10.9%) WER reduction.

Submitted 30 December, 2023; originally announced January 2024.

arXiv:2312.12534 [pdf, other]

Near-Field Localization and Phase Shift Optimization for RIS-Assisted Non-Ideal OFDM Systems

Authors: Hanfu Zhang, Erwu Liu, Rui Wang, Zhe Xing, Yan Liu

Abstract: By incorporating reconfigurable intelligent surface (RIS) into communication-assisted localization systems, the issue of signal blockage caused by obstacles can be addressed, and passive beamforming can be employed to enhance localization accuracy. However, existing works mainly consider ideal channels and do not account for the effects of realistic impairments like carrier frequency offset (CFO) and phase noise (PN) on localization. This paper proposes an iterative joint estimation algorithm for CFO, PN, and user position based on maximum a posteriori (MAP) criterion and gradient descent (GD) algorithm. Closed-form expressions for CFO and PN updates are provided. The hybrid Cramér-Rao lower bound (HCRLB) for the estimation parameters is derived, and the ambiguity in CFO and PN estimation is analyzed. To minimize the HCRLB, a non-convex RIS shift optimization problem is formulated and is transformed into a convex semidefinite programming (SDP) problem using the technique of semidefinite relaxation (SDR) and Schur complement. After optimizing the RIS phase shift, the theoretical positioning accuracy within the area of interest (AOI) can be improved by two orders of magnitude, with a maximum positioning root mean square error (RMSE) lower than $10^{-2}$ m.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 11 pages, 11 figures

arXiv:2312.09659 [pdf, ps, other]

A Near Field Low Time Complexity Beam Training Scheme Based on Spatial Orthogonal Decomposition

Authors: Xiyuan Liu, Qingqing Wu, Rui Wang, Jun Wu

Abstract: With the application of high-frequency communication and extremely large MIMO (XL-MIMO), the near-field effect has become increasingly apparent. The near-field beam design now requires consideration not only of the angle of arrival (AoA) information but also the curvature of arrival (CoA) information. However, due to their mutual coupling, orthogonally decomposing the near-field space becomes challenging. In this paper, we propose a Joint Autocorrelation and Cross-correlation (JAC) scheme to address the coupling information between near-field CoA and AoA. First, we analyze the similarity between the near-field problem and the Doppler problem in digital signal processing, revealing that the autocorrelation function can effectively extract CoA information. Subsequently, utilizing the obtained CoA, we transform the near-field problem into a far-field form, enabling the direct application of beam training schemes designed for the far-field in the near-field scenario. Finally, we analyze the characteristics of the far and near-field signal subspaces from the perspective of matrix theory and discuss how the JAC algorithm handles them. Numerical results demonstrate that the JAC scheme outperforms traditional methods in the high signal-to-noise ratio (SNR) regime. Moreover, the time complexity of the JAC algorithm is $\mathcal O(N+1)$, significantly smaller than existing near-field beam training algorithms.

Submitted 15 December, 2023; originally announced December 2023.

Comments: 11 pages with double column, 7 figures

arXiv:2312.08571 [pdf, other]

PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

Authors: Chengxi Lei, Satwinder Singh, Feng Hou, Xiaoyun Jia, Ruili Wang

Abstract: Most of the current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating a phase by a constant degree, PhasePerturbation utilizes three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them with the PhasePerturbation augmented TIMIT corpus. The experimental results demonstrate a 10.9% relative reduction in the word error rate (WER) compared with the baseline model fine-tuned without any augmentation operation. Furthermore, the proposed method achieves additional improvements (12.9% and 15.9%) in WER by complementing the Vocal Tract Length Perturbation (VTLP) and the SpecAug, which are both amplitude spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve the current amplitude spectrum-based augmentation methods.

Submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.01573 [pdf]

Survey on deep learning in multimodal medical imaging for cancer detection

Authors: Yan Tian, Zhaocheng Xu, Yujun Ma, Weiping Ding, Ruili Wang, Zhihong Gao, Guohua Cheng, Linyang He, Xuran Zhao

Abstract: The task of multimodal cancer detection is to determine the locations and categories of lesions by using different imaging techniques, which is one of the key research methods for cancer diagnosis. Recently, deep learning-based object detection has made significant developments due to its strength in semantic feature extraction and nonlinear function fitting. However, multimodal cancer detection remains challenging due to morphological differences in lesions, interpatient variability, difficulty in annotation, and imaging artifacts. In this survey, we mainly investigate over 150 papers in recent years with respect to multimodal cancer detection using deep learning, with a focus on datasets and solutions to various challenges such as data annotation, variance between classes, small-scale lesions, and occlusion. We also provide an overview of the advantages and drawbacks of each approach. Finally, we discuss the current scope of work and provide directions for the future development of multimodal cancer detection.

Submitted 3 December, 2023; originally announced December 2023.

Journal ref: Neural Computing and Applications. 2023 Nov 29:1-6

arXiv:2311.17065 [pdf, other]

Efficient Deep Speech Understanding at the Edge

Authors: Rongxiang Wang, Felix Xiaozhu Lin

Abstract: In contemporary speech understanding (SU), a sophisticated pipeline is employed, encompassing the ingestion of streaming voice input. The pipeline executes beam search iteratively, invoking a deep neural network to generate tentative outputs (referred to as hypotheses) in an autoregressive manner. Periodically, the pipeline assesses attention and Connectionist Temporal Classification (CTC) scores. This paper aims to enhance SU performance on edge devices with limited resources. Adopting a hybrid strategy, our approach focuses on accelerating on-device execution and offloading inputs surpassing the device's capacity. While this approach is established, we tackle SU's distinctive challenges through innovative techniques: (1) Late Contextualization: This involves the parallel execution of a model's attentive encoder during input ingestion. (2) Pilot Inference: Addressing temporal load imbalances in the SU pipeline, this technique aims to mitigate them effectively. (3) Autoregression Offramps: Decisions regarding offloading are made solely based on hypotheses, presenting a novel approach. These techniques are designed to seamlessly integrate with existing speech models, pipelines, and frameworks, offering flexibility for independent or combined application. Collectively, they form a hybrid solution for edge SU. Our prototype, named XYZ, has undergone testing on Arm platforms featuring 6 to 8 cores, demonstrating state-of-the-art accuracy. Notably, it achieves a 2x reduction in end-to-end latency and a corresponding 2x decrease in offloading requirements.

Submitted 4 December, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

Showing 1–50 of 268 results for author: Wang, R