Showing 1–50 of 101 results for author: Pan, J

Searching in archive eess.
  1. arXiv:2407.20962  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

    Authors: Xiaowei Chi, Yatian Wang, Aosong Cheng, Pengjun Fang, Zeyue Tian, Yingqing He, Zhaoyang Liu, Xingqun Qi, Jiahao Pan, Rongyu Zhang, Mengfei Li, Ruibin Yuan, Yanbing Jiang, Wei Xue, Wenhan Luo, Qifeng Chen, Shanghang Zhang, Qifeng Liu, Yike Guo

    Abstract: Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be weakly related information. They usually overlook exploring the potential of inherent audio-visual correlation, leading to monotonous annotation within each modalit…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: 15 Pages. Dataset report

  2. arXiv:2407.20108  [pdf, other]

    eess.IV cs.AI cs.CV

    Classification, Regression and Segmentation directly from k-Space in Cardiac MRI

    Authors: Ruochen Li, Jiazhen Pan, Youxiang Zhu, Juncheng Ni, Daniel Rueckert

    Abstract: Cardiac Magnetic Resonance Imaging (CMR) is the gold standard for diagnosing cardiovascular diseases. Clinical diagnoses predominantly rely on magnitude-only Digital Imaging and Communications in Medicine (DICOM) images, omitting crucial phase information that might provide additional diagnostic benefits. In contrast, k-space is complex-valued and encompasses both magnitude and phase information,…

    Submitted 29 July, 2024; originally announced July 2024.

  3. arXiv:2407.19274  [pdf, other]

    cs.CV eess.IV

    Mamba? Catch The Hype Or Rethink What Really Helps for Image Registration

    Authors: Bailiang Jian, Jiazhen Pan, Morteza Ghahremani, Daniel Rueckert, Christian Wachinger, Benedikt Wiestler

    Abstract: Our findings indicate that adopting "advanced" computational elements fails to significantly improve registration accuracy. Instead, well-established registration-specific designs offer fair improvements, enhancing results by a marginal 1.5% over the baseline. Our findings emphasize the importance of rigorous, unbiased evaluation and contribution disentanglement of all low- and high-level registr…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: WBIR 2024 Workshop on Biomedical Imaging Registration

  4. arXiv:2406.16872  [pdf, other]

    eess.SP cs.AI

    Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

    Authors: Jianguo Pan, Zhengxin Hu, Lingdun Zhang, Xia Cai

    Abstract: Sensor-based human activity recognition is important in daily scenarios such as smart healthcare and homes due to its non-intrusive privacy and low cost advantages, but the problem of out-of-domain generalization caused by differences in focusing individuals and operating environments can lead to significant accuracy degradation on cross-person behavior recognition due to the inconsistent distribu…

    Submitted 28 March, 2024; originally announced June 2024.

  5. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  6. arXiv:2406.00329  [pdf, other]

    eess.IV cs.CV cs.LG

    Whole Heart 3D+T Representation Learning Through Sparse 2D Cardiac MR Images

    Authors: Yundi Zhang, Chen Chen, Suprosanna Shit, Sophie Starck, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac Magnetic Resonance (CMR) imaging serves as the gold-standard for evaluating cardiac morphology and function. Typically, a multi-view CMR stack, covering short-axis (SA) and 2/3/4-chamber long-axis (LA) views, is acquired for a thorough cardiac assessment. However, efficiently streamlining the complex, high-dimensional 3D+T CMR data and distilling compact, coherent representation remains a…

    Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

  7. arXiv:2406.00192  [pdf, other]

    eess.IV cs.CV cs.LG

    Direct Cardiac Segmentation from Undersampled K-space Using Transformers

    Authors: Yundi Zhang, Nil Stolt-Ansó, Jiazhen Pan, Wenqi Huang, Kerstin Hammernik, Daniel Rueckert

    Abstract: The prevailing deep learning-based methods of predicting cardiac segmentation involve reconstructed magnetic resonance (MR) images. The heavy dependency of segmentation approaches on image quality significantly limits the acceleration rate in fast MR reconstruction. Moreover, the practice of treating reconstruction and segmentation as separate sequential processes leads to artifact generation and…

    Submitted 31 May, 2024; originally announced June 2024.

  8. arXiv:2405.16952  [pdf, other]

    eess.AS

    A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition

    Authors: Zilu Guo, Qing Wang, Jun Du, Jia Pan, Qing-Feng Liu, Chin-Hui

    Abstract: In this paper, we propose a variance-preserving interpolation framework to improve diffusion models for single-channel speech enhancement (SE) and automatic speech recognition (ASR). This new variance-preserving interpolation diffusion model (VPIDM) approach requires only 25 iterative steps and obviates the need for a corrector, an essential element in the existing variance-exploding interpolation…

    Submitted 27 May, 2024; originally announced May 2024.

  9. arXiv:2404.18081  [pdf, other]

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    ComposerX: Multi-Agent Symbolic Music Composition with LLMs

    Authors: Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo

    Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and C…

    Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

  10. arXiv:2404.17621  [pdf, other]

    eess.IV cs.CV cs.LG

    Attention-aware non-rigid image registration for accelerated MR imaging

    Authors: Aya Ghoul, Jiazhen Pan, Andreas Lingg, Jens Kübler, Patrick Krumm, Kerstin Hammernik, Daniel Rueckert, Sergios Gatidis, Thomas Küstner

    Abstract: Accurate motion estimation at high acceleration factors enables rapid motion-compensated reconstruction in Magnetic Resonance Imaging (MRI) without compromising the diagnostic image quality. In this work, we introduce an attention-aware deep learning-based framework that can perform non-rigid pairwise registration for fully sampled and accelerated MRI. We extract local visual representations to bu…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 14 pages, 7 figures

  11. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  12. arXiv:2404.14700  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FlashSpeech: Efficient Zero-Shot Speech Synthesis

    Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

    Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large…

    Submitted 24 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Efficient zero-shot speech synthesis

  13. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  14. arXiv:2404.08857  [pdf, other]

    cs.SD cs.AI eess.AS

    Voice Attribute Editing with Text Prompt

    Authors: Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

    Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this t…

    Submitted 12 April, 2024; originally announced April 2024.

  15. arXiv:2404.01611  [pdf]

    cs.LG cs.SD eess.AS

    Audio Simulation for Sound Source Localization in Virtual Evironment

    Authors: Yi Di Yuan, Swee Liang Wong, Jonathan Pan

    Abstract: Non-line-of-sight localization in signal-deprived environments is a challenging yet pertinent problem. Acoustic methods in such predominantly indoor scenarios encounter difficulty due to the reverberant nature. In this study, we aim to locate sound sources to specific locations within a virtual environment by leveraging physically grounded sound propagation simulations and machine learning methods…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 2024 IEEE World Forum on Public Safety Technology

  16. arXiv:2404.00656  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    WavLLM: Towards Robust and Adaptive Speech Large Language Model

    Authors: Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei

    Abstract: The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In th…

    Submitted 31 March, 2024; originally announced April 2024.

  17. arXiv:2401.12173  [pdf, other]

    eess.SP

    Waveform-Domain Complementary Signal Sets for Interrupted Sampling Repeater Jamming Suppression

    Authors: Hanning Su, Qinglong Bao, Jiameng Pan, Fucheng Guo, Weidong Hu

    Abstract: The interrupted-sampling repeater jamming (ISRJ) is coherent and has the characteristic of suppression and deception to degrade the radar detection capabilities. The study focuses on anti-ISRJ techniques in the waveform domain, primarily capitalizing on waveform design and anti-jamming signal processing methods in the waveform domain. By exploring the relationship between waveform-domain adapt…

    Submitted 18 January, 2024; originally announced January 2024.

  18. arXiv:2311.15309  [pdf, other]

    eess.IV

    Deep Refinement-Based Joint Source Channel Coding over Time-Varying Channels

    Authors: Junyu Pan, Hanlei Li, Guangyi Zhang, Yunlong Cai, Guanding Yu

    Abstract: In recent developments, deep learning (DL)-based joint source-channel coding (JSCC) for wireless image transmission has made significant strides in performance enhancement. Nonetheless, the majority of existing DL-based JSCC methods are tailored for scenarios featuring stable channel conditions, notably a fixed signal-to-noise ratio (SNR). This specialization poses a limitation, as their performan…

    Submitted 26 November, 2023; originally announced November 2023.

  19. arXiv:2311.02248  [pdf, other]

    cs.CL cs.AI eess.AS

    COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

    Authors: Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

    Abstract: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters…

    Submitted 14 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

  20. arXiv:2309.16206  [pdf, other]

    eess.IV cs.CV

    Alzheimer's Disease Prediction via Brain Structural-Functional Deep Fusing Network

    Authors: Qiankun Zuo, Junren Pan, Shuqiang Wang

    Abstract: Fusing structural-functional images of the brain has shown great potential to analyze the deterioration of Alzheimer's disease (AD). However, it is a big challenge to effectively fuse the correlated and complementary information from multimodal neuroimages. In this paper, a novel model termed cross-modal transformer generative adversarial network (CT-GAN) is proposed to effectively fuse the functi…

    Submitted 5 October, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: 10 pages

  21. arXiv:2309.14087  [pdf, other]

    eess.SP

    Adaptive Three Layer Hybrid Reconfigurable Intelligent Surface for 6G Wireless Communication: Trade-offs and Performance

    Authors: Rashed Hasan Ratul, Muhammad Iqbal, Tabinda Ashraf, Jen-Yi Pan, Yi-Han Wang, Shao-Yu Lien

    Abstract: Reconfigurable Intelligent Surface (RIS) has been recognized as a potential candidate technology for the development of future 6G networks. However, due to the variation in radio link quality, traditional passive RISs only accomplish a minimal signal gain in situations with strong direct links between user equipment (UE) and base station (BS). In order to get over this fundamental restriction of s…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted for presentation and publication at the 8th IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob)

  22. arXiv:2309.10832  [pdf, ps, other]

    cs.SD eess.AS

    Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

    Authors: Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel speech enhancement and often takes short-time Fourier Transform (STFT) as inputs directly. To fully leverage the spatial information, we introduce a method usin…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: text overlap with arXiv:2309.10393

  23. arXiv:2309.10393  [pdf, ps, other]

    cs.SD eess.AS

    Hierarchical Modeling of Spatial Cues via Spherical Harmonics for Multi-Channel Speech Enhancement

    Authors: Jiahui Pan, Shulin He, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement utilizes spatial information from multiple microphones to extract the target speech. However, most existing methods do not explicitly model spatial cues, instead relying on implicit learning from multi-channel spectra. To better leverage spatial information, we propose explicitly incorporating spatial modeling by applying spherical harmonic transforms (SHT) to the…

    Submitted 19 September, 2023; originally announced September 2023.

  24. arXiv:2309.10379  [pdf, ps, other]

    cs.SD eess.AS

    PDPCRN: Parallel Dual-Path CRN with Bi-directional Inter-Branch Interactions for Multi-Channel Speech Enhancement

    Authors: Jiahui Pan, Shulin He, Tianci Wu, Hui Zhang, Xueliang Zhang

    Abstract: Multi-channel speech enhancement seeks to utilize spatial information to distinguish target speech from interfering signals. While deep learning approaches like the dual-path convolutional recurrent network (DPCRN) have made strides, challenges persist in effectively modeling inter-channel correlations and amalgamating multi-level information. In response, we introduce the Parallel Dual-Path Convo…

    Submitted 19 September, 2023; originally announced September 2023.

  25. arXiv:2309.08643  [pdf, other]

    eess.IV

    NISF: Neural Implicit Segmentation Functions

    Authors: Nil Stolt-Ansó, Julian McGinnis, Jiazhen Pan, Kerstin Hammernik, Daniel Rueckert

    Abstract: Segmentation of anatomical shapes from medical images has taken an important role in the automation of clinical measurements. While typical deep-learning segmentation approaches are performed on discrete voxels, the underlying objects being analysed exist in a real-valued continuous space. Approaches that rely on convolutional neural networks (CNNs) are limited to grid-like inputs and not easily a…

    Submitted 14 September, 2023; originally announced September 2023.

  26. arXiv:2309.08348  [pdf, other]

    eess.AS cs.SD

    The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

    Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

    Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  27. arXiv:2309.01994  [pdf, other]

    eess.SY

    Cloud Control of Connected Vehicle under Bi-directional Time-varying delay: An Application of Predictor-observer Structured Controller

    Authors: Ji-An Pan, Qing Xu, Keqiang Li, Chunying Yang, Jianqiang Wang

    Abstract: This article is devoted to addressing the cloud control of connected vehicles, specifically focusing on analyzing the effect of bi-directional communication-induced delays. To mitigate the adverse effects of such delays, a novel predictor-observer structured controller is proposed that compensates for both measurable output delays and unmeasurable, yet bounded, input delays simultaneously. The stu…

    Submitted 9 December, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

  28. arXiv:2308.14638  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

    Authors: Ruoyu Wang, Maokui He, Jun Du, Hengshun Zhou, Shutong Niu, Hang Chen, Yanyan Yue, Gaobin Yang, Shilong Wu, Lei Sun, Yanhui Tu, Haitao Tang, Shuangqing Qian, Tian Gao, Mengzhi Wang, Genshun Wan, Jia Pan, Jianqing Gao, Chin-Hui Lee

    Abstract: This technical report details our submission system to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition under complex multi-speaker scenarios. Additionally, it also evaluates the efficiency of systems in handling diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy base…

    Submitted 10 October, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted by 2023 CHiME Workshop, Oral

  29. arXiv:2308.10488  [pdf, other]

    eess.IV cs.CV

    Enhancing Medical Image Segmentation: Optimizing Cross-Entropy Weights and Post-Processing with Autoencoders

    Authors: Pranav Singh, Luoyao Chen, Mei Chen, Jinqian Pan, Raviteja Chukkapalli, Shravan Chaudhari, Jacopo Cirrone

    Abstract: The task of medical image segmentation presents unique challenges, necessitating both localized and holistic semantic understanding to accurately delineate areas of interest, such as critical tissues or aberrant features. This complexity is heightened in medical image segmentation due to the high degree of inter-class similarities, intra-class variations, and possible image obfuscation. The segmen…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV CVAMD 2023

  30. arXiv:2307.12672  [pdf, other]

    eess.IV cs.CV cs.LG

    Global k-Space Interpolation for Dynamic MRI Reconstruction using Masked Image Modeling

    Authors: Jiazhen Pan, Suprosanna Shit, Özgün Turgut, Wenqi Huang, Hongwei Bran Li, Nil Stolt-Ansó, Thomas Küstner, Kerstin Hammernik, Daniel Rueckert

    Abstract: In dynamic Magnetic Resonance Imaging (MRI), k-space is typically undersampled due to limited scan time, resulting in aliasing artifacts in the image domain. Hence, dynamic MR reconstruction requires not only modeling spatial frequency components in the x and y directions of k-space but also considering temporal redundancy. Most previous works rely on image-domain regularizers (priors) to conduct…

    Submitted 18 October, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

  31. arXiv:2307.12397  [pdf]

    cs.NI eess.AS

    Performance Comparison Between VoLTE and non-VoLTE Voice Calls During Mobility in Commercial Deployment: A Drive Test-Based Analysis

    Authors: Rashed Hasan Ratul, Muhammad Iqbal, Jen-Yi Pan, Mohammad Mahadi Al Deen, Mohammad Tawhid Kawser, Mohammad Masum Billah

    Abstract: The optimization of network performance is vital for the delivery of services using standard cellular technologies for mobile communications. Call setup delay and User Equipment (UE) battery savings significantly influence network performance. Improving these factors is vital for ensuring optimal service delivery. In comparison to traditional circuit-switched voice calls, VoLTE (Voice over LTE) te…

    Submitted 23 July, 2023; originally announced July 2023.

    Comments: Accepted for presentation and publication at the IEEE 10th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI 2023)

  32. arXiv:2307.03368  [pdf, other]

    eess.SP

    Waveform-Domain Adaptive Matched Filtering for Suppressing Interrupted-Sampling Repeater Jamming

    Authors: Hanning Su, Qinglong Bao, Jiameng Pan, Fucheng Guo, Weidong Hu

    Abstract: The inadequate adaptability to flexible interference scenarios remains an unresolved challenge in the majority of techniques utilized for mitigating interrupted-sampling repeater jamming (ISRJ). For methods based on matched filtering systems, it is desirable to incorporate anti-ISRJ measures based on prior ISRJ modeling, either preceding or succeeding the matched filtering. Due to the partial matching nature…

    Submitted 13 November, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

  33. arXiv:2306.17103  [pdf, other]

    cs.CL cs.SD eess.AS

    LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

    Authors: Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wei Xue, Yike Guo

    Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language mo…

    Submitted 25 July, 2024; v1 submitted 29 June, 2023; originally announced June 2023.

    Comments: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023

  34. arXiv:2303.15065  [pdf, other]

    eess.IV cs.CV

    Single-subject Multi-contrast MRI Super-resolution via Implicit Neural Representations

    Authors: Julian McGinnis, Suprosanna Shit, Hongwei Bran Li, Vasiliki Sideri-Lampretsa, Robert Graf, Maik Dannecker, Jiazhen Pan, Nil Stolt Ansó, Mark Mühlau, Jan S. Kirschke, Daniel Rueckert, Benedikt Wiestler

    Abstract: Clinical routine and retrospective cohorts commonly include multi-parametric Magnetic Resonance Imaging; however, they are mostly acquired in different anisotropic 2D views due to signal-to-noise-ratio and scan-time constraints. Thus acquired views suffer from poor out-of-plane resolution and affect downstream volumetric image analysis that typically requires isotropic 3D scans. Combining differen…

    Submitted 4 January, 2024; v1 submitted 27 March, 2023; originally announced March 2023.

  35. arXiv:2302.02504  [pdf, other]

    eess.IV cs.CV cs.LG

    Reconstruction-driven motion estimation for motion-compensated MR CINE imaging

    Authors: Jiazhen Pan, Wenqi Huang, Daniel Rueckert, Thomas Küstner, Kerstin Hammernik

    Abstract: In cardiac CINE, motion-compensated MR reconstruction (MCMR) is an effective approach to address highly undersampled acquisitions by incorporating motion information between frames. In this work, we propose a deep learning-based framework to address the MCMR problem efficiently. Contrary to state-of-the-art (SOTA) MCMR methods which break the original problem into two sub-optimization problems, i.…

    Submitted 5 February, 2023; originally announced February 2023.

  36. Learning-based Predictive Path Following Control for Nonlinear Systems Under Uncertain Disturbances

    Authors: Rui Yang, Lei Zheng, Jiesen Pan, Hui Cheng

    Abstract: Accurate path following is challenging for autonomous robots operating in uncertain environments. Adaptive and predictive control strategies are crucial for a nonlinear robotic system to achieve high-performance path following control. In this paper, we propose a novel learning-based predictive control scheme that couples a high-level model predictive path following controller (MPFC) with a low-le…

    Submitted 26 December, 2022; originally announced December 2022.

    Comments: 8 pages, 7 figures, accepted for publication in IEEE Robotics and Automation Letters (Volume: 6, Issue: 2, April 2021)

  37. arXiv:2212.08479  [pdf, other]

    eess.IV cs.CV cs.LG eess.SP

    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging

    Authors: Wenqi Huang, Hongwei Li, Jiazhen Pan, Gastao Cruz, Daniel Rueckert, Kerstin Hammernik

    Abstract: In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-spa…

    Submitted 17 June, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

  38. arXiv:2212.05805  [pdf, other]

    cs.CL cs.SD eess.AS

    Direct Speech-to-speech Translation without Textual Annotation using Bottleneck Features

    Authors: Junhui Zhang, Junjie Pan, Xiang Yin, Zejun Ma

    Abstract: Speech-to-speech translation directly translates a speech utterance to another between different languages, and has great potential in tasks such as simultaneous interpretation. State-of-the-art models usually contain an auxiliary module for phoneme sequence prediction, and this requires textual annotation of the training dataset. We propose a direct speech-to-speech translation model which can be t…

    Submitted 12 December, 2022; originally announced December 2022.

    Comments: 4 pages, 3 figures

  39. arXiv:2212.03482  [pdf, other]

    eess.AS cs.SD

    Improved Speech Pre-Training with Supervision-Enhanced Acoustic Unit

    Authors: Pengcheng Li, Genshun Wan, Fenglin Ding, Hang Chen, Jianqing Gao, Jia Pan, Cong Liu

    Abstract: Speech pre-training has shown great success in learning useful and general latent representations from large-scale unlabeled data. Based on a well-designed self-supervised learning pattern, pre-trained models can be used to serve lots of downstream speech tasks such as automatic speech recognition. In order to take full advantage of the labeled data in low-resource tasks, we present an improved pre-t…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  40. arXiv:2212.03480  [pdf, other]

    eess.AS cs.SD

    Progressive Multi-Scale Self-Supervised Learning for Speech Recognition

    Authors: Genshun Wan, Tan Liu, Hang Chen, Jia Pan, Cong Liu, Zhongfu Ye

    Abstract: Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In theory, ASR performance could be improved further if the model were dedicated to learning audio content information. To this end, we propose a progressive multi-scale self-supervised learning (PMS-SSL) method, which uses fine-grained target sets to compute SSL loss…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  41. arXiv:2212.03476  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information

    Authors: Fenglin Ding, Genshun Wan, Pengcheng Li, Jia Pan, Cong Liu

    Abstract: Multilingual end-to-end models have shown great improvement over monolingual systems. With the development of pre-training methods for speech, self-supervised multilingual speech representation learning such as XLSR has shown success in improving the performance of multilingual automatic speech recognition (ASR). However, similar to supervised learning, multilingual pre-training may also suffer fr…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  42. arXiv:2212.02782  [pdf, other]

    eess.AS cs.SD

    Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

    Authors: Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu

    Abstract: In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec has a student and a teacher module, in which the student performs a masked latent feature regression task using the multimodal target features generated online by the teacher. The parameters of the teacher model are a momentum update of the student. Since…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  43. arXiv:2211.05910  [pdf, other]

    eess.IV cs.CV

    Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

    Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, Jinwoo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li, et al. (71 additional authors not shown)

    Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

  44. arXiv:2210.00077  [pdf, other]

    eess.AS cs.LG

    E-Branchformer: Branchformer with Enhanced merging for speech recognition

    Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

    Abstract: Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention but they have not managed to match Conformer's performance. The recently introduced Bra…

    Submitted 14 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  45. arXiv:2209.03785  [pdf, other]

    eess.SP cs.AI cs.HC cs.LG

    A Novel Semi-supervised Meta Learning Method for Subject-transfer Brain-computer Interface

    Authors: Jingcong Li, Fei Wang, Haiyun Huang, Feifei Qi, Jiahui Pan

    Abstract: A brain-computer interface (BCI) provides a direct communication pathway between the human brain and external devices. Before a new subject can use a BCI, a calibration procedure is usually required, because the inter- and intra-subject variances are so large that models trained on existing subjects perform poorly on new subjects. Therefore, an effective subject-transfer and calibration method is e…

    Submitted 7 September, 2022; originally announced September 2022.

  46. arXiv:2209.03671  [pdf, other]

    eess.IV cs.CV cs.LG

    Learning-based and unrolled motion-compensated reconstruction for cardiac MR CINE imaging

    Authors: Jiazhen Pan, Daniel Rueckert, Thomas Küstner, Kerstin Hammernik

    Abstract: Motion-compensated MR reconstruction (MCMR) is a powerful concept with considerable potential, consisting of two coupled sub-problems: Motion estimation, assuming a known image, and image reconstruction, assuming known motion. In this work, we propose a learning-based self-supervised framework for MCMR, to efficiently deal with non-rigid motion corruption in cardiac MR imaging. Contrary to convent…

    Submitted 8 September, 2022; originally announced September 2022.

  47. arXiv:2207.07828  [pdf, other]

    cs.CV eess.IV

    Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement

    Authors: Cong Wang, Jinshan Pan, Xiao-Ming Wu

    Abstract: We propose an effective Structural Prior guided Generative Adversarial Transformer (SPGAT) for low-light image enhancement. Our SPGAT mainly contains a generator with two discriminators and a structural prior estimator (SPE). The generator is based on a U-shaped Transformer, which is used to explore non-local information for better restoration of clear images. The SPE is used to explore useful str…

    Submitted 19 July, 2022; v1 submitted 16 July, 2022; originally announced July 2022.

  48. arXiv:2207.00583  [pdf, other]

    eess.IV cs.CV q-bio.NC

    Feature-selected Graph Spatial Attention Network for Addictive Brain-Networks Identification

    Authors: Changwei Gong, Changhong Jing, Junren Pan, Shuqiang Wang

    Abstract: Functional alterations in the relevant neural circuits arise from drug addiction over a certain period, and these significant alterations are also revealed by analyzing fMRI. However, because of fMRI's high dimensionality and poor signal-to-noise ratio, it is challenging to encode efficient and robust brain regional embeddings for both graph-level identification and region-level biomarkers detecti…

    Submitted 5 July, 2022; v1 submitted 29 June, 2022; originally announced July 2022.

  49. arXiv:2206.15192  [pdf]

    cs.LG eess.SY

    Privacy-preserving household load forecasting based on non-intrusive load monitoring: A federated deep learning approach

    Authors: Xinxin Zhou, Jingru Feng, Jian Wang, Jianhong Pan

    Abstract: Load forecasting is essential in the analysis and grid planning of power systems. For this reason, we first propose a household load forecasting method based on federated deep learning and non-intrusive load monitoring (NILM). To the best of our knowledge, this is the first research on federated learning (FL) in household load forecasting based on NILM. In this method, the integrated power is decomposed i…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted by PeerJ Computer Science

  50. arXiv:2206.13393  [pdf, other]

    eess.IV cs.CV cs.LG q-bio.NC

    Cross-Modal Transformer GAN: A Brain Structure-Function Deep Fusing Framework for Alzheimer's Disease

    Authors: Junren Pan, Shuqiang Wang

    Abstract: Cross-modal fusion of different types of neuroimaging data has shown great promise for predicting the progression of Alzheimer's Disease (AD). However, most existing methods applied in neuroimaging cannot efficiently fuse the functional and structural information from multi-modal neuroimages. In this work, a novel cross-modal transformer generative adversarial network (CT-GAN) is proposed to fuse f…

    Submitted 14 July, 2022; v1 submitted 20 June, 2022; originally announced June 2022.