
Showing 1–50 of 431 results for author: Huang, G

Searching in archive cs.
  1. arXiv:2406.13165  [pdf, other]

    eess.IV cs.AI cs.CV cs.RO

    Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World Model

    Authors: Haojun Jiang, Zhenguo Sun, Ning Jia, Meng Li, Yu Sun, Shaqi Luo, Shiji Song, Gao Huang

    Abstract: Echocardiography is the only technique capable of real-time imaging of the heart and is vital for diagnosing the majority of cardiac diseases. However, there is a severe shortage of experienced cardiac sonographers, due to the heart's complex structure and significant operational challenges. To mitigate this situation, we present a Cardiac Copilot system capable of providing real-time probe moveme…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Early Accepted by MICCAI 2024

  2. arXiv:2406.08850  [pdf, other]

    cs.CV

    COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

    Authors: Jiangshan Wang, Yue Ma, Jiayi Guo, Yicheng Xiao, Gao Huang, Xiu Li

    Abstract: Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video in a zero-shot manner. Despite extensive efforts, maintaining the temporal consistency of edited videos remains challenging due to the lack of temporal constraints in the regular T2I diffusion model. To address this issue, we propose COrrespondence-gui…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2406.08526  [pdf, other]

    cs.LG cs.AI cs.DC cs.GT

    IMFL-AIGC: Incentive Mechanism Design for Federated Learning Empowered by Artificial Intelligence Generated Content

    Authors: Guangjing Huang, Qiong Wu, Jingyi Li, Xu Chen

    Abstract: Federated learning (FL) has emerged as a promising paradigm that enables clients to collaboratively train a shared global model without uploading their local data. To alleviate the heterogeneous data quality among clients, artificial intelligence-generated content (AIGC) can be leveraged as a novel data synthesis technique for FL model performance enhancement. Due to various costs incurred by AIGC…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: The paper has been accepted by IEEE Transactions on Mobile Computing

  4. arXiv:2406.05969  [pdf, other]

    cs.RO

    Visual-Inertial SLAM as Simple as A, B, VINS

    Authors: Nathaniel Merrill, Guoquan Huang

    Abstract: We present AB-VINS, a different kind of visual-inertial SLAM system. Unlike most VINS systems which only use hand-crafted techniques, AB-VINS makes use of three different deep networks. Instead of estimating sparse feature positions, AB-VINS only estimates the scale and bias parameters (a and b) of monocular depth maps, as well as other terms to correct the depth using multi-view information which…

    Submitted 15 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

  5. arXiv:2406.05478  [pdf, other]

    cs.CV cs.AI

    Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

    Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang

    Abstract: The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their infe…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted by CVPR 2024

  6. arXiv:2406.04342  [pdf, other]

    cs.CV

    Learning 1D Causal Visual Representation with De-focus Attention Networks

    Authors: Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie Zhou, Jifeng Dai

    Abstract: Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-foc…

    Submitted 6 June, 2024; originally announced June 2024.

  7. arXiv:2406.04295  [pdf, other]

    cs.CV

    Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

    Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

    Abstract: Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditiona…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: GitHub: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment

  8. arXiv:2406.02990  [pdf, other]

    cs.CV

    Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

    Authors: Gexin Huang, Chenfei Wu, Mingjie Li, Xiaojun Chang, Ling Chen, Ying Sun, Shen Zhao, Xiaodan Liang, Liang Lin

    Abstract: Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, and 3 tables

  9. arXiv:2406.01179  [pdf, other]

    cs.CL cs.AI

    Are AI-Generated Text Detectors Robust to Adversarial Perturbations?

    Authors: Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, Zhouwang Yang

    Abstract: The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-cr…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference

  10. arXiv:2406.00210  [pdf, other]

    cs.CV

    A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies

    Authors: Jinchao Zhu, Yuxuan Wang, Siyuan Pan, Pengfei Wan, Di Zhang, Gao Huang

    Abstract: The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. Despite various attempts at sampler optimization, model distillation, and network quantification, these approaches typically maintain the original network architecture. The extensive parameter scale and substantial computational demands have limited research into adjusti…

    Submitted 17 June, 2024; v1 submitted 31 May, 2024; originally announced June 2024.

    Comments: 19 pages, 16 figures, submitted to IEEE Transactions on Neural Networks and Learning Systems

  11. arXiv:2405.20763  [pdf, other]

    cs.LG math.OC stat.ML

    Improving Generalization and Convergence by Enhancing Implicit Regularization

    Authors: Mingze Wang, Haotian He, Jinbo Wang, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, Lei Wu

    Abstract: In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that I…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 35 pages

  12. arXiv:2405.19818  [pdf, other]

    cs.CV cs.AI

    WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark

    Authors: Chunhui Zhang, Li Liu, Guanjie Huang, Hao Wen, Xi Zhou, Yanfeng Wang

    Abstract: Underwater object tracking (UOT) is a foundational task for identifying and tracing submerged entities in underwater video sequences. However, current UOT datasets suffer from limitations in scale, diversity of target categories and scenarios covered, hindering the training and evaluation of modern tracking algorithms. To bridge this gap, we take the first step and introduce WebUOT-1M, i.e., the la…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: GitHub project: https://github.com/983632847/Awesome-Multimodal-Object-Tracking

  13. arXiv:2405.19026  [pdf, other]

    cs.LG cs.AI cs.CL cs.CR

    DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints

    Authors: Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, Gao Huang

    Abstract: Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximiz…

    Submitted 29 May, 2024; originally announced May 2024.

  14. arXiv:2405.17873  [pdf, other]

    cs.CV cs.AI

    MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

    Authors: Tianchen Zhao, Xuefei Ning, Tongcheng Fang, Enshu Liu, Guyue Huang, Zinan Lin, Shengen Yan, Guohao Dai, Yu Wang

    Abstract: Diffusion models have achieved significant visual generation quality. However, their significant computational and memory costs pose challenges for their application on resource-constrained mobile devices or even desktop GPUs. Recent few-step diffusion models reduce the inference time by reducing the denoising steps. However, their memory consumption is still excessive. The Post Training Quantiz…

    Submitted 29 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Project Page: https://a-suozhang.xyz/mixdq.github.io/

  15. arXiv:2405.16605  [pdf, other]

    cs.CV

    Demystify Mamba in Vision: A Linear Attention Perspective

    Authors: Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang

    Abstract: Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with linear attention Transformers, which typically underperform conventional Transformers in practice. By exploring the similar…

    Submitted 26 May, 2024; originally announced May 2024.

  16. arXiv:2405.16114  [pdf, other]

    cs.AI cs.CV cs.LG

    Multi-scale Quaternion CNN and BiGRU with Cross Self-attention Feature Fusion for Fault Diagnosis of Bearing

    Authors: Huanbai Liu, Fanlong Zhang, Yin Tan, Lian Huang, Yan Li, Guoheng Huang, Shenghong Luo, An Zeng

    Abstract: In recent years, deep learning has led to significant advances in bearing fault diagnosis (FD). Most techniques aim to achieve greater accuracy. However, they are sensitive to noise and lack robustness, resulting in insufficient domain adaptation and anti-noise ability. The comparison of studies reveals that giving equal attention to all features does not differentiate their significance. In this…

    Submitted 25 May, 2024; originally announced May 2024.

  17. arXiv:2405.15738  [pdf, other]

    cs.CV

    ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

    Authors: Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng

    Abstract: High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which emplo…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 17 pages

  18. arXiv:2405.14751  [pdf, other]

    cs.LG

    AGILE: A Novel Framework of LLM Agents

    Authors: Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li

    Abstract: We introduce a novel framework of LLM agents named AGILE (AGent that Interacts and Learns from Environments) designed to perform complex conversational tasks with users, leveraging LLMs, memory, tools, and interactions with experts. The agent's abilities include not only conversation but also reflection, utilization of tools, and consultation with experts. We formulate the construction of such an…

    Submitted 23 May, 2024; originally announced May 2024.

  19. arXiv:2405.10874  [pdf, other]

    cs.RO

    Square-Root Inverse Filter-based GNSS-Visual-Inertial Navigation

    Authors: Jun Hu, Xiaoming Lang, Feng Zhang, Yinian Mao, Guoquan Huang

    Abstract: While Global Navigation Satellite System (GNSS) is often used to provide global positioning if available, its intermittency and/or inaccuracy calls for fusion with other sensors. In this paper, we develop a novel GNSS-Visual-Inertial Navigation System (GVINS) that fuses visual, inertial, and raw GNSS measurements within the square-root inverse sliding window filtering (SRI-SWF) framework in a tigh…

    Submitted 17 May, 2024; originally announced May 2024.

  20. arXiv:2405.08768  [pdf, other]

    cs.CV cs.AI cs.LG

    EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

    Authors: Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang

    Abstract: The superior performance of modern visual backbones usually comes with a costly training procedure. We contribute to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns with…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Journal version of arXiv:2211.09703 (ICCV 2023). Code is available at: https://github.com/LeapLabTHU/EfficientTrain

  21. arXiv:2405.04812  [pdf, other]

    cs.RO cs.CV

    General Place Recognition Survey: Towards Real-World Autonomy

    Authors: Peng Yin, Jianhao Jiao, Shiqi Zhao, Lingyun Xu, Guoquan Huang, Howie Choset, Sebastian Scherer, Jianda Han

    Abstract: In the realm of robotics, the quest for achieving real-world autonomy, capable of executing large-scale and long-term operations, has positioned place recognition (PR) as a cornerstone technology. Despite the PR community's remarkable strides over the past two decades, garnering attention from fields like computer vision and robotics, the development of PR methods that sufficiently support real-wo…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 20 pages, 12 figures, under review

  22. arXiv:2405.03520  [pdf, other]

    cs.CV

    Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond

    Authors: Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, Guan Huang

    Abstract: General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical law…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey

  23. How to Gain Commit Rights in Modern Top Open Source Communities?

    Authors: Xin Tan, Yan Gong, Geyu Huang, Haohua Wu, Li Zhang

    Abstract: The success of open source software (OSS) projects relies on voluntary contributions from various community roles. Being a committer signifies gaining trust and higher privileges. Substantial studies have focused on the requirements of becoming a committer, but most of them are based on interviews or several hypotheses, lacking a comprehensive understanding of committers' qualifications. We explore…

    Submitted 16 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: 23 pages, 5 figures, FSE 2024

    Journal ref: Proceedings of the ACM on Software Engineering (PACMSE) Issue FSE 2024

  24. arXiv:2404.17946  [pdf, ps, other]

    cs.IT

    Geometric Characteristics in Phaseless Operator and Structured Matrix Recovery

    Authors: Gao Huang, Song Li

    Abstract: In this paper, we first propose a simple and unified approach to stability of phaseless operator to both amplitude and intensity measurement, both complex and real cases on arbitrary geometric set, thus characterizing the robust performance of phase retrieval via empirical minimization method. The unified analysis involves the random embedding of concave lifting operator on tangent space. Similarl…

    Submitted 14 June, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

    MSC Class: 94A12; 68Q87; 65C50; 60G12

  25. arXiv:2404.15260  [pdf, other]

    quant-ph cs.AR

    Distributed Architecture for FPGA-based Superconducting Qubit Control

    Authors: Neelay Fruitwala, Gang Huang, Yilun Xu, Abhi Rajagopala, Akel Hashim, Ravi K. Naik, Kasra Nowrouzi, David I. Santiago, Irfan Siddiqi

    Abstract: Quantum circuits utilizing real time feedback techniques (such as active reset and mid-circuit measurement) are a powerful tool for NISQ-era quantum computing. Such techniques are crucial for implementing error correction protocols, and can reduce the resource requirements of certain quantum algorithms. Realizing these capabilities requires flexible, low-latency classical control. We have develope…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 10 pages, 13 figures

  26. arXiv:2404.12621  [pdf, other]

    cs.SE

    Research on WebAssembly Runtimes: A Survey

    Authors: Yixuan Zhang, Mugeng Liu, Haoyu Wang, Yun Ma, Gang Huang, Xuanzhe Liu

    Abstract: WebAssembly (abbreviated as Wasm) was initially introduced for the Web but quickly extended its reach into various domains beyond the Web. To create Wasm applications, developers can compile high-level programming languages into Wasm binaries or manually convert equivalent textual formats into Wasm binaries. Regardless of whether it is utilized within or outside the Web, the execution of Wasm bina…

    Submitted 19 April, 2024; originally announced April 2024.

  27. arXiv:2404.10630  [pdf, other]

    cs.CL cs.LG

    HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

    Authors: Haozheng Fan, Hao Zhou, Guangtai Huang, Parameswaran Raman, Xinwei Fu, Gaurav Gupta, Dhananjay Ram, Yida Wang, Jun Huan

    Abstract: Getting large language models (LLMs) to perform well on the downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of the expensive conventional accelerators (su…

    Submitted 16 April, 2024; originally announced April 2024.

  28. arXiv:2404.10304  [pdf, other]

    cs.SE cs.LG

    LLM-Powered Test Case Generation for Detecting Tricky Bugs

    Authors: Kaibo Liu, Yiyang Liu, Zhenpeng Chen, Jie M. Zhang, Yudong Han, Yun Ma, Ge Li, Gang Huang

    Abstract: Conventional automated test generation tools struggle to generate test oracles and tricky bug-revealing test inputs. Large Language Models (LLMs) can be prompted to produce test inputs and oracles for a program directly, but the precision of the tests can be very low for complex scenarios (only 6.3% based on our experiments). To fill this gap, this paper proposes AID, which combines LLMs with diff…

    Submitted 16 April, 2024; originally announced April 2024.

  29. arXiv:2404.09445  [pdf, other]

    cs.LG cs.AI cs.CV

    Exploring Text-to-Motion Generation with Human Preference

    Authors: Jenny Sheng, Matthieu Lin, Andrew Zhao, Kevin Pruvost, Yu-Hui Wen, Yangguang Li, Gao Huang, Yong-Jin Liu

    Abstract: This paper presents an exploration of preference learning in text-to-motion generation. We find that current improvements in text-to-motion generation still rely on datasets requiring expert labelers with motion capture systems. Instead, learning from human preference data does not require motion capture systems; a labeler with no expertise simply compares two generated motions. This is particular…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024 HuMoGen Workshop

  30. arXiv:2404.08844  [pdf, other]

    cs.RO cs.AI

    Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping

    Authors: Lei Zhang, Kaixin Bai, Guowen Huang, Zhaopeng Chen, Jianwei Zhang

    Abstract: The integration of optimization methods and generative models has significantly advanced dexterous manipulation techniques for five-fingered hand grasping. Yet, the application of these techniques in cluttered environments is a relatively unexplored area. To address this research gap, we have developed a novel method for generating five-fingered hand grasp samples in cluttered settings. This method…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 8 pages

  31. arXiv:2404.04120  [pdf, other]

    cs.CV

    Cross-Modality Gait Recognition: Bridging LiDAR and Camera Modalities for Human Identification

    Authors: Rui Wang, Chuanfu Shen, Manuel J. Marin-Jimenez, George Q. Huang, Shiqi Yu

    Abstract: Current gait recognition research mainly focuses on identifying pedestrians captured by the same type of sensor, neglecting the fact that individuals may be captured by different sensors in order to adapt to various environments. A more practical approach should involve cross-modality matching across different sensors. Hence, this paper focuses on investigating the problem of cross-modality gait r…

    Submitted 4 April, 2024; originally announced April 2024.

  32. arXiv:2404.01194  [pdf, other]

    cs.CV

    Adaptive Query Prompting for Multi-Domain Landmark Detection

    Authors: Qiusen Wei, Guoheng Huang, Xiaochen Yuan, Xuhang Chen, Guo Zhong, Jianwen Huang, Jiajie Huang

    Abstract: Medical landmark detection is crucial in various medical imaging modalities and procedures. Although deep learning-based methods have achieved promising performance, they are mostly designed for specific anatomical regions or tasks. In this work, we propose a universal model for multi-domain landmark detection by leveraging transformer architecture and developing a prompting component, named as Ada…

    Submitted 1 April, 2024; originally announced April 2024.

  33. arXiv:2404.01127  [pdf, other]

    cs.CV cs.AI

    Medical Visual Prompting (MVP): A Unified Framework for Versatile and High-Quality Medical Image Segmentation

    Authors: Yulin Chen, Guoheng Huang, Kai Huang, Zijin Lin, Guo Zhong, Shenghong Luo, Jie Deng, Jian Zhou

    Abstract: Accurate segmentation of lesion regions is crucial for clinical diagnosis and treatment across various diseases. While deep convolutional networks have achieved satisfactory results in medical image segmentation, they face challenges such as loss of lesion shape information due to continuous convolution and downsampling, as well as the high cost of manually labeling lesions with varying shapes and…

    Submitted 1 April, 2024; originally announced April 2024.

  34. arXiv:2403.18259  [pdf, other]

    cs.RO

    RoboKeyGen: Robot Pose and Joint Angles Estimation via Diffusion-based 3D Keypoint Generation

    Authors: Yang Tian, Jiyao Zhang, Guowei Huang, Bin Wang, Ping Wang, Jiangmiao Pang, Hao Dong

    Abstract: Estimating robot pose and joint angles is significant in advanced robotics, enabling applications like robot collaboration and online hand-eye calibration. However, the introduction of unknown joint angles makes prediction more complex than simple robot pose estimation, due to its higher dimensionality. Previous methods either regress 3D keypoints directly or utilise a render&compare strategy. These…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted by ICRA 2024

  35. arXiv:2403.17701   

    eess.IV cs.CV cs.LG

    Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

    Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

    Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its var…

    Submitted 3 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Experimental method encountered errors, undergoing experiment again

  36. arXiv:2403.12536  [pdf, other]

    cs.CV

    Vox-Fusion++: Voxel-based Neural Implicit Dense Tracking and Mapping with Multi-maps

    Authors: Hongjia Zhai, Hai Li, Xingrui Yang, Gan Huang, Yuhang Ming, Hujun Bao, Guofeng Zhang

    Abstract: In this paper, we introduce Vox-Fusion++, a multi-maps-based robust dense tracking and mapping system that seamlessly fuses neural implicit representations with traditional volumetric fusion techniques. Building upon the concept of implicit mapping and positioning systems, our approach extends its applicability to real-world scenarios. Our system employs a voxel-based neural implicit surface repre…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 14 pages. arXiv admin note: text overlap with arXiv:2210.15858

  37. From Explainable to Interpretable Deep Learning for Natural Language Processing in Healthcare: How Far from Reality?

    Authors: Guangming Huang, Yingya Li, Shoaib Jameel, Yunfei Long, Giorgos Papanastasiou

    Abstract: Deep learning (DL) has substantially enhanced natural language processing (NLP) in healthcare research. However, the increasing complexity of DL-based NLP necessitates transparent model interpretability, or at least explainability, for reliable decision-making. This work presents a thorough scoping review of explainable and interpretable DL in healthcare NLP. The term "eXplainable and Interpretabl…

    Submitted 9 May, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted by Computational and Structural Biotechnology Journal

  38. arXiv:2403.11808  [pdf, other]

    cs.CV

    Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation

    Authors: Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You

    Abstract: Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success on vision transformers (ViTs) adaptation by improving parameter efficiency. However, enhancing inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally extensive. In this paper,…

    Submitted 18 March, 2024; originally announced March 2024.

  39. arXiv:2403.11703  [pdf, other]

    cs.CV cs.AI

    LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

    Authors: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang

    Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in t…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Preprint

  40. arXiv:2403.11626  [pdf, other]

    cs.GR cs.AI cs.CV cs.MM cs.SD eess.AS

    QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation

    Authors: Zhizhen Zhou, Yejing Huo, Guoheng Huang, An Zeng, Xuhang Chen, Lian Huang, Zinuo Li

    Abstract: The study of music-generated dance is a novel and challenging image generation task. It aims to input a piece of music and seed motions, then generate natural dance movements for the subsequent music. Transformer-based methods face challenges in time series prediction tasks related to human movements and music due to their struggle in capturing the nonlinear relationship and temporal aspects. This…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted by The Visual Computer Journal

  41. arXiv:2403.11127  [pdf, other]

    cs.CV

    GRA: Detecting Oriented Objects through Group-wise Rotating and Attention

    Authors: Jiangshan Wang, Yifan Pu, Yizeng Han, Jiayi Guo, Yiru Wang, Xiu Li, Gao Huang

    Abstract: Oriented object detection, an emerging task in recent years, aims to identify and locate objects across varied orientations. This requires the detector to accurately capture the orientation information, which varies significantly within and across images. Despite the existing substantial efforts, simultaneously ensuring model effectiveness and parameter efficiency remains challenging in this scena…

    Submitted 19 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: tech report

  42. arXiv:2403.10840  [pdf, other]

    cs.RO cs.CV

    MSI-NeRF: Linking Omni-Depth with View Synthesis through Multi-Sphere Image aided Generalizable Neural Radiance Field

    Authors: Dongyu Yan, Guanyu Huang, Fengyu Quan, Haoyao Chen

    Abstract: Panoramic observation using fisheye cameras is significant in robot perception, reconstruction, and remote operation. However, panoramic images synthesized by traditional methods lack depth information and can only provide three degrees-of-freedom (3DoF) rotation rendering in virtual reality applications. To fully preserve and exploit the parallax information within the original fisheye cameras, w…

    Submitted 16 March, 2024; originally announced March 2024.

    Comments: 8 pages, 7 figures, Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems 2024

  43. arXiv:2403.10790  [pdf, other]

    quant-ph cs.CR cs.LG

    QuantumLeak: Stealing Quantum Neural Networks from Cloud-based NISQ Machines

    Authors: Zhenxiao Fu, Min Yang, Cheng Chu, Yilun Xu, Gang Huang, Fan Chen

    Abstract: Variational quantum circuits (VQCs) have become a powerful tool for implementing Quantum Neural Networks (QNNs), addressing a wide range of complex problems. Well-trained VQCs serve as valuable intellectual assets hosted on cloud-based Noisy Intermediate Scale Quantum (NISQ) computers, making them susceptible to malicious VQC stealing attacks. However, traditional model extraction techniques desig…

    Submitted 15 March, 2024; originally announced March 2024.

    Journal ref: published in IJCNN 2024

  44. arXiv:2403.09622  [pdf, other]

    cs.CV

    Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

    Authors: Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan

    Abstract: Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the ch…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: technical report, 18 pages, 19 figures

  45. arXiv:2403.07153  [pdf, other]

    cs.CV

    2023 Low-Power Computer Vision Challenge (LPCVC) Summary

    Authors: Leo Chen, Benjamin Boardley, Ping Hu, Yiru Wang, Yifan Pu, Xin Jin, Yongqiang Yao, Ruihao Gong, Bo Li, Gao Huang, Xianglong Liu, Zifu Wan, Xinwang Chen, Ning Liu, Ziyi Zhang, Dongping Liu, Ruijie Shan, Zhengping Che, Fachao Zhang, Xiaofeng Mou, Jian Tang, Maxim Chuprov, Ivan Malofeev, Alexander Goncharenko, Andrey Shcherbin, et al. (5 additional authors not shown)

    Abstract: This article describes the 2023 IEEE Low-Power Computer Vision Challenge (LPCVC). Since 2015, LPCVC has been an international competition devoted to tackling the challenge of computer vision (CV) on edge devices. Most CV researchers focus on improving accuracy, at the expense of ever-growing sizes of machine models. LPCVC balances accuracy with resource requirements. Winners must achieve high accu…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: LPCVC 2023, website: https://lpcv.ai/

  46. arXiv:2403.06845  [pdf, other]

    cs.CV

    DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

    Authors: Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang

    Abstract: World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which builds upon the framework of DriveDreamer and incorporates a Large Language Model (LLM) to generate user-defined driving videos. Specificall…

    Submitted 11 April, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Project Page: https://drivedreamer2.github.io

  47. arXiv:2403.06726  [pdf, other]

    cs.LG cs.CV

    Probabilistic Contrastive Learning for Long-Tailed Visual Recognition

    Authors: Chaoqun Du, Yulin Wang, Shiji Song, Gao Huang

    Abstract: Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples. Such an imbalance considerably impairs the performance of standard supervised learning algorithms, which are mainly designed for balanced training sets. Recent investigations have revealed that supervised contrastive learning exhibits promising potenti…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI)

  48. arXiv:2402.19350  [pdf, other]

    cs.CL

    Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process

    Authors: Guangming Huang, Yunfei Long, Cunjin Luo, Jiaxing Shen, Xia Sun

    Abstract: Pre-trained language models (PLMs) leverage chains-of-thought (CoT) to simulate human reasoning and inference processes, achieving proficient performance in multi-hop QA. However, a gap persists between PLMs' reasoning abilities and those of humans when tackling complex problems. Psychological studies suggest a vital connection between explicit information in passages and human prior knowledge dur…

    Submitted 4 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: This paper has been accepted at LREC-COLING 2024

  49. arXiv:2402.14099  [pdf, other]

    eess.IV cs.CV physics.med-ph

    EXACT-Net: EHR-guided lung tumor auto-segmentation for non-small cell lung cancer radiotherapy

    Authors: Hamed Hooshangnejad, Xue Feng, Gaofeng Huang, Rui Zhang, Quan Chen, Kai Ding

    Abstract: Lung cancer is a devastating disease with the highest mortality rate among cancer types. Over 60% of non-small cell lung cancer (NSCLC) patients, which accounts for 87% of diagnoses, require radiation therapy. Rapid treatment initiation significantly increases the patient's survival rate and reduces the mortality rate. Accurate tumor segmentation is a critical step in the diagnosis and treatment o…

    Submitted 21 February, 2024; originally announced February 2024.

  50. arXiv:2402.13505  [pdf, other]

    cs.LG cs.CV

    SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning

    Authors: Chaoqun Du, Yizeng Han, Gao Huang

    Abstract: Recent advancements in semi-supervised learning have focused on a more realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data remains both unknown and potentially mismatched. Current approaches in this sphere often presuppose rigid assumptions regarding the class distribution of unlabeled data, thereby limiting the adaptability of model…

    Submitted 1 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ICML2024 camera-ready version