Showing 1–50 of 851 results for author: Wu, W

Searching in archive cs.
  1. arXiv:2407.19984  [pdf, other]

    cs.CL

    Confidence Estimation for Automatic Detection of Depression and Alzheimer's Disease Based on Clinical Interviews

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Speech-based automatic detection of Alzheimer's disease (AD) and depression has attracted increased attention. Confidence estimation is crucial for a trust-worthy automatic diagnostic system which informs the clinician about the confidence of model predictions and helps reduce the risk of misdiagnosis. This paper investigates confidence estimation for automatic detection of AD and depression based…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2407.19807  [pdf, other]

    cs.CL

    Cool-Fusion: Fuse Large Language Models without Training

    Authors: Cong Liu, Xiaojun Quan, Yan Pan, Liang Lin, Weigang Wu, Xu Chen

    Abstract: We focus on the problem of fusing two or more heterogeneous large language models (LLMs) to facilitate their complementary strengths. One of the challenges on model fusion is high computational load, i.e. to fine-tune or to align vocabularies via combinatorial optimization. To this end, we propose Cool-Fusion, a simple yet effective approach that fuses the knowledge of heterogeneous source…

    Submitted 29 July, 2024; originally announced July 2024.

  3. arXiv:2407.16040  [pdf, other]

    cs.LG cs.AI

    Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures

    Authors: Kuluhan Binici, Weiming Wu, Tulika Mitra

    Abstract: Knowledge distillation (KD) is a model compression method that entails training a compact student model to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. Addressing this issue, previous works focused on customizing teacher-student pairs to improve compatibility, a computationall…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted by the BMVC-24

  4. arXiv:2407.15719  [pdf, other]

    cs.CV cs.AI

    GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

    Authors: Zhaojie Fang, Shenghao Zhu, Yifei Chen, Binfeng Zou, Fan Jia, Linwei Qiu, Chang Liu, Yiyu Huang, Xiang Feng, Feiwei Qin, Changmiao Wang, Yeru Wang, Jin Fan, Changbiao Chu, Wan-Zhen Wu, Hu Zhao

    Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classif…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 35 pages, 4 figures

  5. arXiv:2407.12322  [pdf, other]

    cs.CV

    Frequency Guidance Matters: Skeletal Action Recognition by Frequency-Aware Mixed Transformer

    Authors: Wenhan Wu, Ce Zheng, Zihao Yang, Chen Chen, Srijan Das, Aidong Lu

    Abstract: Recently, transformers have demonstrated great potential for modeling long-term dependencies from skeleton sequences and thereby gained ever-increasing attention in skeleton action recognition. However, the existing transformer-based approaches heavily rely on the naive attention mechanism for capturing the spatiotemporal features, which falls short in learning discriminative representations that…

    Submitted 29 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM Multimedia 2024

  6. arXiv:2407.11481  [pdf, other]

    cs.LG cs.AI eess.SP

    Multi-Channel Masked Autoencoder and Comprehensive Evaluations for Reconstructing 12-Lead ECG from Arbitrary Single-Lead ECG

    Authors: Jiarong Chen, Wanqing Wu, Tong Liu, Shenda Hong

    Abstract: In the context of cardiovascular diseases (CVD) that exhibit an elevated prevalence and mortality, the electrocardiogram (ECG) is a popular and standard diagnostic tool for doctors, commonly utilizing a 12-lead configuration in clinical practice. However, the 10 electrodes placed on the surface would cause a lot of inconvenience and discomfort, while the rapidly advancing wearable devices adopt th…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by KDD-AIDSH 2024

  7. arXiv:2407.10649  [pdf, other]

    cs.CV

    APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

    Authors: Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao

    Abstract: Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework involves using image-level labels as training data to generate pixel-level pseudo-labels with refinements. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels,…

    Submitted 15 July, 2024; originally announced July 2024.

  8. arXiv:2407.10441  [pdf]

    cs.AI cs.LG

    Enhancing Building Safety Design for Active Shooter Incidents: Exploration of Building Exit Parameters using Reinforcement Learning-Based Simulations

    Authors: Ruying Liu, Wanjing Wu, Burcin Becerik-Gerber, Gale M. Lucas

    Abstract: With the alarming rise in active shooter incidents (ASIs) in the United States, enhancing public safety through building design has become a pressing need. This study proposes a reinforcement learning-based simulation approach addressing gaps in existing research that has neglected the dynamic behaviours of shooters. We developed an autonomous agent to simulate an active shooter within a realistic…

    Submitted 15 July, 2024; originally announced July 2024.

    Journal ref: 31st EG-ICE International Workshop on Intelligent Computing in Engineering 2024

  9. arXiv:2407.08725  [pdf, other]

    cs.CV cs.AI cs.RO

    MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces

    Authors: Wayne Wu, Honglin He, Yiran Wang, Chenda Duan, Jack He, Zhizheng Liu, Quanyi Li, Bolei Zhou

    Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while diverse robot dogs and humanoids have recently emerged in the street. Ensurin…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Technical report. Project page: https://metadriverse.github.io/metaurban/

  10. arXiv:2407.08546  [pdf, other]

    cs.CV cs.LG q-bio.QM

    Quantitative Evaluation of the Saliency Map for Alzheimer's Disease Classifier with Anatomical Segmentation

    Authors: Yihan Zhang, Xuanshuo Zhang, Wei Wu, Haohan Wang

    Abstract: Saliency maps have been widely used to interpret deep learning classifiers for Alzheimer's disease (AD). However, since AD is heterogeneous and has multiple subtypes, the pathological mechanism of AD remains not fully understood and may vary from patient to patient. Due to the lack of such understanding, it is difficult to comprehensively and effectively assess the saliency map of AD classifier. I…

    Submitted 11 July, 2024; originally announced July 2024.

  11. arXiv:2407.08164  [pdf, other]

    cs.AI cs.MA cs.RO

    Hierarchical Consensus-Based Multi-Agent Reinforcement Learning for Multi-Robot Cooperation Tasks

    Authors: Pu Feng, Junkang Liang, Size Wang, Xin Yu, Rongye Shi, Wenjun Wu

    Abstract: In multi-agent reinforcement learning (MARL), the Centralized Training with Decentralized Execution (CTDE) framework is pivotal but struggles due to a gap: global state guidance in training versus reliance on local observations in execution, lacking global signals. Inspired by human societal consensus mechanisms, we introduce the Hierarchical Consensus-based Multi-Agent Reinforcement Learning (HC-…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 8 pages, 10 figures. Accepted for presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

  12. arXiv:2407.06677  [pdf, other]

    cs.CL

    Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

    Authors: Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan

    Abstract: Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, ca…

    Submitted 9 July, 2024; originally announced July 2024.

  13. arXiv:2407.06112  [pdf, other]

    cs.CL

    Enhancing Language Model Rationality with Bi-Directional Deliberation Reasoning

    Authors: Yadong Zhang, Shaoguang Mao, Wenshan Wu, Yan Xia, Tao Ge, Man Lan, Furu Wei

    Abstract: This paper introduces BI-Directional DEliberation Reasoning (BIDDER), a novel reasoning approach to enhance the decision rationality of language models. Traditional reasoning methods typically rely on historical information and employ uni-directional (left-to-right) reasoning strategy. This lack of bi-directional deliberation reasoning results in limited awareness of potential future outcomes and…

    Submitted 8 July, 2024; originally announced July 2024.

  14. arXiv:2407.04118  [pdf, other]

    cs.CL cs.AI

    MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

    Authors: Yuyan Chen, Zhihao Wen, Ge Fan, Zhengyu Chen, Wei Wu, Dayiheng Liu, Zhixu Li, Bang Liu, Yanghua Xiao

    Abstract: Prompt engineering, as an efficient and effective way to leverage Large Language Models (LLM), has drawn a lot of attention from the research community. The existing research primarily emphasizes the importance of adapting prompts to specific tasks, rather than specific LLMs. However, a good prompt is not solely defined by its wording, but also binds to the nature of the LLM in question. In this w…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to EMNLP 2023 (Findings)

  15. arXiv:2407.01079  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    On Statistical Rates and Provably Efficient Criteria of Latent Diffusion Transformers (DiTs)

    Authors: Jerry Yao-Chieh Hu, Weimin Wu, Zhuoru Li, Zhao Song, Han Liu

    Abstract: We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we deri…

    Submitted 1 July, 2024; originally announced July 2024.

  16. arXiv:2406.19934  [pdf, other]

    cs.CL cs.AI

    From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis

    Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

    Abstract: We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub…

    Submitted 28 June, 2024; originally announced June 2024.

  17. arXiv:2406.19781  [pdf, other]

    cs.RO

    LCSim: A Large-Scale Controllable Traffic Simulator

    Authors: Yuheng Zhang, Tianjian Ouyang, Fudan Yu, Cong Ma, Lei Qiao, Wei Wu, Jian Yuan, Yong Li

    Abstract: With the rapid development of urban transportation and the continuous advancement in autonomous vehicles, the demand for safely and efficiently testing autonomous driving and traffic optimization algorithms arises, which needs accurate modeling of large-scale urban traffic scenarios. Existing traffic simulation systems encounter two significant limitations. Firstly, they often rely on open-source…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Submitted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

  18. Blockchain Based Zero-Knowledge Proof of Location in IoT

    Authors: Wei Wu, Erwu Liu, Xinglin Gong, Rui Wang

    Abstract: With the development of precise positioning technology, a growing number of location-based services (LBSs) facilitate people's life. Most LBSs require proof of location (PoL) to prove that the user satisfies the service requirement, which exposes the user's privacy. In this paper, we propose a zero-knowledge proof of location (zk-PoL) protocol to better protect the user's privacy. With the zk-PoL…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Published on ICC 2020-2020 IEEE International Conference on Communications (ICC)

  19. arXiv:2406.18045  [pdf, other]

    cs.CL cs.AI

    PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

    Authors: Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang , et al. (11 additional authors not shown)

    Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpo…

    Submitted 9 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  20. arXiv:2406.16066  [pdf, other]

    cs.CE

    Constructing Boundary-identical Microstructures by Guided Diffusion for Fast Multiscale Designs

    Authors: Jingxuan Feng, Lili Wang, Xiaoya Zhai, Kai Chen, Wenming Wu, Ligang Liu, Xiao-Ming Fu

    Abstract: We propose a novel method to construct large-scale boundary-identical microstructure datasets with high attribute coverage for highly efficient multiscale design. Central to our technique is using a deep generative model to generate microstructures under the two conditions, including the specified boundary and homogenized elastic tensor. We achieve the desired dataset by alternately adding microst…

    Submitted 23 June, 2024; originally announced June 2024.

  21. arXiv:2406.15245  [pdf, other]

    cs.CL cs.LG

    Unsupervised Morphological Tree Tokenizer

    Authors: Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

    Abstract: As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of word…

    Submitted 21 June, 2024; originally announced June 2024.

  22. arXiv:2406.14753  [pdf, other]

    cs.LG stat.ME

    A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

    Authors: Weiqin Chen, Mark S. Squillante, Chai Wah Wu, Santiago Paternain

    Abstract: We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of our approach and derive an algorithm based on a specific instance of this approach. Our empirical results demonstrate the significant benefits of our approach.

    Submitted 20 June, 2024; originally announced June 2024.

  23. arXiv:2406.13625  [pdf]

    cs.CV cs.AI physics.med-ph

    Enhance the Image: Super Resolution using Artificial Intelligence in MRI

    Authors: Ziyu Li, Zihan Li, Haoxiang Li, Qiuyun Fan, Karla L. Miller, Wenchuan Wu, Akshay S. Chaudhari, Qiyuan Tian

    Abstract: This chapter provides an overview of deep learning techniques for improving the spatial resolution of MRI, ranging from convolutional neural networks, generative adversarial networks, to more advanced models including transformers, diffusion models, and implicit neural representations. Our exploration extends beyond the methodologies to scrutinize the impact of super-resolved images on clinical an…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: A book chapter in Machine Learning in MRI: From methods to clinical translation. Copyright may be transferred without notice, after which this version may no longer be accessible

  24. arXiv:2406.11698  [pdf, other]

    cs.CL

    Meta Reasoning for Large Language Models

    Authors: Peizhong Gao, Ao Xie, Shaoguang Mao, Wenshan Wu, Yan Xia, Haipeng Mi, Furu Wei

    Abstract: We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs) inspired by human meta-reasoning. Traditional in-context learning-based reasoning techniques, such as Tree-of-Thoughts, show promise but lack consistent state-of-the-art performance across diverse tasks due to their specialized nature. MRP addresses this limitation by guiding…

    Submitted 17 June, 2024; originally announced June 2024.

  25. arXiv:2406.11633  [pdf, other]

    cs.CV

    DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

    Authors: Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wenjie Wu, Hancheng Ye, Shiyang Feng, Bin Wang, Chao Xu, Conghui He, Pinlong Cai, Min Dou, Botian Shi, Sheng Zhou, Yongwei Wang, Bin Wang, Junchi Yan, Fei Wu, Yu Qiao

    Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extract…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page 22 pages, 11 figures

  26. arXiv:2406.11176  [pdf, other]

    cs.CL cs.AI

    Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

    Authors: Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li

    Abstract: Large language model agents have exhibited exceptional performance across a range of complex interactive tasks. Recent approaches have utilized tuning with expert trajectories to enhance agent performance, yet they primarily concentrate on outcome rewards, which may lead to errors or suboptimal actions due to the absence of process supervision signals. In this paper, we introduce the Iterative ste…

    Submitted 16 June, 2024; originally announced June 2024.

  27. arXiv:2406.09333  [pdf, other]

    cs.CV

    Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

    Authors: Weiyi Wu, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui

    Abstract: Whole Slide Images (WSIs) are crucial for modern pathological diagnosis, yet their gigapixel-scale resolutions and sparse informative regions pose significant computational challenges. Traditional dense attention mechanisms, widely used in computer vision and natural language processing, are impractical for WSI analysis due to the substantial data scale and the redundant processing of uninformativ…

    Submitted 13 June, 2024; originally announced June 2024.

  28. arXiv:2406.09194  [pdf, ps, other]

    stat.ML cs.IT cs.LG math.NA math.ST

    Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias

    Authors: Honam Wong, Wendao Wu, Fanghui Liu, Yiping Lu

    Abstract: Recent advances in machine learning have inspired a surge of research into reconstructing specific quantities of interest from measurements that comply with certain physical laws. These efforts focus on inverse problems that are governed by partial differential equations (PDEs). In this work, we develop an asymptotic Sobolev norm learning curve for kernel ridge(less) regression when addressing (el…

    Submitted 16 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  29. arXiv:2406.07411  [pdf, other]

    cs.SE cs.CL

    VersiCode: Towards Version-controllable Code Generation

    Authors: Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari

    Abstract: Significant research has focused on improving the performance of large language model on code-related tasks due to their practical importance. Although performance is typically evaluated using public benchmark datasets, the existing datasets do not account for the concept of version, which is crucial in professional software development. In this paper, we introduce VersiCode, the first comp…

    Submitted 11 June, 2024; originally announced June 2024.

  30. arXiv:2406.06393  [pdf, other]

    cs.CV cs.CL q-bio.GN

    STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

    Authors: Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li

    Abstract: Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology ima…

    Submitted 20 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

    ACM Class: I.4.10; I.2.10

  31. arXiv:2406.03882  [pdf, other]

    cs.CL cs.SD eess.AS

    Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models

    Authors: Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang

    Abstract: The early detection of suicide risk is important since it enables the intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse…

    Submitted 9 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  32. arXiv:2406.03199  [pdf, other]

    cs.CL cs.AI cs.LG

    Bayesian WeakS-to-Strong from Text Classification to Generation

    Authors: Ziyun Cui, Ziyang Zhang, Wen Wu, Guangzhi Sun, Chao Zhang

    Abstract: Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of…

    Submitted 24 May, 2024; originally announced June 2024.

  33. arXiv:2406.02987  [pdf, other]

    cs.CV

    Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

    Authors: Wenliang Zhong, Wenyi Wu, Qi Li, Rob Barton, Boxin Du, Shioulin Sam, Karim Bouyarmane, Ismail Tutar, Junzhou Huang

    Abstract: Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing the visual representations with LLMs leveraging some visual adapters. In this paper, we first establish that adapters using query-based Transformers such as Q-former is a simplified Multi-instance Learning method without considering instance heterogeneity/correlation. We then propose…

    Submitted 5 June, 2024; originally announced June 2024.

  34. arXiv:2406.01059  [pdf, other]

    cs.CV

    VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model

    Authors: Jinze Yang, Haoran Wang, Zining Zhu, Chenglong Liu, Meng Wymond Wu, Zeke Xie, Zhong Ji, Jungong Han, Mingming Sun

    Abstract: In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 15 pages

  35. arXiv:2406.00654  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

    Authors: Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

    Abstract: In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even st…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 19 pages, Preprint

  36. arXiv:2405.20064  [pdf, other]

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  37. arXiv:2405.17659  [pdf, other]

    eess.IV cs.CV

    Enhancing Global Sensitivity and Uncertainty Quantification in Medical Image Reconstruction with Monte Carlo Arbitrary-Masked Mamba

    Authors: Jiahao Huang, Liutao Yang, Fanwen Wang, Yang Nan, Weiwen Wu, Chengyan Wang, Kuangyu Shi, Angelica I. Aviles-Rivero, Carola-Bibiane Schönlieb, Daoqiang Zhang, Guang Yang

    Abstract: Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has sh…

    Submitted 25 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  38. arXiv:2405.17167  [pdf]

    eess.IV cs.CV

    Partitioned Hankel-based Diffusion Models for Few-shot Low-dose CT Reconstruction

    Authors: Wenhao Zhang, Bin Huang, Shuyue Chen, Xiaoling Xu, Weiwen Wu, Qiegen Liu

    Abstract: Low-dose computed tomography (LDCT) plays a vital role in clinical applications by mitigating radiation risks. Nevertheless, reducing radiation doses significantly degrades image quality. Concurrently, common deep learning methods demand extensive data, posing concerns about privacy, cost, and time constraints. Consequently, we propose a few-shot low-dose CT reconstruction method using Partitioned…

    Submitted 27 May, 2024; originally announced May 2024.

  39. arXiv:2405.16464  [pdf, other]

    cs.RO cs.CV

    Multi-Modal UAV Detection, Classification and Tracking Algorithm -- Technical Report for CVPR 2024 UG2 Challenge

    Authors: Tianchen Deng, Yi Zhou, Wenhua Wu, Mingrui Li, Jingwei Huang, Shuhong Liu, Yanzeng Song, Hao Zuo, Yanbo Wang, Yutao Yue, Hesheng Wang, Weidong Chen

    Abstract: This technical report presents the 1st winning model for UG2+, a task in CVPR 2024 UAV Tracking and Pose-Estimation Challenge. This challenge faces difficulties in drone detection, UAV-type classification and 2D/3D trajectory estimation in extreme weather conditions with multi-modal sensor information, including stereo vision, various Lidars, Radars, and audio arrays. Leveraging this information…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR 2024 workshop. The 1st winning model in CVPR 2024 UG2+ challenge. The code and configuration of our method are available at https://github.com/dtc111111/Multi-Modal-UAV

  40. arXiv:2405.15677  [pdf, other]

    cs.RO cs.CV

    SMART: Scalable Multi-agent Real-time Simulation via Next-token Prediction

    Authors: Wei Wu, Xiaoxin Feng, Ziyan Gao, Yuheng Kan

    Abstract: Data-driven autonomous driving motion generation tasks are frequently impacted by the limitations of dataset size and the domain gap between datasets, which precludes their extensive application in real-world scenarios. To address this issue, we introduce SMART, a novel autonomous driving motion generation paradigm that models vectorized map and agent trajectory data into discrete sequence tokens.…

    Submitted 24 May, 2024; originally announced May 2024.

  41. arXiv:2405.14342  [pdf, other]

    cs.CV

    RoGS: Large Scale Road Surface Reconstruction based on 2D Gaussian Splatting

    Authors: Zhiheng Feng, Wenhua Wu, Hesheng Wang

    Abstract: Road surface reconstruction plays a crucial role in autonomous driving, which can be used for road lane perception and autolabeling tasks. Recently, mesh-based road surface reconstruction algorithms show promising reconstruction results. However, these mesh-based methods suffer from slow speed and poor rendering quality. In contrast, the 3D Gaussian Splatting (3DGS) shows superior rendering speed…

    Submitted 23 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  42. arXiv:2405.14256  [pdf, other]

    cs.LG cs.AI

    ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

    Authors: Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang

    Abstract: KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance. However, previous methods of this approach exhibit significant performance degradation…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 15 pages

  43. arXiv:2405.14231  [pdf, other]

    cs.CL

    From Role-Play to Drama-Interaction: An LLM Solution

    Authors: Weiqi Wu, Hongqiu Wu, Lai Jiang, Xingyuan Liu, Jiale Hong, Hai Zhao, Min Zhang

    Abstract: Drama is a form of storytelling inspired by human creativity, proceeding with a predefined storyline, carrying emotions and thoughts. This paper introduces LLM-based interactive drama, which endows traditional drama with an unprecedented immersion, where a person is allowed to walk into it and interact with the characters and scenes. We define this new artistic genre by 6 essential elements…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Accepted by ACL 2024 Findings

  44. arXiv:2405.13800  [pdf, other]

    cs.CV cs.AI

    Dense Connector for MLLMs

    Authors: Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang

    Abstract: Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Technical report. 25 pages

  45. arXiv:2405.13089  [pdf, other]

    cs.LG

    SEGAN: semi-supervised learning approach for missing data imputation

    Authors: Xiaohua Pan, Weifeng Wu, Peiran Liu, Zhen Li, Peng Lu, Peijian Cao, Jianfeng Zhang, Xianfei Qiu, YangYang Wu

    Abstract: In many practical real-world applications, missing data is a very common phenomenon, making the development of data-driven artificial intelligence theory and technology increasingly difficult. Data completion is an important method for missing data preprocessing. Most existing missing data completion models directly use the known information in the missing data set but ignore the impact of the da…

    Submitted 12 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

  46. arXiv:2405.12520  [pdf, other]

    cs.DC

    MOSS: A Large-scale Open Microscopic Traffic Simulation System

    Authors: Jun Zhang, Wenxuan Ao, Junbo Yan, Can Rong, Depeng Jin, Wei Wu, Yong Li

    Abstract: In the research of Intelligent Transportation Systems (ITS), traffic simulation is a key procedure for the evaluation of new methods and optimization of strategies. However, existing traffic simulation systems face two challenges. First, how to balance simulation scale with realism is a dilemma. Second, it is hard to simulate realistic results, which requires realistic travel demand data and simul…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Submitted to IEEE ITSC 2024

  47. arXiv:2405.11165  [pdf, other]

    cs.CV

    Automated Multi-level Preference for MLLMs

    Authors: Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhao, Fanglong Liu, Yifan Sun, Haocheng Feng, Jingdong Wang

    Abstract: Current multimodal Large Language Models (MLLMs) suffer from "hallucination", occasionally generating responses that are not grounded in the input images. To tackle this challenge, one promising path is to utilize reinforcement learning from human feedback (RLHF), which steers MLLMs towards learning superior responses while avoiding inferior ones. We rethink the common practice of using binary p…

    Submitted 28 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

    Comments: Preprint

  48. arXiv:2405.10591  [pdf, other]

    cs.CV

    GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

    Authors: Xin Tan, Wenbin Wu, Zhiwei Zhang, Chaojie Fan, Yong Peng, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

    Abstract: 3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability issues due to…

    Submitted 17 May, 2024; originally announced May 2024.

  49. arXiv:2405.10185  [pdf, other]

    cs.CV

    DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data

    Authors: Chengxiang Fan, Muzhi Zhu, Hao Chen, Yang Liu, Weijia Wu, Huaqi Zhang, Chunhua Shen

    Abstract: Instance segmentation is data-hungry, and as model capacity increases, data scale becomes crucial for improving the accuracy. Most instance segmentation datasets today require costly manual annotation, limiting their data scale. Models trained on such data are prone to overfitting on the training set, especially for those rare categories. While recent works have delved into exploiting generative m…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Accepted to CVPR 2024; code is available at https://github.com/aim-uofa/DiverGen

  50. arXiv:2405.10130  [pdf, ps, other]

    cs.MS

    PyOptInterface: Design and implementation of an efficient modeling language for mathematical optimization

    Authors: Yue Yang, Chenhui Lin, Luo Xu, Wenchuan Wu

    Abstract: This paper introduces the design and implementation of PyOptInterface, a modeling language for mathematical optimization embedded in the Python programming language. PyOptInterface uses lightweight and compact data structures to bridge high-level entities in optimization models like variables and constraints to internal indices of optimizers efficiently. It supports a variety of optimization solvers an…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 10 pages