Showing 1–50 of 748 results for author: Yuan, Y

Searching in archive cs.
  1. arXiv:2407.11745  [pdf, other]

    eess.AS cs.AI cs.SD

    Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

    Authors: Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Universal sound separation (USS) is a task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we…

    Submitted 16 July, 2024; originally announced July 2024.

  2. arXiv:2407.09918  [pdf, other]

    eess.IV cs.CV

    DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation

    Authors: Xinyu Liu, Wuyang Li, Yixuan Yuan

    Abstract: Semi-supervised medical image segmentation aims to leverage limited annotated data and rich unlabeled data to perform accurate segmentation. However, existing semi-supervised methods are highly dependent on the quality of self-generated pseudo labels, which are prone to incorrect supervision and confirmation bias. Meanwhile, they are insufficient in capturing the label distributions in latent spac…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: MICCAI 2024

  3. arXiv:2407.09826  [pdf, other]

    cs.CV

    3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

    Authors: Xiaoxu Xu, Yitian Yuan, Jinlong Li, Qiudan Zhang, Zequn Jie, Lin Ma, Hao Tang, Nicu Sebe, Xu Wang

    Abstract: In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, an alternative approach that a 3D model predicts dense-embedding for each point which is co-embedded with both the aligned image and text spaces from the 2D vision-language model. Specifically, our method exploits the superior generalization ability of the 2D vision-langu…

    Submitted 13 July, 2024; originally announced July 2024.

  4. arXiv:2407.09760  [pdf, other]

    cs.CV cs.AI

    ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report

    Authors: Yixiao Yuan, Yingzhe Peng

    Abstract: The Visual-Dialog Based Emotion Explanation Generation Challenge focuses on generating emotion explanations through visual-dialog interactions in art discussions. Our approach combines state-of-the-art multi-modal models, including Language Model (LM) and Large Vision Language Model (LVLM), to achieve superior performance. By leveraging these models, we outperform existing benchmarks, securing the…

    Submitted 12 July, 2024; originally announced July 2024.

  5. arXiv:2407.09121  [pdf, other]

    cs.CL cs.AI

    Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

    Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu

    Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at a…

    Submitted 12 July, 2024; originally announced July 2024.

  6. arXiv:2407.06881  [pdf, other]

    cs.DS

    Efficient Stochastic Routing in Path-Centric Uncertain Road Networks -- Extended Version

    Authors: Chenjuan Guo, Ronghui Xu, Bin Yang, Ye Yuan, Tung Kieu, Yan Zhao, Christian S. Jensen

    Abstract: The availability of massive vehicle trajectory data enables the modeling of road-network constrained movement as travel-cost distributions rather than just single-valued costs, thereby capturing the inherent uncertainty of movement and enabling improved routing quality. Thus, stochastic routing has been studied extensively in the edge-centric model, where such costs are assigned to the edges in a…

    Submitted 9 July, 2024; originally announced July 2024.

  7. arXiv:2407.06048  [pdf, other]

    cs.CL cs.CV

    Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation

    Authors: Alan Wu, Ye Yuan, Ming Zhang

    Abstract: Visually impaired people are a large group who can only use braille for reading and writing. However, the lack of special educational resources is the bottleneck for educating them. Educational equity is a reflection of the level of social civilization, cultural equality, and individual dignity. Facilitating and improving lifelong learning channels for the visually impaired is of great significanc…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: This paper is submitted to NeurIPS 2024 High School Project Track

  8. arXiv:2407.05540  [pdf, other]

    cs.CV

    GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

    Authors: Chenxin Li, Xinyu Liu, Cheng Wang, Yifan Liu, Weihao Yu, Jing Shao, Yixuan Yuan

    Abstract: Recent advances in learning multi-modal representation have witnessed the success in biomedical domains. While established techniques enable handling multi-modal information, the challenges are posed when extended to various clinical modalities and practical modality-missing setting due to the inherent modality gaps. To tackle these, we propose an innovative Modality-prompted Heterogeneous Graph fo…

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  9. arXiv:2407.04416  [pdf, other]

    cs.SD cs.MM eess.AS

    Improving Audio Generation with Visual Enhanced Caption

    Authors: Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Mark D. Plumbley, Wenwu Wang

    Abstract: Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data. In this work, we aim to create a large-scale audio dataset with rich captions for improving audi…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 5 pages with 1 appendix

  10. arXiv:2407.04121  [pdf, other]

    cs.CL cs.AI

    Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models

    Authors: Yuyan Chen, Qiang Fu, Yichen Yuan, Zhihao Wen, Ge Fan, Dayiheng Liu, Dongmei Zhang, Zhixu Li, Yanghua Xiao

    Abstract: Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks, including question answering and dialogue systems. However, a major drawback of LLMs is the issue of hallucination, where they generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences. In this paper, we propose a robust discriminator name…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to CIKM 2023 (Long Paper)

  11. arXiv:2407.03825  [pdf, other]

    cs.CV cs.RO

    StreamLTS: Query-based Temporal-Spatial LiDAR Fusion for Cooperative Object Detection

    Authors: Yunshuang Yuan, Monika Sester

    Abstract: Cooperative perception via communication among intelligent traffic agents has great potential to improve the safety of autonomous driving. However, limited communication bandwidth, localization errors and asynchronized capturing time of sensor data, all introduce difficulties to the data fusion of different agents. To some extent, previous works have attempted to reduce the shared data size, mitig…

    Submitted 4 July, 2024; originally announced July 2024.

  12. arXiv:2407.03640  [pdf, other]

    cs.LG cs.CL cs.CV

    Generative Technology for Human Emotion Recognition: A Scope Review

    Authors: Fei Ma, Yucheng Yuan, Yifan Xie, Hongwei Ren, Ivan Liu, Ying He, Fuji Ren, Fei Richard Yu, Shiguang Ni

    Abstract: Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progre…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Under Review

  13. arXiv:2407.03133  [pdf, other]

    cs.CY cs.AI cs.LG stat.ML

    Quantifying the Cross-sectoral Intersecting Discrepancies within Multiple Groups Using Latent Class Analysis Towards Fairness

    Authors: Yingfang Yuan, Kefan Chen, Mehdi Rizvi, Lynne Baillie, Wei Pang

    Abstract: The growing interest in fair AI development is evident. The "Leave No One Behind" initiative urges us to address multiple and intersecting forms of inequality in accessing services, resources, and opportunities, emphasising the significance of fairness in AI. This is particularly relevant as an increasing number of AI tools are applied to decision-making processes, such as resource allocation an…

    Submitted 11 July, 2024; v1 submitted 24 May, 2024; originally announced July 2024.

  14. arXiv:2407.02392  [pdf, other]

    cs.CV

    TokenPacker: Efficient Visual Projector for Multimodal LLM

    Authors: Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

    Abstract: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significa…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 16 pages, Code: https://github.com/CircleRadon/TokenPacker

  15. arXiv:2407.01301  [pdf, other]

    cs.CV

    GaussianStego: A Generalizable Stenography Pipeline for Generative 3D Gaussians Splatting

    Authors: Chenxin Li, Hengyu Liu, Zhiwen Fan, Wuyang Li, Yifan Liu, Panwang Pan, Yixuan Yuan

    Abstract: Recent advancements in large generative models and real-time neural rendering using point-based techniques pave the way for a future of widespread visual data distribution through sharing synthesized 3D assets. However, while standardized methods for embedding proprietary or copyright information, either overtly or subtly, exist for conventional visual content such as images and videos, this issue…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Project website: https://gaussian-stego.github.io/

  16. arXiv:2407.01029  [pdf, other]

    cs.CV

    EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

    Authors: Chenxin Li, Brandon Y. Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, Yixuan Yuan

    Abstract: 3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted by MICCAI 2024

  17. arXiv:2407.00468  [pdf, other]

    cs.CV cs.AI cs.CL

    MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

    Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

    Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

  18. arXiv:2407.00187  [pdf, other]

    cs.RO cs.CV cs.GR

    SMPLOlympics: Sports Environments for Physically Simulated Humanoids

    Authors: Zhengyi Luo, Jiashun Wang, Kangni Liu, Haotian Zhang, Chen Tessler, Jingbo Wang, Ye Yuan, Jinkun Cao, Zihui Lin, Fengyi Wang, Jessica Hodgins, Kris Kitani

    Abstract: We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have been competing in these sports for m…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: Project page: https://smplolympics.github.io/SMPLOlympics

  19. arXiv:2407.00099  [pdf, other]

    q-bio.NC cs.LG stat.AP

    Optimal Transport for Latent Integration with An Application to Heterogeneous Neuronal Activity Data

    Authors: Yubai Yuan, Babak Shahbaba, Norbert Fortin, Keiland Cooper, Qing Nie, Annie Qu

    Abstract: Detecting dynamic patterns of task-specific responses shared across heterogeneous datasets is an essential and challenging problem in many scientific applications in medical science and neuroscience. In our motivating example of rodent electrophysiological data, identifying the dynamical patterns in neuronal activity associated with ongoing cognitive demands and behavior is key to uncovering the n…

    Submitted 27 June, 2024; originally announced July 2024.

  20. arXiv:2406.18310  [pdf, other]

    cs.CV cs.LG eess.IV

    Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

    Authors: Wenting Chen, Jie Liu, Tommy W. S. Chow, Yixuan Yuan

    Abstract: Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and mis…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)

  21. arXiv:2406.17255  [pdf, other]

    cs.CL

    MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning

    Authors: Zhenlong Dai, Chang Yao, WenKang Han, Ying Yuan, Zhipeng Gao, Jingyuan Chen

    Abstract: Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code; how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we proposed MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn co…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024, Main Conference

  22. arXiv:2406.16073  [pdf, other]

    cs.CV

    LGS: A Light-weight 4D Gaussian Splatting for Efficient Surgical Scene Reconstruction

    Authors: Hengyu Liu, Yifan Liu, Chenxin Li, Wuyang Li, Yixuan Yuan

    Abstract: The advent of 3D Gaussian Splatting (3D-GS) techniques and their dynamic scene modeling variants, 4D-GS, offers promising prospects for real-time rendering of dynamic surgical scenarios. However, the prerequisite for modeling dynamic scenes by a large number of Gaussian units, the high-dimensional Gaussian attributes and the high-resolution deformation fields, all lead to severe storage issues that…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted by MICCAI 2024. Project page: https://lgs-endo.github.io/

  23. arXiv:2406.15264  [pdf, other]

    cs.IR cs.CL

    Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics

    Authors: Weijia Zhang, Mohammad Aliannejadi, Yifei Yuan, Jiahuan Pei, Jia-Hong Huang, Evangelos Kanoulas

    Abstract: Large language models (LLMs) often produce unsupported or unverifiable information, known as "hallucinations." To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estima…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: 12 pages, 3 figures

  24. arXiv:2406.13450  [pdf, other]

    cs.AI

    Federating to Grow Transformers with Constrained Resources without Model Sharing

    Authors: Shikun Shen, Yifei Zou, Yuan Yuan, Yanwei Zheng, Peng Li, Xiuzhen Cheng, Dongxiao Yu

    Abstract: The high resource consumption of large-scale models discourages resource-constrained users from developing their customized transformers. To this end, this paper considers a federated framework named Fed-Grow for multiple participants to cooperatively scale a transformer from their pre-trained small models. Under the Fed-Grow, a Dual-LiGO (Dual Linear Growth Operator) architecture is designed to h…

    Submitted 19 June, 2024; originally announced June 2024.

  25. arXiv:2406.12270  [pdf, other]

    cs.IT eess.SP

    Sparse MIMO for ISAC: New Opportunities and Challenges

    Authors: Xinrui Li, Hongqi Min, Yong Zeng, Shi Jin, Linglong Dai, Yifei Yuan, Rui Zhang

    Abstract: Multiple-input multiple-output (MIMO) has been a key technology of wireless communications for decades. A typical MIMO system employs antenna arrays with the inter-antenna spacing being half of the signal wavelength, which we term as compact MIMO. Looking forward towards the future sixth-generation (6G) mobile communication networks, MIMO system will achieve even finer spatial resolution to not on…

    Submitted 18 June, 2024; originally announced June 2024.

  26. arXiv:2406.11030  [pdf, other]

    cs.CL

    FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

    Authors: Wenyan Li, Xinyu Zhang, Jiaang Li, Qiwei Peng, Raphael Tang, Li Zhou, Weijia Zhang, Guimin Hu, Yifei Yuan, Anders Søgaard, Daniel Hershcovich, Desmond Elliott

    Abstract: Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language Models (VLMs)…

    Submitted 16 June, 2024; originally announced June 2024.

  27. arXiv:2406.10673  [pdf, other]

    cs.CV

    SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

    Authors: Yike Yuan, Huanzhang Dou, Fengjun Guo, Xi Li

    Abstract: This paper presents a neat yet effective framework, named SemanticMIM, to integrate the advantages of masked image modeling (MIM) and contrastive learning (CL) for general visual representation. We conduct a thorough comparative analysis between CL and MIM, revealing that their complementary advantages fundamentally stem from two distinct phases, i.e., compression and reconstruction. Specificall…

    Submitted 15 June, 2024; originally announced June 2024.

  28. arXiv:2406.10508  [pdf, other]

    cs.CV

    Learning to Adapt Foundation Model DINOv2 for Capsule Endoscopy Diagnosis

    Authors: Bowen Zhang, Ying Chen, Long Bai, Yan Zhao, Yuxiang Sun, Yixuan Yuan, Jianhua Zhang, Hongliang Ren

    Abstract: Foundation models have become prominent in computer vision, achieving notable success in various tasks. However, their effectiveness largely depends on pre-training with extensive datasets. Applying foundation models directly to small datasets of capsule endoscopy images from scratch is challenging. Pre-training on broad, general vision datasets is crucial for successfully fine-tuning our model fo…

    Submitted 30 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: To appear in ICBIR 2024

  29. arXiv:2406.10484  [pdf, other]

    cs.CV

    Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

    Authors: Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

    Abstract: The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on s…

    Submitted 14 June, 2024; originally announced June 2024.

  30. arXiv:2406.10208  [pdf, other]

    cs.CV

    Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

    Authors: Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan

    Abstract: Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages b…

    Submitted 12 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: Project page: https://glyph-byt5-v2.github.io/

  31. arXiv:2406.09509  [pdf, other]

    cs.AI cs.LG cs.RO

    CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making

    Authors: Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yi Ma, Pengyi Li, Yan Zheng

    Abstract: Leveraging the powerful generative capability of diffusion models (DMs) to build decision-making agents has achieved extensive success. However, there is still a demand for an easy-to-use and modularized open-source library that offers customized and efficient development for DM-based decision-making algorithms. In this work, we introduce CleanDiffuser, the first DM library specifically designed f…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: The first two authors contribute equally to this work. Code and documentation: https://github.com/CleanDiffuserTeam/CleanDiffuser

  32. arXiv:2406.08392  [pdf, other]

    cs.CV

    FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation

    Authors: Xinzhi Mu, Li Chen, Bohan Chen, Shuyang Gu, Jianmin Bao, Dong Chen, Ji Li, Yuhui Yuan

    Abstract: Recently, the application of modern diffusion-based text-to-image generation models for creating artistic fonts, traditionally the domain of professional designers, has garnered significant interest. Diverging from the majority of existing studies that concentrate on generating artistic typography, our research aims to tackle a novel and more demanding challenge: the generation of text effects for…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Project-page: https://font-studio.github.io/

  33. arXiv:2406.08024  [pdf, other]

    cs.CV cs.AI

    Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

    Authors: Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma

    Abstract: Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances…

    Submitted 12 June, 2024; originally announced June 2024.

  34. arXiv:2406.06571  [pdf, other]

    cs.CL cs.AI

    SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

    Authors: Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

    Abstract: While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The sub…

    Submitted 17 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: 9 pages, 3 figures, submitted to ECAI 2024

    ACM Class: I.2.7

  35. arXiv:2406.05995  [pdf, other]

    cs.CL cs.AI cs.LG

    A Dual-View Approach to Classifying Radiology Reports by Co-Training

    Authors: Yutong Han, Yan Yuan, Lili Mou

    Abstract: Radiology report analysis provides valuable information that can aid with public health initiatives, and has been attracting increasing attention from the research community. In this work, we present a novel insight that the structure of a radiology report (namely, the Findings and Impression sections) offers different views of a radiology scan. Based on this intuition, we further propose a co-tra…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by LREC-COLING 2024

  36. arXiv:2406.04314  [pdf, other]

    cs.CV

    Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

    Authors: Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, Liang Zheng

    Abstract: Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) to aligning text-to-image diffusion models with human preferences. Unlike most existing DPO methods that assume all diffusion steps share a consistent preference order with the final generated images, we argue that this assumption neglects step-specific denoising performance and that…

    Submitted 6 June, 2024; originally announced June 2024.

  37. arXiv:2406.02918  [pdf, other]

    eess.IV cs.CV

    U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation

    Authors: Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, Yixuan Yuan

    Abstract: U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited to linearly modeling patterns as well as the deficient interpretability. To address these challenges, our intuition is inspired by the…

    Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  38. D-FaST: Cognitive Signal Decoding with Disentangled Frequency-Spatial-Temporal Attention

    Authors: Weiguo Chen, Changjian Wang, Kele Xu, Yuan Yuan, Yanru Bai, Dongsong Zhang

    Abstract: Cognitive Language Processing (CLP), situated at the intersection of Natural Language Processing (NLP) and cognitive science, plays a progressively pivotal role in the domains of artificial intelligence, cognitive intelligence, and brain science. Among the essential areas of investigation in CLP, Cognitive Signal Decoding (CSD) has made remarkable achievements, yet there still exist challenges rel…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 18 pages, 9 figures. Accepted by IEEE Transactions on Cognitive and Developmental Systems

  39. arXiv:2406.01911  [pdf, ps, other]

    cs.SI cs.DS

    Influence Maximization in Hypergraphs by Stratified Sampling for Efficient Generation of Reverse Reachable Sets

    Authors: Lingling Zhang, Hong Jiang, Ye Yuan, Guoren Wang

    Abstract: Given a hypergraph, influence maximization (IM) is to discover a seed set containing $k$ vertices that have the maximal influence. Although the existing vertex-based IM algorithms perform better than the hyperedge-based algorithms by generating random reverse reachable (RR) sets, they are inefficient because (i) they ignore important structural information associated with hyperedges and thus ob…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 15 pages, 10 figures

  40. arXiv:2405.19697  [pdf, other]

    math.OC cs.AI cs.LG stat.ML

    Bilevel reinforcement learning via the development of hyper-gradient without lower-level convexity

    Authors: Yan Yang, Bin Gao, Ya-xiang Yuan

    Abstract: Bilevel reinforcement learning (RL), which features intertwined two-level problems, has attracted growing interest recently. The inherent non-convexity of the lower-level RL problem is, however, an impediment to developing bilevel optimization methods. By employing the fixed point equation associated with the regularized RL, we characterize the hyper-gradient via fully first-order informatio…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 43 pages, 1 figure, 1 table

  41. arXiv:2405.18356  [pdf, other]

    eess.IV cs.CV

    Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography

    Authors: Jie Liu, Yixiao Zhang, Kang Wang, Mehmet Can Yavuz, Xiaoxi Chen, Yixuan Yuan, Haoliang Li, Yang Yang, Alan Yuille, Yucheng Tang, Zongwei Zhou

    Abstract: The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted to Medical Image Analysis

  42. Negative as Positive: Enhancing Out-of-distribution Generalization for Graph Contrastive Learning

    Authors: Zixu Wang, Bingbing Xu, Yige Yuan, Huawei Shen, Xueqi Cheng

    Abstract: Graph contrastive learning (GCL), standing as the dominant paradigm in the realm of graph pre-training, has yielded considerable progress. Nonetheless, its capacity for out-of-distribution (OOD) generalization has been relatively underexplored. In this work, we point out that the traditional optimization of InfoNCE in GCL restricts the cross-domain pairs only to be negative samples, which inevitab…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: 5 pages, 5 figures, In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24), July 14-18, 2024, Washington, DC, USA

    ACM Class: I.2

  43. arXiv:2405.16064  [pdf, other]

    cs.CL

    Keypoint-based Progressive Chain-of-Thought Distillation for LLMs

    Authors: Kaituo Feng, Changsheng Li, Xiaolu Zhang, Jun Zhou, Ye Yuan, Guoren Wang

    Abstract: Chain-of-thought distillation is a powerful technique for transferring reasoning abilities from large language models (LLMs) to smaller student models. Previous methods typically require the student to mimic the step-by-step rationale produced by LLMs, often facing the following challenges: (i) Tokens within a rationale vary in significance, and treating them equally may fail to accurately mimic k…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: Accepted by ICML 2024

  44. arXiv:2405.13964  [pdf, other]

    cs.LG cs.CE

    Design Editing for Offline Model-based Optimization

    Authors: Ye Yuan, Youyuan Zhang, Can Chen, Haolun Wu, Zixuan Li, Jianmo Li, James J. Clark, Xue Liu

    Abstract: Offline model-based optimization (MBO) aims to maximize a black-box objective function using only an offline dataset of designs and scores. A prevalent approach involves training a conditional generative model on existing designs and their associated scores, followed by the generation of new designs conditioned on higher target scores. However, these newly generated designs often underperform due…

    Submitted 26 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  45. arXiv:2405.12954  [pdf, other]

    cs.LG cs.AI

    A Method on Searching Better Activation Functions

    Authors: Haoyuan Sun, Zihao Wu, Bo Xia, Pu Chang, Zibin Dong, Yifu Yuan, Yongzhe Chang, Xueqian Wang

    Abstract: The success of artificial neural networks (ANNs) hinges greatly on the judicious selection of an activation function, which introduces non-linearity into networks and enables them to model sophisticated relationships in data. However, the search for activation functions has largely relied on empirical knowledge in the past, lacking theoretical guidance, which has hindered the identification of more effe…

    Submitted 22 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

    Comments: 16 pages, 3 figures

  46. arXiv:2405.12530  [pdf, other]

    cs.NI

    Multi-hop Multi-RIS Wireless Communication Systems: Multi-reflection Path Scheduling and Beamforming

    Authors: Xiaoyan Ma, Haixia Zhang, Xianhao Chen, Yuguang Fang, Dongfeng Yuan

    Abstract: Reconfigurable intelligent surface (RIS) provides a promising way to proactively augment propagation environments for better transmission performance in wireless communications. Existing multi-RIS works mainly focus on link-level optimization with predetermined transmission paths, which cannot be directly extended to system-level management, since they neither consider the interference caused by u…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE Transactions on Wireless Communications

  47. arXiv:2405.11804  [pdf, other]

    cs.CL

    (Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

    Authors: Minghao Wu, Yulin Yuan, Gholamreza Haffari, Longyue Wang

    Abstract: Recent advancements in machine translation (MT) have significantly enhanced translation quality across various domains. However, the translation of literary texts remains a formidable challenge due to their complex language, figurative expressions, and cultural nuances. In this work, we introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemen…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: work in progress

  48. arXiv:2405.10825  [pdf, other]

    eess.SY cs.LG

    Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities

    Authors: Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu

    Abstract: Large language models (LLMs) have received considerable attention recently due to their outstanding comprehension and reasoning capabilities, leading to great progress in many fields. The advancement of LLM techniques also offers promising opportunities to automate many tasks in the telecommunication (telecom) field. After pre-training and fine-tuning, LLMs can perform diverse downstream tasks bas…

    Submitted 17 May, 2024; originally announced May 2024.

  49. arXiv:2405.10452  [pdf, other]

    cs.CL cs.LG

    Navigating Public Sentiment in the Circular Economy through Topic Modelling and Hyperparameter Optimisation

    Authors: Junhao Song, Yingfang Yuan, Kaiwen Chang, Bing Xu, Jin Xuan, Wei Pang

    Abstract: To advance the circular economy (CE), it is crucial to gain insights into the evolution of public sentiments, cognitive pathways of the masses concerning circular products and digital technology, and recognise the primary concerns. To achieve this, we collected data related to the CE from diverse platforms including Twitter, Reddit, and The Guardian. This comprehensive data collection spanned acro…

    Submitted 16 May, 2024; originally announced May 2024.

  50. arXiv:2405.09333  [pdf, other]

    cs.CV

    Application of Gated Recurrent Units for CT Trajectory Optimization

    Authors: Yuedong Yuan, Linda-Sophie Schneider, Andreas Maier

    Abstract: Recent advances in computed tomography (CT) imaging, especially with dual-robot systems, have introduced new challenges for scan trajectory optimization. This paper presents a novel approach using Gated Recurrent Units (GRUs) to optimize CT scan trajectories. Our approach exploits the flexibility of robotic CT systems to select projections that enhance image quality by improving resolution and con…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: 4 pages, 6 figures