Showing 1–50 of 562 results for author: Wu, B

Searching in archive cs. Search in all archives.
  1. arXiv:2407.19845  [pdf, other]

    cs.LG cs.CR

    BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning

    Authors: Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, Chao Shen

    Abstract: As an emerging approach to explore the vulnerability of deep neural networks (DNNs), backdoor learning has attracted increasing interest in recent years, and many seminal backdoor attack and defense algorithms are being developed successively or concurrently, in the status of a rapid arms race. However, mainly due to the diverse settings, and the difficulties of implementation and reproducibility…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Substantial extensions based on our previous conference version "Backdoorbench: A comprehensive benchmark of backdoor learning" published at NeurIPS D&B Track 2022. 20 backdoor attack algorithms, 32 backdoor defense algorithms, 11000+ pairs of attack-against-defense evaluations, 10 analyses, 18 analysis tools

  2. arXiv:2407.19546  [pdf, other]

    cs.CV

    XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

    Authors: Biao Wu, Yutong Xie, Zeyu Zhang, Minh Hieu Phan, Qi Chen, Ling Chen, Qi Wu

    Abstract: Vision-and-language pretraining (VLP) in the medical field utilizes contrastive learning on image-text pairs to achieve effective transfer across tasks. Yet, current VLP approaches with the masked modelling strategy face two challenges when applied to the medical domain. First, current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data. Second,…

    Submitted 28 July, 2024; originally announced July 2024.

  3. arXiv:2407.16958  [pdf, other]

    cs.LG cs.AI

    Cheems: Wonderful Matrices More Efficient and More Effective Architecture

    Authors: Jingze Shi, Lu He, Yuhan Wang, Tianyu He, Bingheng Wu, Mingkun Hou

    Abstract: Recent studies have shown that relative position encoding performs well in selective state space model scanning algorithms, and the architecture that balances SSM and Attention enhances the efficiency and effectiveness of the algorithm, while the sparse activation of the mixture of experts reduces the training cost. I studied the effectiveness of using different position encodings in structured s…

    Submitted 24 July, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

  4. arXiv:2407.16364  [pdf, other]

    cs.CV

    Harmonizing Visual Text Comprehension and Generation

    Authors: Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie

    Abstract: In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervi…

    Submitted 23 July, 2024; originally announced July 2024.

  5. arXiv:2407.13937  [pdf, other]

    cs.CV

    Boosting Online 3D Multi-Object Tracking through Camera-Radar Cross Check

    Authors: Sheng-Yao Kuan, Jen-Hao Cheng, Hsiang-Wei Huang, Wenhao Chai, Cheng-Yen Yang, Hugo Latapie, Gaowen Liu, Bing-Fei Wu, Jenq-Neng Hwang

    Abstract: In the domain of autonomous driving, the integration of multi-modal perception techniques based on data from diverse sensors has demonstrated substantial progress. Effectively surpassing the capabilities of state-of-the-art single-modality detectors through sensor fusion remains an active challenge. This work leverages the respective advantages of cameras in perspective view and radars in Bird's E…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 2024 IEEE Intelligent Vehicles Symposium (IV)

  6. arXiv:2407.13168  [pdf, other]

    cs.AI cs.CL

    SciCode: A Research Coding Benchmark Curated by Scientists

    Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, et al. (5 additional authors not shown)

    Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields,…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: 25 pages, 9 figures, 7 tables

  7. arXiv:2407.11277  [pdf, other]

    cs.CL eess.AS

    Target conversation extraction: Source separation using turn-taking dynamics

    Authors: Tuochao Chen, Qirui Wang, Bohan Wu, Malek Itani, Sefik Emre Eskimez, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in h…

    Submitted 29 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by Interspeech 2024

  8. arXiv:2407.09751  [pdf, other]

    cs.CV

    TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation

    Authors: Xiaopei Wu, Yuenan Hou, Xiaoshui Huang, Binbin Lin, Tong He, Xinge Zhu, Yuexin Ma, Boxi Wu, Haifeng Liu, Deng Cai, Wanli Ouyang

    Abstract: Training deep models for LiDAR semantic segmentation is challenging due to the inherent sparsity of point clouds. Utilizing temporal data is a natural remedy against the sparsity problem as it makes the input signal denser. However, previous multi-frame fusion algorithms fall short in utilizing sufficient temporal information due to the memory constraint, and they also ignore the informative tempo…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Accepted by CVPR 2024

  9. arXiv:2407.08572  [pdf, other]

    cs.CV

    Boosting Adversarial Transferability for Skeleton-based Action Recognition via Exploring the Model Posterior Space

    Authors: Yunfeng Diao, Baiqi Wu, Ruixuan Zhang, Xun Yang, Meng Wang, He Wang

    Abstract: Skeletal motion plays a pivotal role in human activity recognition (HAR). Recently, attack methods have been proposed to identify the universal vulnerability of skeleton-based HAR (S-HAR). However, the research of adversarial transferability on S-HAR is largely missing. More importantly, existing attacks all struggle in transfer across unknown S-HAR models. We observed that the key reason is that t…

    Submitted 11 July, 2024; originally announced July 2024.

  10. arXiv:2407.05033  [pdf, other]

    cs.IR

    Preference Distillation for Personalized Generative Recommendation

    Authors: Jerome Ramos, Bin Wu, Aldo Lipani

    Abstract: Recently, researchers have investigated the capabilities of Large Language Models (LLMs) for generative recommender systems. Existing LLM-based recommender models are trained by adding user and item IDs to a discrete prompt template. However, the disconnect between IDs and natural language makes it difficult for the LLM to learn the relationship between users. To address this issue, we propose a P…

    Submitted 6 July, 2024; originally announced July 2024.

  11. arXiv:2407.03037  [pdf, other]

    cs.SE

    Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model

    Authors: Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang

    Abstract: With the advancement of software rendering techniques, GUI pages in mobile apps now encompass a wealth of visual information, where the visual semantics of each page contribute to the overall app logic, presenting new challenges to software testing. Despite the progress in automated Graphical User Interface (GUI) testing, the absence of testing oracles has constrained its efficacy to identify only…

    Submitted 3 July, 2024; originally announced July 2024.

  12. arXiv:2407.01976  [pdf, other]

    cs.CL cs.AI cs.MM

    A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

    Authors: Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang

    Abstract: Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In th…

    Submitted 24 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

  13. arXiv:2407.01541  [pdf]

    cs.NI

    Integration of Computer Networks and Artificial Neural Networks for an AI-based Network Operator

    Authors: Binbin Wu, Jingyu Xu, Yifan Zhang, Bo Liu, Yulu Gong, Jiaxin Huang

    Abstract: This paper proposes an integrated approach combining computer networks and artificial neural networks to construct an intelligent network operator, functioning as an AI model. State information from computer networks is transformed into embedded vectors, enabling the operator to efficiently recognize different pieces of information and accurately output appropriate operations for the computer netw…

    Submitted 9 April, 2024; originally announced July 2024.

  14. arXiv:2407.00468  [pdf, other]

    cs.CV cs.AI cs.CL

    MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

    Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

    Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

  15. arXiv:2407.00102  [pdf, other]

    cs.LG cs.AI cs.CL

    Curriculum Learning with Quality-Driven Data Selection

    Authors: Biao Wu, Fang Meng, Ling Chen

    Abstract: The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the…

    Submitted 27 June, 2024; originally announced July 2024.

  16. Multi-agent Cooperative Games Using Belief Map Assisted Training

    Authors: Qinwei Huang, Chen Luo, Alex B. Wu, Simon Khan, Hai Li, Qinru Qiu

    Abstract: In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learn…

    Submitted 27 June, 2024; originally announced June 2024.

    Journal ref: ECAI 2023. IOS Press, 2023: 1617-1624

  17. arXiv:2406.19188  [pdf, other]

    cs.LG

    Averaging log-likelihoods in direct alignment

    Authors: Nathan Grinsztajn, Yannis Flet-Berliac, Mohammad Gheshlaghi Azar, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Olivier Pietquin, Matthieu Geist

    Abstract: To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involvin…

    Submitted 27 June, 2024; originally announced June 2024.

  18. arXiv:2406.18373  [pdf, other]

    cs.CL cs.SD eess.AS

    Dynamic Data Pruning for Automatic Speech Recognition

    Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

    Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  19. arXiv:2406.18187  [pdf, other]

    cs.CL cs.AI cs.LG

    Selective Prompting Tuning for Personalized Conversations with LLMs

    Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang

    Abstract: In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 findings

  20. arXiv:2406.17803  [pdf, other]

    cs.CL cs.AI cs.IR

    Understanding the Role of User Profile in the Personalization of Large Language Models

    Authors: Bin Wu, Zhengyan Shi, Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz

    Abstract: Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance the performance on a wide range of tasks. However, the precise role of user profiles and their effect mechanism on LLMs remains unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we inves…

    Submitted 22 June, 2024; originally announced June 2024.

  21. arXiv:2406.17519  [pdf, other]

    cs.CL

    Entropy-Based Decoding for Retrieval-Augmented Large Language Models

    Authors: Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King

    Abstract: Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue, where the generated responses are negatively influenced by noise from both external and internal knowledge sources. In this paper, we introduce a novel, trainin…

    Submitted 25 June, 2024; originally announced June 2024.

  22. arXiv:2406.17419  [pdf, other]

    cs.CL cs.AI

    Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

    Authors: Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li

    Abstract: Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-contex…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: We release our code and data publicly at https://github.com/MozerWang/Loong

  23. arXiv:2406.16866  [pdf, other]

    cs.CV

    Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

    Authors: Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang

    Abstract: Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, capture LMMs' comprehensive capabilities. We begin with…

    Submitted 24 June, 2024; originally announced June 2024.

  24. arXiv:2406.16495  [pdf, other]

    cs.CL cs.AI

    OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

    Authors: Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

    Abstract: Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the seq…

    Submitted 19 July, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  25. arXiv:2406.16254  [pdf, other]

    cs.LG cs.AI cs.CL

    Confidence Regulation Neurons in Language Models

    Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

    Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized b…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 25 pages, 14 figures

  26. arXiv:2406.15797  [pdf, other]

    cs.LG cs.AI

    Synergistic Deep Graph Clustering Network

    Authors: Benyu Wu, Shifei Ding, Xiao Xu, Lili Guo, Ling Ding, Xindong Wu

    Abstract: Employing graph neural networks (GNNs) to learn cohesive and discriminative node representations for clustering has shown promising results in deep graph clustering. However, existing methods disregard the reciprocal relationship between representation learning and structure augmentation. This study suggests that enhancing embedding and structure synergistically becomes imperative for GNNs to unle…

    Submitted 22 June, 2024; originally announced June 2024.

  27. arXiv:2406.13357  [pdf, other]

    cs.CL cs.SD eess.AS

    Transferable speech-to-text large language model alignment module

    Authors: Boyong Wu, Chao Yan, Haoran Pu

    Abstract: By leveraging the power of Large Language Models (LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation (ST) and question answering (SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achiev…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024; 5 pages, 2 figures

  28. DLP: towards active defense against backdoor attacks with decoupled learning process

    Authors: Zonghao Ying, Bin Wu

    Abstract: Deep learning models are well known to be susceptible to backdoor attack, where the attacker only needs to provide a tampered dataset on which the triggers are injected. Models trained on the dataset will passively implant the backdoor, and triggers on the input can mislead the models during testing. Our study shows that the model shows different learning behaviors in clean and poisoned subsets du…

    Submitted 18 June, 2024; originally announced June 2024.

  29. arXiv:2406.11157  [pdf, other]

    cs.CR

    DeFiGuard: A Price Manipulation Detection Service in DeFi using Graph Neural Networks

    Authors: Dabao Wang, Bang Wu, Xingliang Yuan, Lei Wu, Yajin Zhou, Helei Cui

    Abstract: The prosperity of Decentralized Finance (DeFi) unveils underlying risks, with reported losses surpassing 3.2 billion USD between 2018 and 2022 due to vulnerabilities in Decentralized Applications (DApps). One significant threat is the Price Manipulation Attack (PMA) that alters asset prices during transaction execution. As a result, PMA accounts for over 50 million USD in losses. To address the ur…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 7 figures

  30. NBA: defensive distillation for backdoor removal via neural behavior alignment

    Authors: Zonghao Ying, Bin Wu

    Abstract: Recently, deep neural networks have been shown to be vulnerable to backdoor attacks. A backdoor is inserted into neural networks via this attack paradigm, thus compromising the integrity of the network. As soon as an attacker presents a trigger during the testing phase, the backdoor in the model is activated, allowing the network to make specific wrong predictions. It is extremely important to def…

    Submitted 16 June, 2024; originally announced June 2024.

  31. arXiv:2406.05756  [pdf, other]

    cs.AI cs.CL cs.CV cs.MM

    EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

    Authors: Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei

    Abstract: The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. Therefore, we construct EmbSpatial-Bench, a benchmark for evaluating embodied spatial…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024 Main

  32. arXiv:2406.04264  [pdf, other]

    cs.CV cs.AI cs.CL

    MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

    Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu

    Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To addres…

    Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  33. arXiv:2406.03215  [pdf, other]

    cs.CV

    Searching Priors Makes Text-to-Video Synthesis Better

    Authors: Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu

    Abstract: Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive data and train the model on it, but this would be extremely expensive. To alleviate this…

    Submitted 5 June, 2024; originally announced June 2024.

  34. arXiv:2406.01326  [pdf, other]

    cs.CV

    TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

    Authors: Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Houqiang Li, Can Huang

    Abstract: Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present a novel large vision-language model, TabPedia, equipped with a concept synergy me…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 20 pages, 8 figures

  35. arXiv:2406.00587  [pdf, other]

    cs.CV

    Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction, because the real-world is actually video-based rather th…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Champion Solution for CVPR 2024 PVUW VSS Track. arXiv admin note: text overlap with arXiv:2306.02894

  36. arXiv:2406.00500  [pdf, other]

    cs.CV

    2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Video Panoptic Segmentation (VPS) is a challenging task that extends from image panoptic segmentation. VPS aims to simultaneously classify, track, segment all objects in a video, including both things and stuff. Due to its wide application in many downstream tasks such as video understanding, video editing, and autonomous driving. In order to deal with the task of video panoptic segmentation in…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 2nd Place Solution for CVPR 2024 PVUW VPS Track

  37. arXiv:2405.19092  [pdf, other]

    cs.CV

    Benchmarking and Improving Detail Image Caption

    Authors: Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

    Abstract: Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) research discusses model's image captioning performance because of the outdated short-caption benchmarks and unreliable evaluation metrics. In this work, we propose to benchmark detail image caption task by curating high-quality evaluation datasets annota…

    Submitted 7 July, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  38. arXiv:2405.17876  [pdf, other]

    cs.LG cs.DC math.OC

    Decentralized Directed Collaboration for Personalized Federated Learning

    Authors: Yingqi Liu, Yifan Shi, Qinglun Li, Baoyuan Wu, Xueqian Wang, Li Shen

    Abstract: Personalized Federated Learning (PFL) is proposed to find the greatest personalized models for each client. To avoid the central failure and communication bottleneck in the server-based FL, we concentrate on the Decentralized Personalized Federated Learning (DPFL) that performs distributed model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL are based on undirected and sy…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: CVPR 2024. arXiv admin note: text overlap with arXiv:2305.15157

  39. arXiv:2405.17871  [pdf, other]

    cs.CV cs.AI cs.CL

    Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

    Authors: Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

    Abstract: Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing the text tokens that are less correlated with or even contradictory with the input images. In this paper, we advocate for assigning distinct contributions…

    Submitted 28 May, 2024; originally announced May 2024.

  40. arXiv:2405.16134  [pdf, other]

    cs.CV

    Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack

    Authors: Mingli Zhu, Siyuan Liang, Baoyuan Wu

    Abstract: Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance on reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To address it, we re-investigate the characteristics…

    Submitted 30 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  41. arXiv:2405.16112  [pdf, other]

    cs.CR cs.CV

    Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor

    Authors: Shaokui Wei, Hongyuan Zha, Baoyuan Wu

    Abstract: Data-poisoning backdoor attacks are serious security threats to machine learning models, where an adversary can manipulate the training dataset to inject backdoors into models. In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned. Unlike most existing methods that primarily detect and remove/unlearn suspicious samp…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: 13 pages, 5 figures and 5 tables

  42. arXiv:2405.14407  [pdf, other]

    cs.LG

    Gradient Transformation: Towards Efficient and Model-Agnostic Unlearning for Dynamic Graph Neural Networks

    Authors: He Zhang, Bang Wu, Xiangwen Yang, Xingliang Yuan, Chengqi Zhang, Shirui Pan

    Abstract: Graph unlearning has emerged as an essential tool for safeguarding user privacy and mitigating the negative impacts of undesirable data. Meanwhile, the advent of dynamic graph neural networks (DGNNs) marks a significant advancement due to their superior capability in learning from dynamic graphs, which encapsulate spatial-temporal variations in diverse real-world applications (e.g., traffic foreca…

    Submitted 23 May, 2024; originally announced May 2024.

  43. arXiv:2405.14394  [pdf, other]

    cs.CL cs.AI

    Instruction Tuning With Loss Over Instructions

    Authors: Zhengyan Shi, Adam X. Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani

    Abstract: Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Code is available at https://github.com/ZhengxiangShi/InstructionModelling

  44. arXiv:2405.11286  [pdf, other]

    cs.CV

    Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion

    Authors: Zeyu Zhang, Yiran Wang, Biao Wu, Shuo Chen, Zhiyuan Zhang, Shiya Huang, Wenbo Zhang, Meng Fang, Ling Chen, Yang Zhao

    Abstract: In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, with integrating these two aspects proving to be a persistent challenge. A…

    Submitted 18 May, 2024; originally announced May 2024.

  45. arXiv:2405.10497  [pdf, other]

    cs.MM cs.AI cs.CV cs.SI

    SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge

    Authors: Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, Jiebo Luo

    Abstract: Social Media Popularity Prediction (SMPP) is a crucial task that involves automatically predicting future popularity values of online posts, leveraging vast amounts of multimodal data available on social media platforms. Studying and investigating social media popularity becomes central to various online applications and requires novel methods of comprehensive analysis, multimodal comprehension, a…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: ACM Multimedia. arXiv admin note: text overlap with arXiv:1910.01795

  46. arXiv:2405.09713  [pdf, other]

    cs.CV cs.AI cs.CL

    SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

    Authors: Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan

    Abstract: Learning commonsense reasoning from visual contexts and scenes in real-world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically withi…

    Submitted 16 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: CVPR

  47. arXiv:2405.09711  [pdf, other]

    cs.AI cs.CL cs.CV

    STAR: A Benchmark for Situated Reasoning in Real-World Videos

    Authors: Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, Chuang Gan

    Abstract: Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: NeurIPS

  48. arXiv:2405.06676  [pdf, other]

    cs.CL cs.AI cs.AR

    EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD

    Authors: Bing-Yue Wu, Utsav Sharma, Sai Rahul Dhanvi Kankipati, Ajay Yadav, Bintu Kappil George, Sai Ritish Guntupalli, Austin Rovinski, Vidya A. Chhabria

    Abstract: Large language models (LLMs) serve as powerful tools for design, providing capabilities for both task automation and design assistance. Recent advancements have shown tremendous potential for facilitating LLM integration into the chip design process; however, many of these works rely on data that are not publicly available and/or not permissively licensed for use in LLM training and distribution.…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Under review at Workshop on LLM-Aided Design (LAD'24)

  49. arXiv:2405.03003  [pdf, other]

    cs.LG cs.AI cs.CL

    Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

    Authors: Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, Jia Li

    Abstract: Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices $A$ and $B$ to represent the weight change, i.e., $ΔW=BA$. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or larger base models. In this work, we aim to fur…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: Accepted by ICML 2024

  50. arXiv:2404.19384  [pdf, other]

    cs.CV cs.AI

    Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection

    Authors: Zhanwei Zhang, Minghao Chen, Shuai Xiao, Liang Peng, Hengjia Li, Binbin Lin, Ping Li, Wenxiao Wang, Boxi Wu, Deng Cai

    Abstract: Recent self-training techniques have shown notable improvements in unsupervised domain adaptation for 3D object detection (3D UDA). These techniques typically select pseudo labels, i.e., 3D boxes, to supervise models for the target domain. However, this selection process inevitably introduces unreliable 3D boxes, in which 3D points cannot be definitively assigned as foreground or background. Previ…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR2024