Showing 1–50 of 427 results for author: Cheng, Z

Searching in archive cs. Search in all archives.
  1. arXiv:2407.17491  [pdf, other]

    cs.CV cs.LG

    Robust Adaptation of Foundation Models with Black-Box Visual Prompting

    Authors: Changdae Oh, Gyeongdeok Seo, Geunyoung Jung, Zhi-Qi Cheng, Hosik Choi, Jiyoung Jung, Kyungwoo Song

    Abstract: With the surge of large-scale pre-trained models (PTMs), adapting these models to numerous downstream tasks becomes a crucial problem. Consequently, parameter-efficient transfer learning (PETL) of large models has grasped huge attention. While PETL methods show impressive performance, they commonly rely on two optimistic assumptions: 1) the entire parameters of a PTM are available, and 2) a suffic…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Extended work from the CVPR'23 paper: arXiv:2303.14773; This paper has been submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) for possible publication

  2. arXiv:2407.14029  [pdf, other]

    cs.CV cs.LG

    PASS++: A Dual Bias Reduction Framework for Non-Exemplar Class-Incremental Learning

    Authors: Fei Zhu, Xu-Yao Zhang, Zhen Cheng, Cheng-Lin Liu

    Abstract: Class-incremental learning (CIL) aims to recognize new classes incrementally while maintaining the discriminability of old classes. Most existing CIL methods are exemplar-based, i.e., storing a part of old data for retraining. Without relearning old data, those methods suffer from catastrophic forgetting. In this paper, we figure out two inherent problems in CIL, i.e., representation bias and clas…

    Submitted 19 July, 2024; originally announced July 2024.

  3. arXiv:2407.12274  [pdf, other]

    cs.CV

    MDPE: A Multimodal Deception Dataset with Personality and Emotional Characteristics

    Authors: Cong Cai, Shan Liang, Xuefei Liu, Kang Zhu, Zhengqi Wen, Jianhua Tao, Heng Xie, Jizhou Cui, Yiming Ma, Zhenhua Cheng, Hanzhe Xu, Ruibo Fu, Bin Liu, Yongwei Li

    Abstract: Deception detection has garnered increasing attention in recent years due to the significant growth of digital media and heightened ethical and security concerns. It has been extensively studied using multimodal methods, including video, audio, and text. In addition, individual differences in deception production and detection are believed to play a crucial role. Although some studies have utilized…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Code and data are available; Submitted to NeurIPS 2024 Datasets and Benchmarks Track

  4. arXiv:2407.10528  [pdf, other]

    cs.CV

    Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

    Authors: Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

    Abstract: Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  5. arXiv:2407.08506  [pdf, other]

    cs.RO

    Imitation Learning for Robotic Assisted Ultrasound Examination of Deep Venous Thrombosis using Kernelized Movement Primitives

    Authors: Diego Dall'Alba, Lorenzo Busellato, Thiusius Rajeeth Savarimuthu, Zhuoqi Cheng, Iñigo Iturrate

    Abstract: Deep Vein Thrombosis (DVT) is a common yet potentially fatal condition, often leading to critical complications like pulmonary embolism. DVT is commonly diagnosed using Ultrasound (US) imaging, which can be inconsistent due to its high dependence on the operator's skill. Robotic US Systems (RUSs) aim to improve diagnostic test consistency but face challenges with the complex scanning pattern neede…

    Submitted 11 July, 2024; originally announced July 2024.

  6. arXiv:2407.07053  [pdf, other]

    cs.CV

    Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

    Authors: Wenqi Zhang, Zhenglin Cheng, Yuanyu He, Mengna Wang, Yongliang Shen, Zeqi Tan, Guiyang Hou, Mingqian He, Yanna Ma, Weiming Lu, Yueting Zhuang

    Abstract: Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding of abstract images, e.g., charts, maps, or layouts, and visual reasoning capabilities remains quite rudimentary. They often struggle with simple daily tasks, such as reading time from a clock, understanding a flowchart, or planning a route using a road map. In lig…

    Submitted 23 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

    Comments: code: https://github.com/zwq2018/Multi-modal-Self-instruct dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct Leaderboard: https://multi-modal-self-instruct.github.io/

  7. arXiv:2407.05118  [pdf, other]

    cs.CV

    SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

    Authors: Zixu Cheng, Yujiang Pu, Shaogang Gong, Parisa Kordjamshidi, Yu Kong

    Abstract: Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decomp…

    Submitted 15 July, 2024; v1 submitted 6 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  8. arXiv:2407.03636  [pdf, other]

    cs.CV

    Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration

    Authors: Yuhong Zhang, Hengsheng Zhang, Xinning Chai, Zhengxue Cheng, Rong Xie, Li Song, Wenjun Zhang

    Abstract: Image restoration is a classic low-level problem aimed at recovering high-quality images from low-quality images with various degradations such as blur, noise, rain, haze, etc. However, due to the inherent complexity and non-uniqueness of degradation in real-world images, it is challenging for a model trained for single tasks to handle real-world restoration problems effectively. Moreover, existin…

    Submitted 4 July, 2024; originally announced July 2024.

  9. arXiv:2406.19859  [pdf, other]

    cs.AI cs.HC cs.MM

    MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

    Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

    Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition…

    Submitted 4 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

  10. arXiv:2406.19236  [pdf, other]

    cs.AI cs.CV cs.RO

    Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

    Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann

    Abstract: Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activitie…

    Submitted 4 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/

  11. arXiv:2406.15877  [pdf, other]

    cs.SE cs.AI cs.CL

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, et al. (8 additional authors not shown)

    Abstract: Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires…

    Submitted 26 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

    Comments: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)

  12. arXiv:2406.11161  [pdf, other]

    cs.AI cs.MM

    Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

    Authors: Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

    Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing su…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 37 pages, 12 figures, Project: https://github.com/ZebangCheng/Emotion-LLaMA, Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA

  13. arXiv:2406.09375  [pdf, other]

    stat.ML cs.LG math.ST

    Learning conditional distributions on continuous spaces

    Authors: Cyril Bénézet, Ziteng Cheng, Sebastian Jaimungal

    Abstract: We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on neare…

    Submitted 13 June, 2024; originally announced June 2024.

  14. arXiv:2406.09180  [pdf, other]

    cs.LG

    Detection-Rate-Emphasized Multi-objective Evolutionary Feature Selection for Network Intrusion Detection

    Authors: Zi-Hang Cheng, Haopu Shang, Chao Qian

    Abstract: Network intrusion detection is one of the most important issues in the field of cyber security, and various machine learning techniques have been applied to build intrusion detection systems. However, since the number of features to describe the network connections is often large, where some features are redundant or noisy, feature selection is necessary in such scenarios, which can both improve t…

    Submitted 13 June, 2024; originally announced June 2024.

  15. arXiv:2406.08689  [pdf, other]

    cs.CR cs.AI

    Security of AI Agents

    Authors: Yifeng He, Ethan Wang, Yuyang Rong, Zifei Cheng, Hao Chen

    Abstract: The study and development of AI agents have been boosted by large language models. AI agents can function as intelligent assistants and complete tasks on behalf of their users with access to tools and the ability to execute commands in their environments. Through studying and experiencing the workflow of typical AI agents, we have raised several concerns regarding their security. These potential v…

    Submitted 20 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  16. arXiv:2406.07476  [pdf, other]

    cs.CV cs.CL

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

    Abstract: In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data…

    Submitted 17 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: ZC, SL, HZ, YX, and XL contributed equally to this project

  17. arXiv:2406.06279  [pdf, other]

    cs.CL

    Multi-Prompting Decoder Helps Better Language Understanding

    Authors: Zifeng Cheng, Zhaoling Chen, Zhiwei Jiang, Yafeng Yin, Shiping Ge, Yuliang Liu, Qing Gu

    Abstract: Recent Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the outp…

    Submitted 10 June, 2024; originally announced June 2024.

  18. arXiv:2406.06031  [pdf, other]

    cs.IR

    A WT-ResNet based fault diagnosis model for the urban rail train transmission system

    Authors: Zuyu Cheng, Zhengcai Zhao, Yixiao Wang, Wentao Guo, Yufei Wang, Xiang Gao

    Abstract: This study presents a novel fault diagnosis model for urban rail transit systems based on Wavelet Transform Residual Neural Network (WT-ResNet). The model integrates the advantages of wavelet transform for feature extraction and ResNet for pattern recognition, offering enhanced diagnostic accuracy and robustness. Experimental results demonstrate the effectiveness of the proposed model in identifyi…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 12 pages, 10 figures

  19. Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

    Authors: Zhiyuan Cheng, Cheng Han, James Liang, Qifan Wang, Xiangyu Zhang, Dongfang Liu

    Abstract: Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techn…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted in TPAMI'24. Extended from our ICLR'23 publication (arXiv:2301.13487). arXiv admin note: substantial text overlap with arXiv:2301.13487

  20. arXiv:2405.20325  [pdf, other]

    cs.CV

    MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion

    Authors: Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, Xintong Han, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denois…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 23 pages, 18 figures. Project page at https://francis-rings.github.io/MotionFollower/

    MSC Class: 68T45; 68T10

  21. arXiv:2405.18997  [pdf, other]

    stat.ML cs.LG

    Kernel Semi-Implicit Variational Inference

    Authors: Ziheng Cheng, Longlin Yu, Tianyu Xie, Shiyue Zhang, Cheng Zhang

    Abstract: Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner. Due to the intractable densities of semi-implicit distributions, classical SIVI often resorts to surrogates of evidence lower bound (ELBO) that would introduce biases for training. A recent advancement in SIVI, named SIVI-SM, utilizes an alternative…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: ICML 2024 camera ready

  22. arXiv:2405.18347  [pdf, other]

    cs.LG

    Dataset Growth

    Authors: Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You

    Abstract: Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Data publicly available are from different sources with various qualities, and it is impractical to do manual cleaning against noise and redundancy given today's data scale. There are existing techniques for cleaning/selecting the collected data. H…

    Submitted 23 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.20087 by other authors

  23. arXiv:2405.17509  [pdf, other]

    cs.LG

    Reference Neural Operators: Learning the Smooth Dependence of Solutions of PDEs on Geometric Deformations

    Authors: Ze Cheng, Zhongkai Hao, Xiaoqiang Wang, Jianing Huang, Youjia Wu, Xudan Liu, Yiru Zhao, Songming Liu, Hang Su

    Abstract: For partial differential equations on domains of arbitrary shapes, existing works of neural operators attempt to learn a mapping from geometries to solutions. It often requires a large dataset of geometry-solution pairs in order to obtain a sufficiently accurate neural operator. However, for many industrial applications, e.g., engineering design optimization, it can be prohibitive to satisfy the r…

    Submitted 27 May, 2024; originally announced May 2024.

  24. arXiv:2405.16577  [pdf, other]

    stat.ML cs.LG

    Reflected Flow Matching

    Authors: Tianyu Xie, Yu Zhu, Longlin Yu, Tong Yang, Ziheng Cheng, Shiyue Zhang, Xiangyu Zhang, Cheng Zhang

    Abstract: Continuous normalizing flows (CNFs) learn an ordinary differential equation to transform prior samples into data. Flow matching (FM) has recently emerged as a simulation-free approach for training CNFs by regressing a velocity model towards the conditional velocity field. However, on constrained domains, the learned velocity model may lead to undesirable flows that result in highly unnatural sampl…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: ICML 2024 camera-ready

  25. arXiv:2405.14297  [pdf, other]

    cs.LG cs.AI

    Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

    Authors: Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin

    Abstract: The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), resulting in significant computatio…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 9 pages, 21 figures

  26. arXiv:2405.11667  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

    Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

    Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under…

    Submitted 19 May, 2024; originally announced May 2024.

  27. arXiv:2405.10313  [pdf, other]

    cs.AI cs.CL cs.CY cs.LG

    How Far Are We From AGI

    Authors: Tao Feng, Chuanyang Jin, Jingyu Liu, Kunlun Zhu, Haoqin Tu, Zirui Cheng, Guanyu Lin, Jiaxuan You

    Abstract: The evolution of artificial intelligence (AI) has profoundly impacted human society, driving significant advancements in multiple sectors. Yet, the escalating demands on AI have highlighted the limitations of AI's current offerings, catalyzing a movement towards Artificial General Intelligence (AGI). AGI, distinguished by its ability to execute diverse real-world tasks with efficiency and effectiv…

    Submitted 16 May, 2024; originally announced May 2024.

  28. arXiv:2405.08463  [pdf, other]

    cs.CV

    A Timely Survey on Vision Transformer for Deepfake Detection

    Authors: Zhikan Wang, Zhongyao Cheng, Jiajie Xiong, Xun Xu, Tianrui Li, Bharadwaj Veeravalli, Xulei Yang

    Abstract: In recent years, the rapid advancement of deepfake technology has revolutionized content creation, lowering forgery costs while elevating quality. However, this progress brings forth pressing concerns such as infringements on individual rights, national security threats, and risks to public safety. To counter these challenges, various detection methodologies have emerged, with Vision Transformer (…

    Submitted 14 May, 2024; originally announced May 2024.

  29. arXiv:2405.03064  [pdf, other]

    cs.LG cs.AI cs.CR

    RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

    Authors: Zelei Cheng, Xian Wu, Jiahao Yu, Sabrina Yang, Gang Wang, Xinyu Xing

    Abstract: Deep reinforcement learning (DRL) is playing an increasingly important role in real-world applications. However, obtaining an optimally performing DRL agent for complex tasks, especially with sparse rewards, remains a significant challenge. The training of a DRL agent can be often trapped in a bottleneck without further progress. In this paper, we propose RICE, an innovative refining scheme for re…

    Submitted 5 June, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

    Comments: Accepted by ICML 2024

  30. arXiv:2405.00587  [pdf, other]

    cs.CV

    GraCo: Granularity-Controllable Interactive Segmentation

    Authors: Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen

    Abstract: Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant resul…

    Submitted 16 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: CVPR2024 Highlight, Project: https://zhao-yian.github.io/GraCo

  31. arXiv:2404.18398  [pdf, other]

    cs.CL cs.MM

    MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

    Authors: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

    Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-…

    Submitted 28 April, 2024; originally announced April 2024.

  32. arXiv:2404.18243  [pdf, other]

    cs.CL

    LEGENT: Open Platform for Embodied Agents

    Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun

    Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platfo…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: Demo Paper

  33. arXiv:2404.18166  [pdf, other]

    cs.IR

    Behavior-Contextualized Item Preference Modeling for Multi-Behavior Recommendation

    Authors: Mingshi Yan, Fan Liu, Jing Sun, Fuming Sun, Zhiyong Cheng, Yahong Han

    Abstract: In recommender systems, multi-behavior methods have demonstrated their effectiveness in mitigating issues like data sparsity, a common challenge in traditional single-behavior recommendation approaches. These methods typically infer user preferences from various auxiliary behaviors and apply them to the target behavior for recommendations. However, this direct transfer can introduce noise to the t…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: This paper has been accepted by SIGIR 2024

  34. arXiv:2404.17936  [pdf, other]

    cs.CV

    FDCE-Net: Underwater Image Enhancement with Embedding Frequency and Dual Color Encoder

    Authors: Zheng Cheng, Guodong Fan, Jingchun Zhou, Min Gan, C. L. Philip Chen

    Abstract: Underwater images often suffer from various issues such as low brightness, color shift, blurred details, and noise due to light absorption and scattering caused by water and suspended particles. Previous underwater image enhancement (UIE) methods have primarily focused on spatial domain enhancement, neglecting the frequency domain information inherent in the images. However, the degradation factor…

    Submitted 27 April, 2024; originally announced April 2024.

    Comments: 16 pages, 13 figures

  35. arXiv:2404.17297  [pdf, ps, other]

    cs.PL

    Denotation-based Compositional Compiler Verification

    Authors: Zhang Cheng, Jiyang Wu, Di Wang, Qinxiang Cao

    Abstract: A desired but challenging property of compiler verification is compositionality in the sense that the compilation correctness of a program can be deduced from that of its substructures ranging from statements, functions, and modules incrementally. Previously proposed approaches have devoted extensive effort to module-level compositionality based on small-step semantics and simulation theories. Thi…

    Submitted 15 May, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

    Comments: 38 pages, 8 figures

  36. arXiv:2404.15159  [pdf, other]

    cs.CL cs.AI

    MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

    Authors: Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang

    Abstract: Fine-tuning Large Language Models (LLMs) is a common practice to adapt pre-trained models for specific applications. While methods like LoRA have effectively addressed GPU memory constraints during fine-tuning, their performance often falls short, especially in multi-task scenarios. In contrast, Mixture-of-Expert (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance in multi-task…

    Submitted 19 July, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

    Comments: 18 pages, 5 figures

  37. arXiv:2404.15041  [pdf, other]

    cs.CV

    LEAF: Unveiling Two Sides of the Same Coin in Semi-supervised Facial Expression Recognition

    Authors: Fan Zhang, Zhi-Qi Cheng, Jian Zhao, Xiaojiang Peng, Xuelong Li

    Abstract: Semi-supervised learning has emerged as a promising approach to tackle the challenge of label scarcity in facial expression recognition (FER) task. However, current state-of-the-art methods primarily focus on one side of the coin, i.e., generating high-quality pseudo-labels, while overlooking the other side: enhancing expression-relevant representations. In this paper, we unveil both sides of the…

    Submitted 26 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

  38. arXiv:2404.11519  [pdf, other]

    cs.IR

    Disentangled Cascaded Graph Convolution Networks for Multi-Behavior Recommendation

    Authors: Zhiyong Cheng, Jianhua Dong, Fan Liu, Lei Zhu, Xun Yang, Meng Wang

    Abstract: Multi-behavioral recommender systems have emerged as a solution to address data sparsity and cold-start issues by incorporating auxiliary behaviors alongside target behaviors. However, existing models struggle to accurately capture varying user preferences across different behaviors and fail to account for diverse item preferences within behaviors. Various user preference factors (such as price or…

    Submitted 17 April, 2024; originally announced April 2024.

  39. arXiv:2404.10321  [pdf, other]

    cs.IR

    Cluster-based Graph Collaborative Filtering

    Authors: Fan Liu, Shuai Zhao, Zhiyong Cheng, Liqiang Nie, Mohan Kankanhalli

    Abstract: Graph Convolution Networks (GCNs) have significantly succeeded in learning user and item representations for recommendation systems. The core of their efficacy is the ability to explicitly exploit the collaborative signals from both the first- and high-order neighboring nodes. However, most existing GCN-based methods overlook the multiple interests of users while performing high-order graph convol…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: 22 pages, 8 figures

    ACM Class: H.3.3

  40. arXiv:2404.07972  [pdf, other]

    cs.AI cs.CL

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Authors: Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu

    Abstract: Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature…

    Submitted 30 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: 51 pages, 21 figures

  41. arXiv:2404.00924  [pdf, other]

    cs.CV

    BadPart: Unified Black-box Adversarial Patch Attacks against Pixel-wise Regression Tasks

    Authors: Zhiyuan Cheng, Zhaoyi Liu, Tengda Guo, Shiwei Feng, Dongfang Liu, Mingjie Tang, Xiangyu Zhang

    Abstract: Pixel-wise regression tasks (e.g., monocular depth estimation (MDE) and optical flow estimation (OFE)) have been widely involved in our daily life in applications like autonomous driving, augmented reality and video composition. Although certain applications are security-critical or bear societal significance, the adversarial robustness of such models are not sufficiently studied, especially in th…

    Submitted 24 May, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: Paper accepted at ICML 2024

  42. arXiv:2404.00511  [pdf, other]

    cs.CL cs.CV cs.MM

    MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models

    Authors: Zebang Cheng, Fuqiang Niu, Yuxiang Lin, Zhi-Qi Cheng, Bowen Zhang, Xiaojiang Peng

    Abstract: This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveragin…

    Submitted 11 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Ranked 3rd in SemEval '24 Task 3 with F1 of 0.3435, close to 1st & 2nd by 0.0339 & 0.0025

  43. arXiv:2403.18271  [pdf, other]

    cs.CV

    Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding

    Authors: Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou

    Abstract: The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  44. arXiv:2403.17421  [pdf, other]

    cs.IR cs.AI

    MA4DIV: Multi-Agent Reinforcement Learning for Search Result Diversification

    Authors: Yiqun Chen, Jiaxin Mao, Yi Zhang, Dehong Ma, Long Xia, Jun Fan, Daiting Shi, Zhicong Cheng, Simiu Gu, Dawei Yin

    Abstract: The objective of search result diversification (SRD) is to ensure that selected documents cover as many different subtopics as possible. Existing methods primarily utilize a paradigm of "greedy selection", i.e., selecting one document with the highest diversity score at a time. These approaches tend to be inefficient and are easily trapped in a suboptimal state. In addition, some other methods aim…

    Submitted 27 March, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

  45. arXiv:2403.15452  [pdf, other]

    cs.CL cs.AI

    What Are Tools Anyway? A Survey from the Language Model Perspective

    Authors: Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, Graham Neubig

    Abstract: Language models (LMs) are powerful yet mostly for text generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills. However, many works adopt the term "tool" in different ways, raising the question: What is a tool anyway? Subsequently, where and how do tools help LMs? In this survey, we provide a unified definition of tools as external programs used…

    Submitted 18 March, 2024; originally announced March 2024.

  46. arXiv:2403.15274  [pdf]

    q-bio.OT cs.AI

    Bioinformatics and Biomedical Informatics with ChatGPT: Year One Review

    Authors: Jinge Wang, Zien Cheng, Qiuming Yao, Li Liu, Dong Xu, Gangqing Hu

    Abstract: The year 2023 marked a significant surge in the exploration of applying large language model (LLM) chatbots, notably ChatGPT, across various disciplines. We surveyed the applications of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinforma…

    Submitted 12 June, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: Peer-reviewed and accepted by Quantitative Biology

  47. arXiv:2403.14836  [pdf, other]

    cs.CV

    Evaluating Panoramic 3D Estimation in Indoor Lighting Analysis

    Authors: Zining Cheng, Guanzhou Ji

    Abstract: This paper presents the use of panoramic 3D estimation in lighting simulation. Conventional lighting simulation necessitates detailed modeling as input, resulting in significant labor effort and time cost. The 3D layout estimation method directly takes a single panorama as input and generates a lighting simulation model with room geometry and window aperture. We evaluate the simulation results by…

    Submitted 18 May, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: Annual Modeling and Simulation Conference (ANNSIM), May 20-23, 2024, Washington D.C., USA

  48. arXiv:2403.14221  [pdf, other]

    cs.CL

    Improving the Robustness of Large Language Models via Consistency Alignment

    Authors: Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, Dawei Yin

    Abstract: Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improveme…

    Submitted 22 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: Accepted by LREC-COLING 2024

  49. arXiv:2403.11959  [pdf, other]

    cs.CV cs.AI cs.MM

    IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting

    Authors: Hang Wang, Zhi-Qi Cheng, Youtian Du, Lei Zhang

    Abstract: Video Action Counting (VAC) is crucial in analyzing sports, fitness, and everyday activities by quantifying repetitive actions in videos. However, traditional VAC methods have overlooked the complexity of action repetitions, such as interruptions and the variability in cycle duration. Our research addresses the shortfall by introducing a novel approach to VAC, called Irregular Video Action Countin…

    Submitted 20 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: Source code: https://github.com/hwang-cs-ime/IVAC-P2L

  50. arXiv:2403.08229  [pdf, other]

    cs.CL

    Boosting Disfluency Detection with Large Language Model as Disfluency Generator

    Authors: Zhenrong Cheng, Jiayan Guo, Hao Sun, Yan Zhang

    Abstract: Current disfluency detection methods heavily rely on costly and scarce human-annotated data. To tackle this issue, some approaches employ heuristic or statistical features to generate disfluent sentences, partially improving detection performance. However, these sentences often deviate from real-life scenarios, constraining overall model enhancement. In this study, we propose a lightweight data au…

    Submitted 13 March, 2024; originally announced March 2024.