
Showing 1–50 of 5,838 results for author: Zhang, J

Searching in archive cs. Search in all archives.
  1. arXiv:2407.02482  [pdf, other]

    cs.CV

    Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

    Authors: Fei Shen, Hu Ye, Sibo Liu, Jun Zhang, Cong Wang, Xiao Han, Wei Yang

    Abstract: Recent research showcases the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which predominantly generate stories in an autoregressive and excessively caption-dependent manner, often underrate the contextual consistency and relevance of frames during sequential generation. To address this, we propose a novel Rich-contextual Condi…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.02411  [pdf, other]

    cs.CV cs.CR cs.MM

    Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs

    Authors: Jinmin Li, Kuofeng Gao, Yang Bai, Jingyun Zhang, Shu-Tao Xia

    Abstract: The advent of video-based Large Language Models (LLMs) has significantly enhanced video understanding. However, it has also raised some safety concerns regarding data protection, as videos can be more easily annotated, even without authorization. This paper introduces Video Watermarking, a novel technique to protect videos from unauthorized annotations by such video-based LLMs, especially concerni…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2403.13507

  3. arXiv:2407.02345  [pdf, other]

    cs.CL

    MORPHEUS: Modeling Role from Personalized Dialogue History by Exploring and Utilizing Latent Space

    Authors: Yihong Tang, Bo Wang, Dongming Zhao, Xiaojia Jin, Jijun Zhang, Ruifang He, Yuexian Hou

    Abstract: Personalized Dialogue Generation (PDG) aims to create coherent responses according to roles or personas. Traditional PDG relies on external role data, which can be scarce and raise privacy concerns. Approaches address these issues by extracting role information from dialogue history, which often fail to generically model roles in continuous space. To overcome these limitations, we introduce a nove…

    Submitted 2 July, 2024; originally announced July 2024.

  4. arXiv:2407.02182  [pdf, other]

    cs.CV cs.RO eess.IV

    Occlusion-Aware Seamless Segmentation

    Authors: Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, Kailun Yang

    Abstract: Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Ble…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The fresh dataset and the source code will be made publicly available at https://github.com/yihong-97/OASS

  5. arXiv:2407.02095  [pdf, other]

    cs.SE

    TIGER: A Generating-Then-Ranking Framework for Practical Python Type Inference

    Authors: Chong Wang, Jian Zhang, Yiling Lou, Mingwei Liu, Weisong Sun, Yang Liu, Xin Peng

    Abstract: Python's dynamic typing system offers flexibility and expressiveness but can lead to type-related errors, prompting the need for automated type inference to enhance type hinting. While existing learning-based approaches show promising inference accuracy, they struggle with practical challenges in comprehensively handling various types, including complex generic types and (unseen) user-defined type…

    Submitted 2 July, 2024; originally announced July 2024.

  6. arXiv:2407.02052  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

    Authors: Minghui Wu, Luzhen Xu, Jie Zhang, Haitao Tang, Yanyan Yue, Ruizhi Liao, Jintao Zhao, Zhengzhe Zhang, Yichi Wang, Haoyin Yan, Hongliang Yu, Tongle Ma, Jiachen Liu, Chongliang Wu, Yongchao Li, Yanyong Zhang, Xin Fang, Yue Zhang

    Abstract: This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position,…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted at ICASSP 2024

  7. arXiv:2407.01956  [pdf, other]

    eess.SY cs.RO

    Cloud-Edge-Terminal Collaborative AIGC for Autonomous Driving

    Authors: Jianan Zhang, Zhiwei Wei, Boxun Liu, Xiayi Wang, Yong Yu, Rongqing Zhang

    Abstract: In dynamic autonomous driving environments, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIG…

    Submitted 2 July, 2024; originally announced July 2024.

  8. arXiv:2407.01950  [pdf, other]

    cs.RO cs.AI

    LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance

    Authors: Wenhao Yu, Jie Peng, Huanyu Yang, Junrui Zhang, Yifan Duan, Jianmin Ji, Yanyong Zhang

    Abstract: The conditional diffusion model has been demonstrated as an efficient tool for learning robot policies, owing to its advancement to accurately model the conditional distribution of policies. The intricate nature of real-world scenarios, characterized by dynamic obstacles and maze-like structures, underscores the complexity of robot local navigation decision-making as a conditional distribution pro…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 8 pages, 6 figures, accepted by IROS 2024

  9. arXiv:2407.01893  [pdf, other]

    cs.HC

    CausalPrism: A Visual Analytics Approach for Subgroup-based Causal Heterogeneity Exploration

    Authors: Jiehui Zhou, Xumeng Wang, Wong Kam-Kwai, Wei Zhang, Xingyu Liu, Juntian Zhang, Minfeng Zhu, Wei Chen

    Abstract: In causal inference, estimating Heterogeneous Treatment Effects (HTEs) from observational data is critical for understanding how different subgroups respond to treatments, with broad applications such as precision medicine and targeted advertising. However, existing work on HTE, subgroup discovery, and causal visualization is insufficient to address two challenges: first, the sheer number of poten…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 12 pages, 7 figures

  10. arXiv:2407.01872  [pdf, other]

    cs.CV cs.RO eess.IV

    Referring Atomic Video Action Recognition

    Authors: Kunyu Peng, Jia Fu, Kailun Yang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

    Abstract: We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic acti…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR

  11. arXiv:2407.01577  [pdf, other]

    q-fin.TR cs.AI cs.LG

    MOT: A Mixture of Actors Reinforcement Learning Method by Optimal Transport for Algorithmic Trading

    Authors: Xi Cheng, Jinghao Zhang, Yunan Zeng, Wenfang Xue

    Abstract: Algorithmic trading refers to executing buy and sell orders for specific assets based on automatically identified trading opportunities. Strategies based on reinforcement learning (RL) have demonstrated remarkable capabilities in addressing algorithmic trading problems. However, the trading patterns differ among market conditions due to shifted distribution data. Ignoring multiple patterns in the…

    Submitted 2 June, 2024; originally announced July 2024.

    Comments: 13 pages, 5 figures, PAKDD2024 accepted

  12. arXiv:2407.01262  [pdf, other]

    cs.LG

    Complementary Fusion of Deep Network and Tree Model for ETA Prediction

    Authors: YuRui Huang, Jie Zhang, HengDa Bao, Yang Yang, Jian Yang

    Abstract: Estimated time of arrival (ETA) is a very important factor in the transportation system. It has attracted increasing attention and has been widely used as a basic service in navigation systems and intelligent transportation systems. In this paper, we propose a novel solution to the ETA estimation problem, which is an ensemble of tree models and neural networks. We proved the accuracy and robustne…

    Submitted 1 July, 2024; originally announced July 2024.

  13. arXiv:2407.01085  [pdf, other]

    cs.LG cs.CL

    Rethinking LLM-based Preference Evaluation

    Authors: Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Jingang Wang, Zhenyu Chen, Jieyu Zhao, Hui Xiong

    Abstract: Recently, large language model (LLM)-based preference evaluation has been widely adopted to compare pairs of model responses. However, a severe bias towards lengthy responses has been observed, raising concerns about the reliability of this evaluation method. In this work, we designed a series of controlled experiments to study the major impacting factors of the metric of LLM-based preference eval…

    Submitted 1 July, 2024; originally announced July 2024.

  14. arXiv:2407.00909  [pdf, other]

    cs.IR cs.CV

    Heterogeneous Graph-based Framework with Disentangled Representations Learning for Multi-target Cross Domain Recommendation

    Authors: Xiaopeng Liu, Juan Zhang, Chongqi Ren, Shenghui Xu, Zhaoming Pan, Zhimin Zhang

    Abstract: CDR (Cross-Domain Recommendation), i.e., leveraging information from multiple domains, is a critical solution to the data sparsity problem in recommendation systems. The majority of previous research either focused on single-target CDR (STCDR) by utilizing data from the source domains to improve the model's performance on the target domain, or applied dual-target CDR (DTCDR) by integrating data from th…

    Submitted 30 June, 2024; originally announced July 2024.

  15. arXiv:2407.00769  [pdf, other]

    quant-ph cs.DC

    Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation

    Authors: Rong Fu, Zhongling Su, Han-Sen Zhong, Xiti Zhao, Jianyang Zhang, Feng Pan, Pan Zhang, Xianhe Zhao, Ming-Cheng Chen, Chao-Yang Lu, Jian-Wei Pan, Zhiling Pei, Xingcheng Zhang, Wanli Ouyang

    Abstract: Quantum Computational Superiority boasts rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's Sycamore, challenges remain in generating uncorrelated samples of random quantum circuits. In this paper, we present a groundbreaking large-scale system technology that leverages optimization on global, node, and de…

    Submitted 30 June, 2024; originally announced July 2024.

  16. arXiv:2407.00708  [pdf, other]

    cs.LG

    Heterogeneous Graph Contrastive Learning with Spectral Augmentation

    Authors: Jing Zhang, Xiaoqian Jiang, Yingjie Xie, Cangqi Zhou

    Abstract: Heterogeneous graphs can well describe the complex entity relationships in the real world. For example, online shopping networks contain multiple physical types of consumers and products, as well as multiple relationship types such as purchasing and favoriting. More and more scholars pay attention to this research because heterogeneous graph representation learning shows strong application potenti…

    Submitted 30 June, 2024; originally announced July 2024.

  17. arXiv:2407.00297  [pdf]

    eess.IV cs.CV

    UADSN: Uncertainty-Aware Dual-Stream Network for Facial Nerve Segmentation

    Authors: Guanghao Zhu, Lin Liu, Jing Zhang, Xiaohui Du, Ruqian Hao, Juanxiu Liu

    Abstract: Facial nerve segmentation is crucial for preoperative path planning in cochlear implantation surgery. Recently, researchers have proposed some segmentation methods, such as atlas-based and deep learning-based methods. However, since the facial nerve is a tubular organ with a diameter of only 1.0-1.5mm, it is challenging to locate and segment the facial nerve in CT scans. In this work, we propose a…

    Submitted 28 June, 2024; originally announced July 2024.

  18. Personalized Federated Continual Learning via Multi-granularity Prompt

    Authors: Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li

    Abstract: Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation at the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning…

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: Accepted by KDD 2024 Research Track

  19. arXiv:2407.00024  [pdf, other]

    cs.CV cs.AI cs.MM

    LMVD: A Large-Scale Multimodal Vlog Dataset for Depression Detection in the Wild

    Authors: Lang He, Kai Chen, Junnan Zhao, Yimeng Wang, Ercheng Pei, Haifeng Chen, Jiewei Jiang, Shiqing Zhang, Jie Zhang, Zhongmin Wang, Tao He, Prayag Tiwari

    Abstract: Depression can significantly impact many aspects of an individual's life, including their personal and social functioning, academic and work performance, and overall quality of life. Many researchers within the field of affective computing are adopting deep learning technology to explore potential patterns related to the detection of depression. However, because of subjects' privacy protection con…

    Submitted 8 May, 2024; originally announced July 2024.

  20. arXiv:2406.19976  [pdf, other]

    cs.LG math.OC

    ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting

    Authors: Rui Pan, Jipeng Zhang, Xingyuan Pan, Renjie Pi, Xiaoyu Wang, Tong Zhang

    Abstract: Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particu…

    Submitted 28 June, 2024; originally announced June 2024.

  21. arXiv:2406.19796  [pdf, other]

    eess.IV cs.CV

    Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting

    Authors: Wei Li, Jingyang Zhang, Pheng-Ann Heng, Lixu Gu

    Abstract: Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. Task-Incremental Learning (TIL) offers a privacy-preserving training paradigm using tasks arriving sequentially, instead of gathering them due to strict data sharing policies. However, the task evolution can span a wide scope that involves shifts in both image appearanc…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Accepted by MICCAI24

  22. arXiv:2406.19649  [pdf]

    eess.IV cs.CV

    AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation

    Authors: Guanghao Zhu, Jing Zhang, Juanxiu Liu, Xiaohui Du, Ruqian Hao, Yong Liu, Lin Liu

    Abstract: Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). First…

    Submitted 28 June, 2024; originally announced June 2024.

  23. arXiv:2406.19485  [pdf, other]

    eess.IV cs.CV

    GAPNet: Granularity Attention Network with Anatomy-Prior-Constraint for Carotid Artery Segmentation

    Authors: Lin Zhang, Chenggang Lu, Xin-yang Shi, Caifeng Shan, Jiong Zhang, Da Chen, Laurent D. Cohen

    Abstract: Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carot…

    Submitted 27 June, 2024; originally announced June 2024.

  24. arXiv:2406.19251  [pdf, other]

    cs.CL cs.AI

    AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation

    Authors: Jia Fu, Xiaoting Qin, Fangkai Yang, Lu Wang, Jue Zhang, Qingwei Lin, Yubo Chen, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

    Abstract: Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for the Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates the hyper-parameter tuning as an online multi-armed bandit (MAB) problem…

    Submitted 27 June, 2024; originally announced June 2024.

  25. arXiv:2406.19101  [pdf, other]

    cs.CV

    DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

    Authors: Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

    Abstract: Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual t…

    Submitted 27 June, 2024; originally announced June 2024.

  26. arXiv:2406.18849  [pdf, other]

    cs.CV

    Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs

    Authors: Jie Zhang, Zhongqi Wang, Mengqi Lei, Zheng Yuan, Bei Yan, Shiguang Shan, Xilin Chen

    Abstract: Currently, many benchmarks have been proposed to evaluate the perception ability of Large Vision-Language Models (LVLMs). However, most benchmarks construct questions by selecting images from existing datasets, resulting in potential data leakage. Besides, these benchmarks merely focus on evaluating LVLMs on the realistic style images and clean scenarios, leaving the multi-stylized images and…

    Submitted 26 June, 2024; originally announced June 2024.

  27. arXiv:2406.18518  [pdf, other]

    cs.CL cs.AI cs.LG cs.SE

    APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

    Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

    Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scal…

    Submitted 26 June, 2024; originally announced June 2024.

  28. arXiv:2406.18351  [pdf, other]

    cs.LG cs.AI

    Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control

    Authors: Zifan Liu, Xinran Li, Shibo Chen, Gen Li, Jiashuo Jiang, Jun Zhang

    Abstract: Reinforcement learning (RL) has proven to be well-performed and general-purpose in the inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded due to two limitations of online experience. First, online experience is expensive to acquire in real-world applications. With the low sample efficiency nature of RL algorithms, it would take extensive time to train…

    Submitted 26 June, 2024; originally announced June 2024.

  29. arXiv:2406.18327  [pdf, other]

    eess.IV cs.CV cs.LG

    Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation

    Authors: Yuxuan Qi, Li Lin, Jiajun Wang, Jingya Zhang, Bin Zhang

    Abstract: Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by network…

    Submitted 26 June, 2024; originally announced June 2024.

  30. arXiv:2406.18284  [pdf, other]

    cs.CV

    RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

    Authors: Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Jian Yang, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Donghao Luo, Chengjie Wang

    Abstract: Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-qual…

    Submitted 26 June, 2024; originally announced June 2024.

  31. arXiv:2406.18242  [pdf, other]

    cs.CV eess.IV

    ConStyle v2: A Strong Prompter for All-in-One Image Restoration

    Authors: Dongqi Fan, Junhao Zhang, Liang Chang

    Abstract: This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, pre-training ConStyle alone, and then freezing its…

    Submitted 26 June, 2024; originally announced June 2024.

  32. arXiv:2406.18152  [pdf, other]

    cs.MA

    Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning

    Authors: Junkai Zhang, Yifan Zhang, Xi Sheryl Zhang, Yifan Zang, Jian Cheng

    Abstract: Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of a…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: The AAAI-2024 paper with the appendix

  33. arXiv:2406.18074  [pdf, other]

    cs.CV cs.AI

    Few-Shot Medical Image Segmentation with High-Fidelity Prototypes

    Authors: Song Tang, Shaxu Yan, Xiaozhi Qi, Jianxin Gao, Mao Ye, Jianwei Zhang, Xiatian Zhu

    Abstract: Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Although prototype-based approaches have achieved substantial success, existing models are limited to imaging scenarios with considerably distinct objects and not highly complex background, e.g., natural images. This makes such models suboptimal fo…

    Submitted 26 June, 2024; originally announced June 2024.

  34. arXiv:2406.18037  [pdf, other]

    cs.CV

    Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation

    Authors: Dunyuan Xu, Xi Wang, Jingyang Zhang, Pheng-Ann Heng

    Abstract: The ability to learn sequentially from different data sites is crucial for a deep network in solving practical medical image diagnosis problems due to privacy restrictions and storage limitations. However, adapting to an incoming site leads to catastrophic forgetting on past sites and decreases generalizability on unseen sites. Existing Continual Learning (CL) and Domain Generalization (DG) methods ha…

    Submitted 25 June, 2024; originally announced June 2024.

  35. arXiv:2406.18011  [pdf, other]

    cs.CV

    Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

    Authors: Yijie Yang, Jinlu Zhang, Jiaxu Zhang, Zhigang Tu

    Abstract: In the realm of skeleton-based action recognition, the traditional methods which rely on coarse body keypoints fall short of capturing subtle human actions. In this work, we propose Expressive Keypoints that incorporates hand and foot details to form a fine-grained skeletal representation, improving the discriminative ability for existing models in discerning intricate actions. To efficiently mode…

    Submitted 25 June, 2024; originally announced June 2024.

  36. arXiv:2406.17898  [pdf, other]

    cs.RO cs.AI

    Human-centered In-building Embodied Delivery Benchmark

    Authors: Zhuoqun Xu, Yang Liu, Xiaoqi Li, Jiyao Zhang, Hao Dong

    Abstract: Recently, the concept of embodied intelligence has been widely accepted and popularized, leading people to naturally consider the potential for commercialization in this field. In this work, we propose a specific commercial scenario simulation, human-centered in-building embodied delivery. Furthermore, for this scenario, we have developed a brand-new virtual environment system from scratch, constr…

    Submitted 25 June, 2024; originally announced June 2024.

  37. arXiv:2406.17768  [pdf, other]

    cs.RO cs.AI cs.LG

    EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data

    Authors: Jesse Zhang, Minho Heo, Zuxin Liu, Erdem Biyik, Joseph J Lim, Yao Liu, Rasool Fakoor

    Abstract: Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either re…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 22 pages, 13 figures

  38. arXiv:2406.17758  [pdf, other]

    cs.CV

    MotionBooth: Motion-Aware Customized Text-to-Video Generation

    Authors: Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen

    Abstract: In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach presents subject region loss and video preservation loss to enhance t…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Project page at https://jianzongwu.github.io/projects/motionbooth

  39. arXiv:2406.17538  [pdf, other]

    cs.CV

    SKD-TSTSAN: Three-Stream Temporal-Shift Attention Network Based on Self-Knowledge Distillation for Micro-Expression Recognition

    Authors: Guanghao Zhu, Lin Liu, Yuhao Hu, Haixin Sun, Fang Liu, Xiaohui Du, Ruqian Hao, Juanxiu Liu, Yong Liu, Hao Deng, Jing Zhang

    Abstract: Micro-expressions (MEs) are subtle facial movements that occur spontaneously when people try to conceal their real emotions. Micro-expression recognition (MER) is crucial in many fields, including criminal analysis and psychotherapy. However, MER is challenging since MEs have low intensity and ME datasets are small in size. To this end, a three-stream temporal-shift attention network based on self-k…

    Submitted 25 June, 2024; originally announced June 2024.

  40. arXiv:2406.17520  [pdf, other]

    cs.CV cs.RO

    Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

    Authors: Zonglin Lyu, Juexiao Zhang, Mingxuan Lu, Yiming Li, Chen Feng

    Abstract: Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-bas…

    Submitted 25 June, 2024; originally announced June 2024.

  41. arXiv:2406.17338  [pdf, other]

    eess.IV cs.CV cs.LG

    Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

    Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, Xin Wang

    Abstract: Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propos…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024

  42. arXiv:2406.17305  [pdf, other]

    cs.CL

    Retrieval Augmented Instruction Tuning for Open NER with Large Language Models

    Authors: Tingyu Xie, Jian Zhang, Yan Zhang, Yuanyuan Liang, Qi Li, Hongwei Wang

    Abstract: The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognitio…

    Submitted 25 June, 2024; originally announced June 2024.

  43. arXiv:2406.17272  [pdf, ps, other]

    cs.LG

    A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

    Authors: Van Tung Pham, Yist Lin, Tao Han, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We…

    Submitted 25 June, 2024; originally announced June 2024.

  44. arXiv:2406.17184  [pdf, ps, other]

    cs.LG stat.ML

    Minimax Optimality in Contextual Dynamic Pricing with General Valuation Models

    Authors: Xueping Gong, Jiheng Zhang

    Abstract: Dynamic pricing, the practice of adjusting prices based on contextual factors, has gained significant attention due to its impact on revenue maximization. In this paper, we address the contextual dynamic pricing problem, which involves pricing decisions based on observable product features and customer characteristics. We propose a novel algorithm that achieves improved regret bounds while minimiz…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 29 pages

  45. arXiv:2406.17159  [pdf, other]

    eess.AS cs.MM cs.SD

    Exploring compressibility of transformer based text-to-music (TTM) models

    Authors: Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

    Abstract: State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the var…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Proceedings of INTERSPEECH 2024

  46. arXiv:2406.17115  [pdf, other]

    cs.CV cs.AI

    Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

    Authors: Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

    Abstract: Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tas…

    Submitted 24 June, 2024; originally announced June 2024.

  47. arXiv:2406.17005  [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: The Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  48. arXiv:2406.16710  [pdf, other]

    cs.CV

    Portrait3D: 3D Head Generation from Single In-the-wild Portrait Image

    Authors: Jinkun Hao, Junshu Tang, Jiangning Zhang, Ran Yi, Yijia Hong, Moran Li, Weijian Cao, Yating Wang, Lizhuang Ma

    Abstract: While recent works have achieved great success on one-shot 3D common object generation, high quality and fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framew…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: https://jinkun-hao.github.io/Portrait3D/

  49. arXiv:2406.16271  [pdf, other]

    cs.CV

    Feature-prompting GBMSeg: One-Shot Reference Guided Training-Free Prompt Engineering for Glomerular Basement Membrane Segmentation

    Authors: Xueyu Liu, Guangze Shi, Rui Wang, Yexin Lai, Jianan Zhang, Lele Sun, Quan Yang, Yongfei Wu, Ming Li, Weixia Han, Wen Zheng

    Abstract: Assessment of the glomerular basement membrane (GBM) in transmission electron microscopy (TEM) is crucial for diagnosing chronic kidney disease (CKD). The lack of domain-independent automatic segmentation tools for the GBM necessitates an AI-based solution to automate the process. In this study, we introduce GBMSeg, a training-free framework designed to automatically segment the GBM in TEM images…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted for MICCAI2024

  50. arXiv:2406.16255  [pdf, other]

    cs.LG cs.AI

    Uncertainty-Aware Reward-Free Exploration with General Function Approximation

    Authors: Junkai Zhang, Weitong Zhang, Dongruo Zhou, Quanquan Gu

    Abstract: Mastering multiple tasks through exploration and learning in an environment poses a significant challenge in reinforcement learning (RL). Unsupervised RL has been introduced to address this challenge by training policies with intrinsic rewards rather than extrinsic rewards. However, current intrinsic reward designs and unsupervised RL algorithms often overlook the heterogeneous nature of collected…

    Submitted 29 June, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: 32 pages, 5 figures, 4 tables, accepted by ICML 2024