Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Showing 1–50 of 8,083 results for author: Zhang, Y

Searching in archive cs. Search in all archives.
.
  1. Coding-Enhanced Cooperative Jamming for Secret Communication in Fluid Antenna Systems

    Authors: Hao Xu, Kai-Kit Wong, Wee Kiat New, Guyue Li, Farshad Rostami Ghadi, Yongxu Zhu, Shi Jin, Chan-Byoung Chae, Yangyang Zhang

    Abstract: This letter investigates the secret communication problem for a fluid antenna system (FAS)-assisted wiretap channel, where the legitimate transmitter transmits an information-bearing signal to the legitimate receiver, and at the same time, transmits a jamming signal to interfere with the eavesdropper (Eve). Unlike the conventional jamming scheme, which usually transmits Gaussian noise that interfe… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 6 pages, 3 figures, this paper has been accepted by IEEE Communications Letters

  2. arXiv:2407.02350  [pdf, other

    cs.CV

    Conceptual Codebook Learning for Vision-Language Models

    Authors: Yi Zhang, Ke Yu, Siqi Wu, Zhihai He

    Abstract: In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  3. arXiv:2407.02228  [pdf, other

    cs.CV cs.AI

    MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders

    Authors: Baijiong Lin, Weisen Jiang, Pengguang Chen, Yu Zhang, Shu Liu, Ying-Cong Chen

    Abstract: Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-t… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  4. arXiv:2407.02182  [pdf, other

    cs.CV cs.RO eess.IV

    Occlusion-Aware Seamless Segmentation

    Authors: Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, Kailun Yang

    Abstract: Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Ble… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The fresh dataset and the source code will be made publicly available at https://github.com/yihong-97/OASS

  5. arXiv:2407.02052  [pdf, other

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

    Authors: Minghui Wu, Luzhen Xu, Jie Zhang, Haitao Tang, Yanyan Yue, Ruizhi Liao, Jintao Zhao, Zhengzhe Zhang, Yichi Wang, Haoyin Yan, Hongliang Yu, Tongle Ma, Jiachen Liu, Chongliang Wu, Yongchao Li, Yanyong Zhang, Xin Fang, Yue Zhang

    Abstract: This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position,… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted at ICASSP 2024

  6. arXiv:2407.02040  [pdf, other

    cs.CV cs.AI cs.MM

    ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

    Authors: Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang

    Abstract: By leveraging the text-to-image diffusion priors, score distillation can synthesize 3D contents without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have been focused on learning a text-to-3D generative network for amortizing multiple text-3D relations, which can synthesize 3D contents in seconds. However, existing score distillatio… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024. Code available at https://github.com/theEricMa/ScaleDreamer

  7. arXiv:2407.01964  [pdf, other

    cs.CL

    Enabling Discriminative Reasoning in Large Language Models for Legal Judgment Prediction

    Authors: Chenlong Deng, Kelong Mao, Yuyao Zhang, Zhicheng Dou

    Abstract: Legal judgment prediction is essential for enhancing judicial efficiency. In this work, we identify that existing large language models (LLMs) underperform in this domain due to challenges in understanding case complexities and distinguishing between similar charges. To adapt LLMs for effective legal judgment prediction, we introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspire… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  8. arXiv:2407.01950  [pdf, other

    cs.RO cs.AI

    LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance

    Authors: Wenhao Yu, Jie Peng, Huanyu Yang, Junrui Zhang, Yifan Duan, Jianmin Ji, Yanyong Zhang

    Abstract: The conditional diffusion model has been demonstrated as an efficient tool for learning robot policies, owing to its advancement to accurately model the conditional distribution of policies. The intricate nature of real-world scenarios, characterized by dynamic obstacles and maze-like structures, underscores the complexity of robot local navigation decision-making as a conditional distribution pro… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 8 pages, 6 figures, accepted by IROS 2024

  9. arXiv:2407.01945  [pdf, other

    cs.CV

    Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

    Authors: Zhaoshuai Qi, Yifeng Hao, Rui Hu, Wenyou Chang, Jiaqi Yang, Yanning Zhang

    Abstract: Structured light-based method with a camera-projector pair (CPP) plays a vital role in indoor 3D reconstruction, especially for scenes with weak textures. Previous methods usually assume known intrinsics, which are pre-calibrated from known objects, or self-calibrated from multi-view observations. It is still challenging to reliably recover CPP intrinsics from only two views without any known obje… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  10. arXiv:2407.01910  [pdf, other

    cs.LG cs.AI cs.AR

    MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation

    Authors: Yongan Zhang, Zhongzhi Yu, Yonggan Fu, Cheng Wan, Yingyan, Lin

    Abstract: Large Language Models (LLMs) have recently shown promise in streamlining hardware design processes by encapsulating vast amounts of domain-specific data. In addition, they allow users to interact with the design processes through natural language instructions, thus making hardware design more accessible to developers. However, effectively leveraging LLMs in hardware design necessitates providing d… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted in ISLAD 2024

  11. arXiv:2407.01866  [pdf, other

    cs.CV cs.GR

    Image-GS: Content-Adaptive Image Representation via 2D Gaussians

    Authors: Yunxiang Zhang, Alexandr Kuznetsov, Akshay Jindal, Kenneth Chen, Anton Sochenov, Anton Kaplanyan, Qi Sun

    Abstract: Neural image representations have recently emerged as a promising technique for storing, streaming, and rendering visual data. Coupled with learning-based workflows, these novel representations have demonstrated remarkable visual fidelity and memory efficiency. However, existing neural image representations often rely on explicit uniform data structures without content adaptivity or computation-in… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  12. arXiv:2407.01863  [pdf, other

    cs.CL

    VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

    Authors: Qiucheng Wu, Handong Zhao, Michael Saxon, Trung Bui, William Yang Wang, Yang Zhang, Shiyu Chang

    Abstract: Vision language models (VLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, the ways that these capabilities combine are not always intuitive and warrant direct investigation. One understudied capability in VLMs is visual spatial planning -- the ability to comprehend the spatial arrangements of obje… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  13. arXiv:2407.01796  [pdf, other

    cs.CL

    Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation

    Authors: Sirui Xia, Xintao Wang, Jiaqing Liang, Yifei Zhang, Weikang Zhou, Jiaji Deng, Fei Yu, Yanghua Xiao

    Abstract: Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) in knowledge-intensive tasks. Recently, Attributed Text Generation (ATG) has attracted growing attention, which provides citations to support the model's responses in RAG, so as to enhance the credibility of LLM-generated content and facilitate verification. Prior methods mainly adopt coarse-graine… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 15 pages,2 figures

  14. arXiv:2407.01791  [pdf, other

    cs.CV cs.AI

    μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

    Authors: Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung-Levy

    Abstract: Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research. Vision-language models (VLMs) offer a promising solution for large-scale biological image analysis, enhancing researchers' efficiency, identifying new image biomarkers, and accelerating hypothesis generation and scientific discovery. However, there is a lack of standa… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  15. arXiv:2407.01579  [pdf, other

    cs.CV

    Technical Report for CVPR 2024 WeatherProof Dataset Challenge: Semantic Segmentation on Paired Real Data

    Authors: Guojin Cao, Jiaxu Li, Jia He, Ying Min, Yunhao Zhang

    Abstract: This technical report presents the implementation details of 2nd winning for CVPR'24 UG2 WeatherProof Dataset Challenge. This challenge aims at semantic segmentation of images degraded by various degrees of weather from all around the world. We addressed this problem by introducing a pre-trained large-scale vision foundation model: InternImage, and trained it using images with different levels of… ▽ More

    Submitted 9 June, 2024; originally announced July 2024.

  16. arXiv:2407.01541  [pdf

    cs.NI

    Integration of Computer Networks and Artificial Neural Networks for an AI-based Network Operator

    Authors: Binbin Wu, Jingyu Xu, Yifan Zhang, Bo Liu, Yulu Gong, Jiaxin Huang

    Abstract: This paper proposes an integrated approach combining computer networks and artificial neural networks to construct an intelligent network operator, functioning as an AI model. State information from computer networks is transformed into embedded vectors, enabling the operator to efficiently recognize different pieces of information and accurately output appropriate operations for the computer netw… ▽ More

    Submitted 9 April, 2024; originally announced July 2024.

  17. arXiv:2407.01531  [pdf, other

    cs.RO cs.LG

    Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

    Authors: Yixiao Wang, Yifei Zhang, Mingxiao Huo, Ran Tian, Xiang Zhang, Yichen Xie, Chenfeng Xu, Pengliang Ji, Wei Zhan, Mingyu Ding, Masayoshi Tomizuka

    Abstract: The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). B… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  18. arXiv:2407.01494  [pdf, other

    cs.CV cs.SD eess.AS

    FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

    Authors: Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen

    Abstract: We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e.,, semantic relevant and temporal synchronized) sounds. To overcome these limitations… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Project page: https://foleycrafter.github.io/

  19. arXiv:2407.01468  [pdf, other

    cs.RO

    Active Shadowing (ASD): Manipulating Visual Perception of Robotics Behaviors via Implicit Communication

    Authors: Andrew Boateng, Prakhar Bhartiya, Yu Zhang

    Abstract: Explicit communication is often valued for its directness during interaction. Implicit communication, on the other hand, is indirect in that its communicative content must be inferred. Implicit communication is considered more desirable in teaming situations that requires reduced interruptions for improved fluency. In this paper, we investigate another unique advantage of implicit communication: i… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  20. arXiv:2407.01299  [pdf, other

    cs.CV

    Preserving Full Degradation Details for Blind Image Super-Resolution

    Authors: Hongda Liu, Longguang Wang, Ye Zhang, Kaiwen Xue, Shunbo Zhou, Yulan Guo

    Abstract: The performance of image super-resolution relies heavily on the accuracy of degradation information, especially under blind settings. Due to absence of true degradation models in real-world scenarios, previous methods learn distinct representations by distinguishing different degradations in a batch. However, the most significant degradation differences may provide shortcuts for the learning of re… ▽ More

    Submitted 2 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 18 pages, 11 figures, 4 tables

  21. arXiv:2407.01284  [pdf, other

    cs.AI cs.CL cs.CV cs.LG cs.SC

    We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang

    Abstract: Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduc… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Work in progress

  22. arXiv:2407.01209  [pdf, other

    cs.RO

    6-DoF Grasp Detection in Clutter with Enhanced Receptive Field and Graspable Balance Sampling

    Authors: Hanwen Wang, Ying Zhang, Yunlong Wang, Jian Li

    Abstract: 6-DoF grasp detection of small-scale grasps is crucial for robots to perform specific tasks. This paper focuses on enhancing the recognition capability of small-scale grasping, aiming to improve the overall accuracy of grasping prediction results and the generalization ability of the network. We propose an enhanced receptive field method that includes a multi-radii cylinder grouping module and a p… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  23. arXiv:2407.01009  [pdf, other

    cs.CL

    DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models

    Authors: Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, Haizhou Li

    Abstract: Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This pape… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  24. arXiv:2407.00981  [pdf, other

    cs.HC cs.CL

    VisEval: A Benchmark for Data Visualization in the Era of Large Language Models

    Authors: Nan Chen, Yuge Zhang, Jiahang Xu, Kan Ren, Yuqing Yang

    Abstract: Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However,… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  25. arXiv:2407.00952  [pdf, other

    cs.LG cs.CL cs.DC

    SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

    Authors: Zheng Lin, Xuanjie Hu, Yuxin Zhang, Zhe Chen, Zihan Fang, Xianhao Chen, Ang Li, Praneeth Vepakomma, Yue Gao

    Abstract: The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently h… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 9 pages, 3 figures

  26. arXiv:2407.00949  [pdf, ps, other

    cs.CV eess.IV

    SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection

    Authors: Yanheng Wang, Xiaohan Yu, Yongsheng Gao, Jianjun Sha, Jian Wang, Lianru Gao, Yonggang Zhang, Xianhui Rong

    Abstract: It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, tr… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  27. arXiv:2407.00933  [pdf, other

    cs.DC eess.SP

    Reconfigurable Intelligent Computational Surfaces for MEC-Assisted Autonomous Driving Networks: Design Optimization and Analysis

    Authors: Xueyao Zhang, Bo Yang, Zhiwen Yu, Xuelin Cao, George C. Alexandropoulos, Yan Zhang, Merouane Debbah, Chau Yuen

    Abstract: This paper investigates autonomous driving safety improvement via task offloading from cellular vehicles (CVs) to a multi-access edge computing (MEC) server using vehicle-to-infrastructure (V2I) links. Considering that the latter links can be reused by vehicle-to-vehicle (V2V) communications to improve spectrum utilization, the receiver of the V2I link may suffer from severe interference that can… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  28. Unified Dual-Intent Translation for Joint Modeling of Search and Recommendation

    Authors: Yuting Zhang, Yiqing Wu, Ruidong Han, Ying Sun, Yongchun Zhu, Xiang Li, Wei Lin, Fuzhen Zhuang, Zhulin An, Yongjun Xu

    Abstract: Recommendation systems, which assist users in discovering their preferred items among numerous options, have served billions of users across various online platforms. Intuitively, users' interactions with items are highly driven by their unchanging inherent intents (e.g., always preferring high-quality items) and changing demand intents (e.g., wanting a T-shirt in summer but a down jacket in winte… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  29. arXiv:2407.00906  [pdf, other

    cs.CV cs.LG

    GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

    Authors: Yuming Zhang, Dongzhi Guan, Shouxin Zhang, Junhao Su, Yunzhi Han, Jiabin Liu

    Abstract: Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance,… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  30. arXiv:2407.00869  [pdf, other

    cs.CL cs.AI

    Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks

    Authors: Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang

    Abstract: We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, language models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious y… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  31. arXiv:2407.00753  [pdf, other

    eess.AS cs.SD

    FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

    Authors: Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang

    Abstract: While recent advances in Text-To-Speech synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models is limited. This paper introduces FLY-TTS, a new fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) We replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients followed b… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024. 5 pages, 1 figure

  32. UWBAD: Towards Effective and Imperceptible Jamming Attacks Against UWB Ranging Systems with COTS Chips

    Authors: Yuqiao Yang, Zhongjie Wu, Yongzhao Zhang, Ting Chen, Jun Li, Jie Yang, Wenhao Liu, Xiaosong Zhang, Ruicong Shi, Jingwei Li, Yu Jiang, Zhuo Su

    Abstract: UWB ranging systems have been adopted in many critical and security sensitive applications due to its precise positioning and secure ranging capabilities. We present a practical jamming attack, namely UWBAD, against commercial UWB ranging systems, which exploits the vulnerability of the adoption of the normalized cross-correlation process in UWB ranging and can selectively and quickly block rangin… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security

  33. arXiv:2407.00653  [pdf, other

    cs.CL cs.AI

    Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs

    Authors: Yifei Zhang, Xintao Wang, Jiaqing Liang, Sirui Xia, Lida Chen, Yanghua Xiao

    Abstract: Large Language Models (LLMs) have exhibited impressive proficiency in various natural language processing (NLP) tasks, which involve increasingly complex reasoning. Knowledge reasoning, a primary type of reasoning, aims at deriving new knowledge from existing one.While it has been widely studied in the context of knowledge graphs (KGs), knowledge reasoning in LLMs remains underexplored. In this pa… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  34. arXiv:2407.00634  [pdf, other

    cs.CV cs.LG

    Tarsier: Recipes for Training and Evaluating Large Video Description Models

    Authors: Jiawei Wang, Liping Yuan, Yuchen Zhang

    Abstract: Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a met… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  35. arXiv:2407.00617  [pdf, other

    cs.LG cs.AI cs.CL cs.GT

    Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

    Authors: Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

    Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-th… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  36. arXiv:2407.00578  [pdf, other

    cs.RO

    UniQuad: A Unified and Versatile Quadrotor Platform Series for UAV Research and Application

    Authors: Yichen Zhang, Xinyi Chen, Peize Liu, Junzhe Wang, Hetai Zou, Shaojie Shen

    Abstract: As quadrotors take on an increasingly diverse range of roles, researchers often need to develop new hardware platforms tailored for specific tasks, introducing significant engineering overhead. In this article, we introduce the UniQuad series, a unified and versatile quadrotor platform series that offers high flexibility to adapt to a wide range of common tasks, excellent customizability for advan… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to 40th Anniversary of the IEEE Conference on Robotics and Automation (ICRA-X40)

  37. arXiv:2407.00577  [pdf, other

    cs.RO

    FALCON: Fast Autonomous Aerial Exploration using Coverage Path Guidance

    Authors: Yichen Zhang, Xinyi Chen, Chen Feng, Boyu Zhou, Shaojie Shen

    Abstract: This paper introduces FALCON, a novel Fast Autonomous expLoration framework using COverage path guidaNce, which aims at setting a new performance benchmark in the field of autonomous aerial exploration. Despite recent advancements in the domain, existing exploration planners often suffer from inefficiencies such as frequent revisitations of previously explored regions. FALCON effectively harnesses… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  38. arXiv:2407.00563  [pdf, other

    cs.RO

    An abstract theory of sensor eventification

    Authors: Yulin Zhang, Dylan A. Shell

    Abstract: Unlike traditional cameras, event cameras measure changes in light intensity and report differences. This paper examines the conditions necessary for other traditional sensors to admit eventified versions that provide adequate information despite outputting only changes. The requirements depend upon the regularity of the signal space, which we show may depend on several factors including structure… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 21 pages, 14 figures

    Journal ref: Robotics: Science and Systems 2024

  39. arXiv:2407.00499  [pdf, other

    cs.CL cs.AI cs.LG

    ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

    Authors: Zhiyuan Wang, Jinhao Duan, Lu Cheng, Yue Zhang, Qingni Wang, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu

    Abstract: Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 13 pages, 9 figures, 6 tables

  40. arXiv:2407.00478  [pdf, other

    cs.LG cs.AI

    Knowledge-Aware Parsimony Learning: A Perspective from Relational Graphs

    Authors: Quanming Yao, Yongqi Zhang, Yaqing Wang, Nan Yin, James Kwok, Qiang Yang

    Abstract: The scaling law, a strategy that involves the brute-force scaling of the training dataset and learnable parameters, has become a prevalent approach for developing stronger learning models. In this paper, we examine its rationale in terms of learning from relational graphs. We demonstrate that directly adhering to such a scaling law does not necessarily yield stronger models due to architectural in… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  41. arXiv:2407.00468  [pdf, other

    cs.CV cs.AI cs.CL

    MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

    Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

    Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

  42. arXiv:2407.00386  [pdf, other

    cs.NE cs.AI

    Multi-task multi-constraint differential evolution with elite-guided knowledge transfer for coal mine integrated energy system dispatching

    Authors: Canyun Dai, Xiaoyan Sun, Hejuan Hu, Wei Song, Yong Zhang, Dunwei Gong

    Abstract: The dispatch optimization of coal mine integrated energy system is challenging due to high dimensionality, strong coupling constraints, and multiobjective. Existing constrained multiobjective evolutionary algorithms struggle with locating multiple small and irregular feasible regions, making them inaplicable to this problem. To address this issue, we here develop a multitask evolutionary algorithm… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  43. arXiv:2407.00367  [pdf, other

    cs.CV

    SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

    Authors: Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang

    Abstract: Video generation models have demonstrated great capabilities of producing impressive monocular videos, however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on stereoscopic… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 3D stereoscopic video generation, video diffusion, inpainting

  44. arXiv:2407.00203  [pdf, other

    cs.CV

    PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration

    Authors: Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang

    Abstract: Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology, serving as backbones for applications such as zero-shot image classification and Whole Slide Image (WSI) analysis. Additionally, they can function as vision encoders when combined with large language models (LLMs) to support broader capabilities. Current efforts to train pathology VLMs rely on pathology imag… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: 13 pages, 3 figures

  45. arXiv:2407.00088  [pdf, other

    cs.DC cs.AI

    T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

    Authors: Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang

    Abstract: The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed precision matrix multiplication (mpGEMM) of low precision weights and high precision activations during inference. Existing systems, lacking native sup… ▽ More

    Submitted 25 June, 2024; originally announced July 2024.

  46. arXiv:2407.00023  [pdf, other

    cs.DC cs.LG

    Preble: Efficient Distributed Prompt Scheduling for LLM Serving

    Authors: Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang

    Abstract: Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices include domain-specific instructions, illustration of tool usages, and long context, such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today's LLM s… ▽ More

    Submitted 8 May, 2024; originally announced July 2024.

  47. arXiv:2406.20087  [pdf, other

    cs.LG cs.AI cs.CL cs.CY cs.HC

    ProgressGym: Alignment with a Millennium of Moral Progress

    Authors: Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang

    Abstract: Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigat… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  48. arXiv:2406.20076  [pdf, other

    cs.CV

    EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

    Authors: Yuxuan Zhang, Tianheng Cheng, Rui Hu, ei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

    Abstract: Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Preprint

  49. arXiv:2406.20015  [pdf, other

    cs.CL cs.AI

    ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models

    Authors: Yuxiang Zhang, Jing Chen, Junjie Wang, Yaxin Liu, Cheng Yang, Chufan Shi, Xinyu Zhu, Zihao Lin, Hanwen Wan, Yujiu Yang, Tetsuya Sakai, Tian Feng, Hayato Yamana

    Abstract: Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and bre… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  50. arXiv:2406.19972  [pdf, other

    cs.RO

    HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

    Authors: Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, Cewu Lu

    Abstract: Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are limited to specific object dynamics and privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. A teacher-stud… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.