Showing 1–50 of 488 results for author: Shi, H

Searching in archive cs.
  1. arXiv:2408.12574  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

    Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Layla Isik, Yen-Ling Kuo, Tianmin Shu

    Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can wat…

    Submitted 22 August, 2024; originally announced August 2024.

    Comments: Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code: https://github.com/SCAI-JHU/MuMA-ToM

  2. arXiv:2408.12139  [pdf, ps, other]

    cs.LG cs.AI

    DRExplainer: Quantifiable Interpretability in Drug Response Prediction with Directed Graph Convolutional Network

    Authors: Haoyuan Shi, Tao Xu, Xiaodi Li, Qian Gao, Junfeng Xia, Zhenyu Yue

    Abstract: Predicting the response of a cancer cell line to a therapeutic drug is pivotal for personalized medicine. Despite numerous deep learning methods that have been developed for drug response prediction, integrating diverse information about biological entities and predicting the directional response remain major challenges. Here, we propose a novel interpretable predictive model, DRExplainer, which l…

    Submitted 22 August, 2024; originally announced August 2024.

  3. arXiv:2408.11545  [pdf, other]

    cs.CV

    UNetMamba: Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

    Authors: Enze Zhu, Zhan Chen, Dingkai Wang, Hanru Shi, Xiaoxuan Liu, Lei Wang

    Abstract: The semantic segmentation of high-resolution remote sensing images plays a crucial role in downstream applications such as urban planning and disaster assessment. However, existing Transformer-based methods suffer from the constraint between accuracy and efficiency. To overcome this dilemma, we propose UNetMamba, a novel Mamba-based semantic segmentation model. It incorporates a Mamba Segmentation…

    Submitted 21 August, 2024; originally announced August 2024.

  4. arXiv:2408.11311  [pdf, other]

    cs.AR quant-ph

    HiMA: Hierarchical Quantum Microarchitecture for Qubit-Scaling and Quantum Process-Level Parallelism

    Authors: Qi Zhou, Zi-Hao Mei, Han-Qing Shi, Liang-Liang Guo, Xiao-Yan Yang, Yun-Jie Wang, Xiao-Fan Xu, Cheng Xue, Wei-Cheng Kong, Jun-Chao Wang, Yu-Chun Wu, Zhao-Yun Chen, Guo-Ping Guo

    Abstract: Quantum computing holds immense potential for addressing a myriad of intricate challenges, which is significantly amplified when scaled to thousands of qubits. However, a major challenge lies in developing an efficient and scalable quantum control system. To address this, we propose a novel Hierarchical MicroArchitecture (HiMA) designed to facilitate qubit scaling and exploit quantum process-level…

    Submitted 20 August, 2024; originally announced August 2024.

  5. arXiv:2408.09949  [pdf, other]

    cs.CV cs.CL

    C${^2}$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

    Authors: Zhigang Chen, Benjia Zhou, Yiqing Huang, Jun Wan, Yibo Hu, Hailin Shi, Yanyan Liang, Zhen Lei, Du Zhang

    Abstract: Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotatio…

    Submitted 19 August, 2024; originally announced August 2024.

  6. arXiv:2408.09787  [pdf, other]

    cs.CL cs.CV cs.MM

    Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

    Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang

    Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animatio…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Accepted by SIGGRAPH Asia 2024, Project and Codes: https://github.com/HITsz-TMG/Anim-Director

  7. arXiv:2408.09251  [pdf, other]

    cs.RO cs.AI cs.LG

    V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

    Authors: Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran

    Abstract: Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situ…

    Submitted 17 August, 2024; originally announced August 2024.

  8. arXiv:2408.08921  [pdf, other]

    cs.AI cs.CL cs.IR

    Graph Retrieval-Augmented Generation: A Survey

    Authors: Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang

    Abstract: Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as "hallucination", lack of domain-specific knowledge, and outdated information. However, the complex structure of relati…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: Ongoing work

  9. arXiv:2408.07654  [pdf, other]

    cs.LG

    Graph Triple Attention Network: A Decoupled Perspective

    Authors: Xiaotang Wang, Yun Zhu, Haizhou Shi, Yongchao Liu, Chuntao Hong

    Abstract: Graph Transformers (GTs) have recently achieved significant success in the graph domain by effectively capturing both long-range dependencies and graph inductive biases. However, these methods face two primary challenges: (1) multi-view chaos, which results from coupling multi-view information (positional, structural, attribute), thereby impeding flexible usage and the interpretability of the prop…

    Submitted 14 August, 2024; originally announced August 2024.

  10. arXiv:2408.07071  [pdf]

    physics.geo-ph cs.LG

    Approaches for enhancing extrapolability in process-based and data-driven models in hydrology

    Authors: Haiyang Shi

    Abstract: The application of process-based and data-driven hydrological models is crucial in modern hydrological research, especially for predicting key water cycle variables such as runoff, evapotranspiration (ET), and soil moisture. These models provide a scientific basis for water resource management, flood forecasting, and ecological protection. Process-based models simulate the physical mechanisms of w…

    Submitted 13 August, 2024; originally announced August 2024.

  11. arXiv:2408.04821  [pdf]

    cs.RO

    VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving

    Authors: Keke Long, Haotian Shi, Jiaxi Liu, Xiaopeng Li

    Abstract: Motivated by the emergent reasoning capabilities of Vision Language Models (VLMs) and their potential to improve the comprehensibility of autonomous driving systems, this paper introduces a closed-loop autonomous driving controller called VLM-MPC, which combines a VLM for high-level decision-making and a Model Predictive Controller (MPC) for low-level vehicle control. The proposed VLM-MPC system is…

    Submitted 8 August, 2024; originally announced August 2024.

  12. arXiv:2408.04547  [pdf, other]

    cs.MM

    Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation

    Authors: Haoxiang Shi, Ziqi Liang, Jun Yu

    Abstract: Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances by utilizing preceding dialogues. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Additionally, prior works failed to account for the intrinsic differences between modalities, resulting in redundant information.…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted by INTERSPEECH 2024

  13. arXiv:2408.00744  [pdf, other]

    cs.CV

    Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

    Authors: Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi

    Abstract: Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning CLIP vision encoder to achieve perceptual sensitivity to local r…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: ECCV 2024

  14. arXiv:2408.00486  [pdf, other]

    cs.RO

    SF-TIM: A Simple Framework for Enhancing Quadrupedal Robot Jumping Agility by Combining Terrain Imagination and Measurement

    Authors: Ze Wang, Yang Li, Long Xu, Hao Shi, Zunwang Ma, Zhen Chu, Chao Li, Fei Gao, Kailun Yang, Kaiwei Wang

    Abstract: Dynamic jumping on high platforms and over gaps differentiates legged robots from wheeled counterparts. Compared to walking on rough terrains, dynamic locomotion on abrupt surfaces requires fusing proprioceptive and exteroceptive perception for explosive movements. In this paper, we propose SF-TIM (Simple Framework combining Terrain Imagination and Measurement), a single-policy method that enhance…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: A demo video has been made available at https://flysoaryun.github.io/SF-TIM

  15. arXiv:2407.19484  [pdf, ps, other]

    cs.IT

    Error Correction Decoding Algorithms of RS Codes Based on An Earlier Termination Algorithm to Find The Error Locator Polynomial

    Authors: Zhengyi Jiang, Hao Shi, Zhongyi Huang, Linqi Song, Bo Bai, Gong Zhang, Hanxu Hou

    Abstract: Reed-Solomon (RS) codes are widely used to correct errors in storage systems. Finding the error locator polynomial is one of the key steps in the error correction procedure of RS codes. Modular Approach (MA) is an effective algorithm for solving the Welch-Berlekamp (WB) key-equation problem to find the error locator polynomial that needs $2t$ steps, where $t$ is the error correction capability. In…

    Submitted 28 July, 2024; originally announced July 2024.

  16. arXiv:2407.19420  [pdf, other]

    cs.LG

    UniGAP: A Universal and Adaptive Graph Upsampling Approach to Mitigate Over-Smoothing in Node Classification Tasks

    Authors: Xiaotang Wang, Yun Zhu, Haizhou Shi, Yongchao Liu, Chuntao Hong

    Abstract: In the graph domain, deep graph networks based on Message Passing Neural Networks (MPNNs) or Graph Transformers often cause over-smoothing of node features, limiting their expressive capacity. Many upsampling techniques involving node and edge manipulation have been proposed to mitigate this issue. However, these methods often require extensive manual labor, resulting in suboptimal performance and…

    Submitted 28 July, 2024; originally announced July 2024.

  17. arXiv:2407.17695  [pdf, other]

    cs.AI cs.CL

    Enhancing Agent Learning through World Dynamics Modeling

    Authors: Zhiyuan Sun, Haochen Shi, Marc-Alexandre Côté, Glen Berseth, Xingdi Yuan, Bang Liu

    Abstract: While large language models (LLMs) have been increasingly deployed across tasks in language understanding and interactive decision-making, their impressive performance is largely due to the comprehensive and in-depth domain knowledge embedded within them. However, the extent of this knowledge can vary across different domains. Existing methods often assume that LLMs already possess such comprehens…

    Submitted 24 July, 2024; originally announced July 2024.

  18. arXiv:2407.15686  [pdf, other]

    cs.GR cs.CV

    Differentiable Convex Polyhedra Optimization from Multi-view Images

    Authors: Daxuan Ren, Haiyi Mei, Hezi Shi, Jianmin Zheng, Jianfei Cai, Lei Yang

    Abstract: This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differentiable computation of hyperplane intersection through duality transform with differentiable optimization for vertex positioning with three-plane intersection, en…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: ECCV2024 https://github.com/kimren227/DiffConvex

  19. arXiv:2407.13268  [pdf, other]

    cs.AI cs.LG

    Mixture of Experts based Multi-task Supervise Learning from Crowds

    Authors: Tao Han, Huaixuan Shi, Xinyi Ding, Xiao Ma, Huamao Gu, Yili Fang

    Abstract: Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise…

    Submitted 18 July, 2024; originally announced July 2024.

  20. arXiv:2407.12817  [pdf, other]

    cs.CL cs.SD eess.AS

    Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition

    Authors: Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang

    Abstract: Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to find the wrong word position. Besides, the acoustic f…

    Submitted 29 June, 2024; originally announced July 2024.

  21. arXiv:2407.09191  [pdf, other]

    cs.CV cs.AI

    From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation

    Authors: Hanrong Shi, Lin Li, Jun Xiao, Yueting Zhuang, Long Chen

    Abstract: Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CA…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Accepted by IJCV

  22. arXiv:2407.02933  [pdf, other]

    cs.RO

    Online Time-Informed Kinodynamic Motion Planning of Nonlinear Systems

    Authors: Fei Meng, Jianbang Liu, Haojie Shi, Han Ma, Hongliang Ren, Max Q.-H. Meng

    Abstract: Sampling-based kinodynamic motion planners (SKMPs) are powerful in finding collision-free trajectories for high-dimensional systems under differential constraints. Time-informed set (TIS) can provide the heuristic search domain to accelerate their convergence to the time-optimal solution. However, existing TIS approximation methods suffer from the curse of dimensionality, computational burden, and…

    Submitted 3 July, 2024; originally announced July 2024.

  23. arXiv:2407.02182  [pdf, other]

    cs.CV cs.RO eess.IV

    Occlusion-Aware Seamless Segmentation

    Authors: Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, Kailun Yang

    Abstract: Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Ble…

    Submitted 17 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The fresh dataset and source code are available at https://github.com/yihong-97/OASS

  24. arXiv:2407.01418  [pdf, other]

    cs.RO cs.AI cs.LG

    RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

    Authors: Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Cheston Tan, Yunzhu Li, Jiajun Wu

    Abstract: Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Robotics: Science and Systems (RSS), 2024. Project page: https://robo-pack.github.io/

    ACM Class: I.2.9; I.2.6; I.2.10

  25. arXiv:2406.18394  [pdf, other]

    q-fin.CP cs.AI

    AlphaForge: A Framework to Mine and Dynamically Combine Formulaic Alpha Factors

    Authors: Hao Shi, Weili Song, Xinting Zhang, Jiahe Shi, Cuicui Luo, Xiang Ao, Hamid Arian, Luis Seco

    Abstract: The complexity of financial data, characterized by its variability and low signal-to-noise ratio, necessitates advanced methods in quantitative investment that prioritize both performance and interpretability. Transitioning from early manual extraction to genetic programming, the most advanced approach in the alpha factor mining domain currently employs reinforcement learning to mine a set of combi…

    Submitted 19 August, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  26. arXiv:2406.15765  [pdf, other]

    cs.LG cs.CL

    Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

    Authors: Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, Yingyan Celine Lin

    Abstract: Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially regarding how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of attention sink in the initial token, which receives disproportionately large attention scores…

    Submitted 22 June, 2024; originally announced June 2024.

  27. arXiv:2406.14696  [pdf, other]

    eess.SY cs.AI

    Physically Analyzable AI-Based Nonlinear Platoon Dynamics Modeling During Traffic Oscillation: A Koopman Approach

    Authors: Kexin Tian, Haotian Shi, Yang Zhou, Sixu Li

    Abstract: Given the complexity and nonlinearity inherent in traffic dynamics within vehicular platoons, there exists a critical need for a modeling methodology with high accuracy while concurrently achieving physical analyzability. Currently, there are two predominant approaches: the physics model-based approach and the Artificial Intelligence (AI)-based approach. Knowing the facts that the physical-based…

    Submitted 20 June, 2024; originally announced June 2024.

  28. arXiv:2406.12846  [pdf, other]

    cs.CV

    DrVideo: Document Retrieval Based Long Video Understanding

    Authors: Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai

    Abstract: Existing methods for long video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 11 pages

  29. arXiv:2406.12550  [pdf, other]

    cs.LG cs.AI

    Offline Imitation Learning with Model-based Reverse Augmentation

    Authors: Jie-Jing Shao, Hao-Sen Shi, Lan-Zhe Guo, Yu-Feng Li

    Abstract: In offline Imitation Learning (IL), one of the main challenges is the covariate shift between the expert observations and the actual distribution encountered by the agent, because it is difficult to determine what action an agent should take when outside the state distribution of the expert demonstrations. Recently, model-free solutions introduce supplementary data and identify th…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted by KDD2024

  30. arXiv:2406.12229  [pdf, other]

    cs.AI cs.LG

    Spatially Resolved Gene Expression Prediction from Histology via Multi-view Graph Contrastive Learning with HSIC-bottleneck Regularization

    Authors: Changxi Chi, Hang Shi, Qi Zhu, Daoqiang Zhang, Wei Shao

    Abstract: The rapid development of spatial transcriptomics (ST) enables the measurement of gene expression at spatial resolution, making it possible to simultaneously profile the gene expression, spatial locations of spots, and the matched histopathological images. However, the cost of collecting ST data is much higher than that of acquiring histopathological images, and thus several studies attempt to predict the…

    Submitted 17 June, 2024; originally announced June 2024.

  31. arXiv:2406.11941  [pdf, other]

    cs.LG cs.AI cs.RO

    Crossfusor: A Cross-Attention Transformer Enhanced Conditional Diffusion Model for Car-Following Trajectory Prediction

    Authors: Junwei You, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran

    Abstract: Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS), enhancing road safety and traffic efficiency. While traditional methods have laid foundational work, modern deep learning techniques, particularly transformer-based models and generative approaches, have significantly improved prediction accuracy by capturing complex and non-lin…

    Submitted 17 June, 2024; originally announced June 2024.

  32. Interpretable modulated differentiable STFT and physics-informed balanced spectrum metric for freight train wheelset bearing cross-machine transfer fault diagnosis under speed fluctuations

    Authors: Chao He, Hongmei Shi, Ruixin Li, Jianbo Li, ZuJun Yu

    Abstract: The service condition of wheelset bearings, as key components, has a direct impact on the safe operation of railway heavy-haul freight trains. However, speed fluctuation of the trains and few fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network coupled with interpretable modulated differentia…

    Submitted 16 June, 2024; originally announced June 2024.

    Journal ref: Advanced Engineering Informatics, 2024

  33. arXiv:2406.11675  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

    Authors: Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, Hao Wang

    Abstract: Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters le…

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: 27 pages, 3 figures, 9 tables; preprint, work in progress

  34. arXiv:2406.11303  [pdf, other]

    cs.CV cs.AI cs.CL

    VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

    Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang

    Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations,…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 38 pages, 44 figures

  35. arXiv:2406.11230  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

    Authors: Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

    Abstract: Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-contex…

    Submitted 17 June, 2024; originally announced June 2024.

  36. arXiv:2406.10885  [pdf, other]

    cs.CL

    On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

    Authors: Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, Yangqiu Song

    Abstract: Entity- and event-level conceptualization, as fundamental elements of human cognition, plays a pivotal role in generalizable reasoning. This process involves abstracting specific instances into higher-level concepts and forming abstract knowledge that can be applied in unfamiliar or novel situations, which can enhance models' inferential capabilities and support the effective transfer of knowledge…

    Submitted 16 June, 2024; originally announced June 2024.

  37. arXiv:2406.10701  [pdf, other]

    cs.CL

    MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

    Authors: Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, Yangqiu Song

    Abstract: Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product i…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 8 pages, 5 figures

  38. arXiv:2406.07528  [pdf, other]

    cs.LG

    QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

    Authors: Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia

    Abstract: The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition.…

    Submitted 22 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

  39. arXiv:2406.05981  [pdf, other]

    cs.LG cs.AI cs.CL

    ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

    Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

    Abstract: Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly pr…

    Submitted 25 July, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

  40. arXiv:2406.04295  [pdf, other]

    cs.CV

    Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

    Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

    Abstract: Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditiona…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: GitHub: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment

  41. arXiv:2406.04032  [pdf, other]

    cs.CV

    Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

    Authors: Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

    Abstract: We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjus…

    Submitted 6 June, 2024; originally announced June 2024.

  42. arXiv:2406.01856  [pdf, ps, other]

    cs.DS math.OC

    On Approximation of Robust Max-Cut and Related Problems using Randomized Rounding Algorithms

    Authors: Haoyan Shi, Sanjay Mehrotra

    Abstract: Goemans and Williamson proposed a randomized rounding algorithm for the MAX-CUT problem with a 0.878 approximation bound in expectation. The 0.878 approximation bound remains the best-known approximation bound for this APX-hard problem. Their approach was subsequently applied to other related problems such as Max-DiCut, MAX-SAT, and Max-2SAT. We show that the randomized rounding algorithm can…

    Submitted 3 June, 2024; originally announced June 2024.

  43. arXiv:2406.00805  [pdf]

    cs.LG physics.geo-ph

    Extrapolability Improvement of Machine Learning-Based Evapotranspiration Models via Domain-Adversarial Neural Networks

    Authors: Haiyang Shi

    Abstract: Machine learning-based hydrological prediction models, despite their high accuracy, face limitations in extrapolation capabilities when applied globally due to uneven data distribution. This study integrates Domain-Adversarial Neural Networks (DANN) to improve the geographical adaptability of evapotranspiration (ET) models. By employing DANN, we aim to mitigate distributional discrepancies between…

    Submitted 2 June, 2024; originally announced June 2024.

  44. arXiv:2405.19915  [pdf, other]

    cs.AI

    P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer

    Authors: Huihong Shi, Xin Cheng, Wendong Mao, Zhongfeng Wang

    Abstract: Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield non-negligible re-quantization overhead, limiting ViTs' hardware efficien…

    Submitted 30 May, 2024; originally announced May 2024.

  45. arXiv:2405.18111  [pdf, other]

    cs.CL

    ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator

    Authors: Junda Zhu, Lingyong Yan, Haibo Shi, Dawei Yin, Lei Sha

    Abstract: Large language models (LLMs) have been shown to benefit substantially from retrieval-augmented generation (RAG) in alleviating hallucinations when confronted with knowledge-intensive questions. RAG adopts information retrieval techniques to inject external knowledge from semantically relevant documents as input contexts. However, due to today's Internet being flooded with noisy and fabricated content, it is i…

    Submitted 16 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: 18 pages, 7 figures

  46. arXiv:2405.17900  [pdf, other]

    cs.CL

    Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

    Authors: Haoxiang Shi, Xulong Zhang, Ning Cheng, Yong Zhang, Jun Yu, Jing Xiao, Jianzong Wang

    Abstract: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared informa…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

  47. arXiv:2405.17777  [pdf, other]

    cs.IR

    RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

    Authors: Jianzong Wang, Haoxiang Shi, Kaiyi Luo, Xulong Zhang, Ning Cheng, Jing Xiao

    Abstract: Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be partially provided. In this research, we introduce an innovative unsupervised hashing techniq…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

  48. arXiv:2405.17028  [pdf, other]

    cs.SD eess.AS

    RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

    Authors: Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

    Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data

  49. arXiv:2405.16847  [pdf, other]

    cs.CV cs.AI

    TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

    Authors: Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu

    Abstract: Autoregressive next-token prediction is a standard pretraining method for large-scale language models, but its application to vision tasks is hindered by the non-sequential nature of image data, leading to cumulative errors. Most vision models employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce TokenUnify, a novel pr…

    Submitted 27 May, 2024; originally announced May 2024.

  50. arXiv:2405.16533  [pdf, other]

    cs.CL

    Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

    Authors: Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, Zhaochun Ren

    Abstract: Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, empowering them to solve practical tasks. Existing work typically empowers LLMs as tool users with a manually designed workflow, where the LLM plans a series of tools in a step-by-step manner, and sequentially executes each tool to obtain intermediate results until deriving the…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Work in progress