Search | arXiv e-print repository

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Layla Isik, Yen-Ling Kuo, Tianmin Shu

Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.

Submitted 22 August, 2024; originally announced August 2024.

Comments: Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code: https://github.com/SCAI-JHU/MuMA-ToM

arXiv:2408.12139 [pdf, ps, other]

DRExplainer: Quantifiable Interpretability in Drug Response Prediction with Directed Graph Convolutional Network

Authors: Haoyuan Shi, Tao Xu, Xiaodi Li, Qian Gao, Junfeng Xia, Zhenyu Yue

Abstract: Predicting the response of a cancer cell line to a therapeutic drug is pivotal for personalized medicine. Despite numerous deep learning methods that have been developed for drug response prediction, integrating diverse information about biological entities and predicting the directional response remain major challenges. Here, we propose a novel interpretable predictive model, DRExplainer, which leverages a directed graph convolutional network to enhance the prediction in a directed bipartite network framework. DRExplainer constructs a directed bipartite network integrating multi-omics profiles of cell lines, the chemical structure of drugs and known drug response to achieve directed prediction. Then, DRExplainer identifies the most relevant subgraph to each prediction in this directed bipartite network by learning a mask, facilitating critical medical decision-making. Additionally, we introduce a quantifiable method for model interpretability that leverages a ground truth benchmark dataset curated from biological features. In computational experiments, DRExplainer outperforms state-of-the-art predictive methods and another graph-based explanation method under the same experimental setting. Finally, the case studies further validate the interpretability and the effectiveness of DRExplainer in predicting novel drug response. Our code is available at: https://github.com/vshy-dream/DRExplainer.

Submitted 22 August, 2024; originally announced August 2024.

arXiv:2408.11545 [pdf, other]

UNetMamba: Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images

Authors: Enze Zhu, Zhan Chen, Dingkai Wang, Hanru Shi, Xiaoxuan Liu, Lei Wang

Abstract: The semantic segmentation of high-resolution remote sensing images plays a crucial role in downstream applications such as urban planning and disaster assessment. However, existing Transformer-based methods suffer from a trade-off between accuracy and efficiency. To overcome this dilemma, we propose UNetMamba, a novel Mamba-based semantic segmentation model. It incorporates a Mamba Segmentation Decoder (MSD) that can efficiently decode the complex information within high-resolution images, and a Local Supervision Module (LSM), which is train-only but can significantly enhance the perception of local contents. Extensive experiments demonstrate that UNetMamba outperforms the state-of-the-art methods with the mIoU increased by 0.87% on LoveDA and 0.36% on ISPRS Vaihingen, while achieving high efficiency through light weight, low memory footprint and low computational cost. The source code will soon be publicly available at https://github.com/EnzeZhu2001/UNetMamba.
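
For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed from a confusion matrix. It is not taken from the UNetMamba code; the function name `mean_iou` and the toy two-class matrix are assumptions made purely for illustration.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of pixels whose true class is i and
    whose predicted class is j (hypothetical toy input, not real data).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]                               # correctly labelled pixels
        fn = sum(confusion[c]) - tp                        # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labelled c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy two-class example: IoU is 8/11 for class 0 and 9/12 for class 1.
toy = [[8, 2],
       [1, 9]]
print(round(mean_iou(toy), 4))
```

Per-class IoU is averaged without weighting by class frequency, which is why mIoU penalises poor performance on rare classes.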

Submitted 21 August, 2024; originally announced August 2024.

arXiv:2408.11311 [pdf, other]

HiMA: Hierarchical Quantum Microarchitecture for Qubit-Scaling and Quantum Process-Level Parallelism

Authors: Qi Zhou, Zi-Hao Mei, Han-Qing Shi, Liang-Liang Guo, Xiao-Yan Yang, Yun-Jie Wang, Xiao-Fan Xu, Cheng Xue, Wei-Cheng Kong, Jun-Chao Wang, Yu-Chun Wu, Zhao-Yun Chen, Guo-Ping Guo

Abstract: Quantum computing holds immense potential for addressing a myriad of intricate challenges, which is significantly amplified when scaled to thousands of qubits. However, a major challenge lies in developing an efficient and scalable quantum control system. To address this, we propose a novel Hierarchical MicroArchitecture (HiMA) designed to facilitate qubit scaling and exploit quantum process-level parallelism. This microarchitecture is based on three core elements: (i) discrete qubit-level drive and readout, (ii) a process-based hierarchical trigger mechanism, and (iii) multiprocessing with a staggered triggering technique to enable efficient quantum process-level parallelism. We implement HiMA as a control system for a 72-qubit tunable superconducting quantum processing unit, serving a public quantum cloud computing platform, which is capable of expanding to 6144 qubits through three-layer cascading. In our benchmarking tests, HiMA achieves up to a 4.89x speedup under a 5-process parallel configuration. Consequently, to the best of our knowledge, we have achieved the highest CLOPS (Circuit Layer Operations Per Second), reaching up to 43,680, across all publicly available platforms.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2408.09949 [pdf, other]

C${^2}$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

Authors: Zhigang Chen, Benjia Zhou, Yiqing Huang, Jun Wan, Yibo Hu, Hailin Shi, Yanyan Liang, Zhen Lei, Du Zhang

Abstract: Sign Language Representation Learning (SLRL) is crucial for a range of sign language-related downstream tasks such as Sign Language Translation (SLT) and Sign Language Retrieval (SLRet). Recently, many gloss-based and gloss-free SLRL methods have been proposed, showing promising performance. Among them, the gloss-free approach shows promise for strong scalability without relying on gloss annotations. However, it currently faces suboptimal solutions due to challenges in encoding the intricate, context-sensitive characteristics of sign language videos, mainly struggling to discern essential sign features using a non-monotonic video-text alignment strategy. Therefore, we introduce an innovative pretraining paradigm for gloss-free SLRL, called C${^2}$RL, in this paper. Specifically, rather than merely incorporating a non-monotonic semantic alignment of video and text to learn language-oriented sign features, we emphasize two pivotal aspects of SLRL: Implicit Content Learning (ICL) and Explicit Context Learning (ECL). ICL delves into the content of communication, capturing the nuances, emphasis, timing, and rhythm of the signs. In contrast, ECL focuses on understanding the contextual meaning of signs and converting them into equivalent sentences. Despite its simplicity, extensive experiments confirm that the joint optimization of ICL and ECL results in robust sign language representation and significant performance gains in gloss-free SLT and SLRet tasks. Notably, C${^2}$RL improves the BLEU-4 score by +5.3 on P14T, +10.6 on CSL-daily, +6.2 on OpenASL, and +1.3 on How2Sign. It also boosts the R@1 score by +8.3 on P14T, +14.4 on CSL-daily, and +5.9 on How2Sign. Additionally, we set a new baseline for the OpenASL dataset in the SLRet task.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2408.09787 [pdf, other]

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang

Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.

Submitted 19 August, 2024; originally announced August 2024.

Comments: Accepted by SIGGRAPH Asia 2024, Project and Codes: https://github.com/HITsz-TMG/Anim-Director

arXiv:2408.09251 [pdf, other]

V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models

Authors: Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran

Abstract: Advancements in autonomous driving have increasingly focused on end-to-end (E2E) systems that manage the full spectrum of driving tasks, from environmental perception to vehicle navigation and control. This paper introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework with large vision-language models (VLMs). V2X-VLM is designed to enhance situational awareness, decision-making, and ultimate trajectory planning by integrating data from vehicle-mounted cameras, infrastructure sensors, and textual information. The strength of the comprehensive multimodal data fusion of the VLM enables precise and safe E2E trajectory planning in complex and dynamic driving scenarios. Validation on the DAIR-V2X dataset demonstrates that V2X-VLM outperforms existing state-of-the-art methods in cooperative autonomous driving.

Submitted 17 August, 2024; originally announced August 2024.

arXiv:2408.08921 [pdf, other]

Graph Retrieval-Augmented Generation: A Survey

Authors: Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, Siliang Tang

Abstract: Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as "hallucination", lack of domain-specific knowledge, and outdated information. However, the complex structure of relationships among different entities in databases presents challenges for RAG systems. In response, GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge and facilitating more accurate, context-aware responses. Given the novelty and potential of GraphRAG, a systematic review of current technologies is imperative. This paper provides the first comprehensive overview of GraphRAG methodologies. We formalize the GraphRAG workflow, encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. We then outline the core technologies and training methods at each stage. Additionally, we examine downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. Finally, we explore future research directions to inspire further inquiries and advance progress in the field.

Submitted 15 August, 2024; originally announced August 2024.

Comments: Ongoing work

arXiv:2408.07654 [pdf, other]

Graph Triple Attention Network: A Decoupled Perspective

Authors: Xiaotang Wang, Yun Zhu, Haizhou Shi, Yongchao Liu, Chuntao Hong

Abstract: Graph Transformers (GTs) have recently achieved significant success in the graph domain by effectively capturing both long-range dependencies and graph inductive biases. However, these methods face two primary challenges: (1) multi-view chaos, which results from coupling multi-view information (positional, structural, attribute), thereby impeding flexible usage and the interpretability of the propagation process; and (2) local-global chaos, which arises from coupling local message passing with global attention, leading to issues of overfitting and over-globalizing. To address these challenges, we propose a high-level decoupled perspective of GTs, breaking them down into three components and two interaction levels: positional attention, structural attention, and attribute attention, alongside local and global interaction. Based on this decoupled perspective, we design a decoupled graph triple attention network named DeGTA, which separately computes multi-view attentions and adaptively integrates multi-view local and global information. This approach offers three key advantages: enhanced interpretability, flexible design, and adaptive integration of local and global information. Through extensive experiments, DeGTA achieves state-of-the-art performance across various datasets and tasks, including node classification and graph classification. Comprehensive ablation studies demonstrate that decoupling is essential for improving performance and enhancing interpretability. Our code is available at: https://github.com/wangxiaotang0906/DeGTA

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.07071 [pdf]

Approaches for enhancing extrapolability in process-based and data-driven models in hydrology

Authors: Haiyang Shi

Abstract: The application of process-based and data-driven hydrological models is crucial in modern hydrological research, especially for predicting key water cycle variables such as runoff, evapotranspiration (ET), and soil moisture. These models provide a scientific basis for water resource management, flood forecasting, and ecological protection. Process-based models simulate the physical mechanisms of watershed hydrological processes, while data-driven models leverage large datasets and advanced machine learning algorithms. This paper reviewed and compared methods for assessing and enhancing the extrapolability of both model types, discussing their prospects and limitations. Key strategies include the use of leave-one-out cross-validation and similarity-based methods to evaluate model performance in ungauged regions. Deep learning, transfer learning, and domain adaptation techniques are also promising in their potential to improve model predictions in data-sparse and extreme conditions. Interdisciplinary collaboration and continuous algorithmic advancements are also important to strengthen the global applicability and reliability of hydrological models.
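
As a rough illustration of the leave-one-out and similarity-based evaluation mentioned above, the sketch below holds out one catchment at a time, predicts its runoff from the most similar remaining catchment, and averages the absolute errors. Everything here is an assumption for illustration: the catchment attribute, the runoff values, and the nearest-neighbour "model" are hypothetical and are not drawn from the paper.

```python
# Hypothetical catchments: (attribute used for similarity, observed runoff).
# Both columns are made-up numbers, not data from the paper.
catchments = [(1.0, 10.0), (2.0, 8.0), (3.0, 6.0), (4.0, 4.0)]


def predict_from_most_similar(train, attribute):
    """Similarity-based transfer: copy runoff from the closest catchment."""
    nearest = min(train, key=lambda c: abs(c[0] - attribute))
    return nearest[1]


def leave_one_out_mae(data):
    """Treat each catchment in turn as 'ungauged' and score the prediction."""
    errors = []
    for i, (attribute, observed) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every catchment except the held-out one
        predicted = predict_from_most_similar(train, attribute)
        errors.append(abs(predicted - observed))
    return sum(errors) / len(errors)


print(leave_one_out_mae(catchments))
```

Because the held-out catchment never contributes to its own prediction, the resulting error approximates performance at a site with no observations.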

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2408.04821 [pdf]

VLM-MPC: Vision Language Foundation Model (VLM)-Guided Model Predictive Controller (MPC) for Autonomous Driving

Authors: Keke Long, Haotian Shi, Jiaxi Liu, Xiaopeng Li

Abstract: Motivated by the emergent reasoning capabilities of Vision Language Models (VLMs) and their potential to improve the comprehensibility of autonomous driving systems, this paper introduces a closed-loop autonomous driving controller called VLM-MPC, which combines a VLM for high-level decision-making and a Model Predictive Controller (MPC) for low-level vehicle control. The proposed VLM-MPC system is structurally divided into two asynchronous components: an upper-level VLM and a lower-level MPC. The upper-level VLM generates driving parameters for lower-level control based on front camera images, ego vehicle state, traffic environment conditions, and reference memory. The lower-level MPC controls the vehicle in real-time using these parameters, considering engine lag and providing state feedback to the entire system. Experiments based on the nuScenes dataset validated the effectiveness of the proposed VLM-MPC system across various scenarios (e.g., night, rain, intersections). Results showed that the VLM-MPC system consistently outperformed baseline models in terms of safety and driving comfort. By comparing behaviors under different weather conditions and scenarios, we demonstrated the VLM's ability to understand the environment and make reasonable inferences.

Submitted 8 August, 2024; originally announced August 2024.

arXiv:2408.04547 [pdf, other]

Emotional Cues Extraction and Fusion for Multi-modal Emotion Prediction and Recognition in Conversation

Authors: Haoxiang Shi, Ziqi Liang, Jun Yu

Abstract: Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances by utilizing preceding dialogues. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Additionally, prior works failed to account for the intrinsic differences between modalities, resulting in redundant information. To overcome these limitations, we propose an emotional cues extraction and fusion network, which consists of two stages: a modality-specific learning stage that utilizes word-level labels and prosody learning to construct emotion embedding spaces for each modality, and a two-step fusion stage for integrating multi-modal features. Moreover, the emotion features extracted by our model are also applicable to the Emotion Recognition in Conversation (ERC) task. Experimental results validate the efficacy of the proposed method, demonstrating superior performance on both IEMOCAP and MELD datasets.

Submitted 8 August, 2024; originally announced August 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2408.00744 [pdf, other]

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Authors: Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi

Abstract: Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. Besides, we additionally introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representation of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively on A-847, A-150, PC-459, PC-59 and PAS-20. Furthermore, in a panoptic setting on ADE20K, we achieve the performance of 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.

Submitted 1 August, 2024; originally announced August 2024.

Comments: ECCV 2024

arXiv:2408.00486 [pdf, other]

SF-TIM: A Simple Framework for Enhancing Quadrupedal Robot Jumping Agility by Combining Terrain Imagination and Measurement

Authors: Ze Wang, Yang Li, Long Xu, Hao Shi, Zunwang Ma, Zhen Chu, Chao Li, Fei Gao, Kailun Yang, Kaiwei Wang

Abstract: Dynamic jumping on high platforms and over gaps differentiates legged robots from wheeled counterparts. Compared to walking on rough terrains, dynamic locomotion on abrupt surfaces requires fusing proprioceptive and exteroceptive perception for explosive movements. In this paper, we propose SF-TIM (Simple Framework combining Terrain Imagination and Measurement), a single-policy method that enhances quadrupedal robot jumping agility, while preserving their fundamental blind walking capabilities. In addition, we introduce a terrain-guided reward design specifically to assist quadrupedal robots in high jumping, improving their performance in this task. To narrow the simulation-to-reality gap in quadrupedal robot learning, we introduce a stable and high-speed elevation map generation framework, enabling zero-shot simulation-to-reality transfer of locomotion ability. Our algorithm has been deployed and validated on both small- and large-size quadrupedal robots, demonstrating its effectiveness in real-world applications: the robot has successfully traversed various high platforms and gaps, showing the robustness of our proposed approach. A demo video has been made available at https://flysoaryun.github.io/SF-TIM.

Submitted 1 August, 2024; originally announced August 2024.

Comments: A demo video has been made available at https://flysoaryun.github.io/SF-TIM

arXiv:2407.19484 [pdf, ps, other]

Error Correction Decoding Algorithms of RS Codes Based on An Earlier Termination Algorithm to Find The Error Locator Polynomial

Authors: Zhengyi Jiang, Hao Shi, Zhongyi Huang, Linqi Song, Bo Bai, Gong Zhang, Hanxu Hou

Abstract: Reed-Solomon (RS) codes are widely used to correct errors in storage systems. Finding the error locator polynomial is one of the key steps in the error correction procedure of RS codes. Modular Approach (MA) is an effective algorithm for solving the Welch-Berlekamp (WB) key-equation problem to find the error locator polynomial that needs $2t$ steps, where $t$ is the error correction capability. In this paper, we first present a new MA algorithm that only requires $2e$ steps and then propose two fast decoding algorithms for RS codes based on our MA algorithm, where $e$ is the number of errors and $e\leq t$. We propose the Improved-Frequency Domain Modular Approach (I-FDMA) algorithm that needs $2e$ steps to solve the error locator polynomial and present our first decoding algorithm based on the I-FDMA algorithm. We show that, compared with the existing methods based on MA algorithms, our I-FDMA algorithm can effectively reduce the decoding complexity of RS codes when $e<t$. Furthermore, we propose the $t_0$-Shortened I-FDMA ($t_0$-SI-FDMA) algorithm ($t_0$ is a predetermined even number less than $2t-1$) based on the new termination mechanism to solve the error number $e$ quickly. We propose our second decoding algorithm based on the SI-FDMA algorithm for RS codes and show that the multiplication complexity of our second decoding algorithm is lower than our first decoding algorithm (the I-FDMA decoding algorithm) when $2e<t_0+1$.
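
To make the step counts in this abstract concrete, here is a small arithmetic sketch. It only tabulates the quantities stated above ($2t$ steps for MA, $2e$ steps for the new algorithm, and the condition $2e<t_0+1$); it does not implement any decoding. The RS(255, 223) parameters and all function names are assumptions chosen for illustration, using the standard relation $t=\lfloor (n-k)/2 \rfloor$.

```python
def ma_steps(t: int) -> int:
    """Steps for the baseline Modular Approach, as stated in the abstract: 2t."""
    return 2 * t


def improved_ma_steps(e: int, t: int) -> int:
    """Steps for the early-terminating variant, as stated in the abstract: 2e."""
    if not 0 <= e <= t:
        raise ValueError("number of errors e must satisfy 0 <= e <= t")
    return 2 * e


def second_decoder_cheaper(e: int, t0: int, t: int) -> bool:
    """Abstract's condition for the second decoder to need fewer multiplications."""
    if t0 % 2 != 0 or t0 >= 2 * t - 1:
        raise ValueError("t0 must be an even number less than 2t - 1")
    return 2 * e < t0 + 1


# Illustrative code parameters (an assumption, not taken from the paper).
n, k = 255, 223
t = (n - k) // 2  # error correction capability: 16

print(ma_steps(t))                       # 32 steps regardless of how many errors occurred
print(improved_ma_steps(3, t))           # 6 steps when only 3 errors occurred
print(second_decoder_cheaper(3, 8, t))   # True, since 2*3 < 8 + 1
```

The saving grows as the actual error count falls below the design capability, which is the common case in storage systems with low raw error rates.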

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.19420 [pdf, other]

UniGAP: A Universal and Adaptive Graph Upsampling Approach to Mitigate Over-Smoothing in Node Classification Tasks

Authors: Xiaotang Wang, Yun Zhu, Haizhou Shi, Yongchao Liu, Chuntao Hong

Abstract: In the graph domain, deep graph networks based on Message Passing Neural Networks (MPNNs) or Graph Transformers often cause over-smoothing of node features, limiting their expressive capacity. Many upsampling techniques involving node and edge manipulation have been proposed to mitigate this issue. However, these methods often require extensive manual labor, resulting in suboptimal performance and lacking a universal integration strategy. In this study, we introduce UniGAP, a universal and adaptive graph upsampling technique for graph data. It provides a universal framework for graph upsampling, encompassing most current methods as variants. Moreover, UniGAP serves as a plug-in component that can be seamlessly and adaptively integrated with existing GNNs to enhance performance and mitigate the over-smoothing problem. Through extensive experiments, UniGAP demonstrates significant improvements over heuristic data augmentation methods across various datasets and metrics. We analyze how graph structure evolves with UniGAP, identifying key bottlenecks where over-smoothing occurs, and providing insights into how UniGAP addresses this issue. Lastly, we show the potential of combining UniGAP with large language models (LLMs) to further improve downstream performance. Our code is available at: https://github.com/wangxiaotang0906/UniGAP

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.17695 [pdf, other]

Enhancing Agent Learning through World Dynamics Modeling

Authors: Zhiyuan Sun, Haochen Shi, Marc-Alexandre Côté, Glen Berseth, Xingdi Yuan, Bang Liu

Abstract: While large language models (LLMs) have been increasingly deployed across tasks in language understanding and interactive decision-making, their impressive performance is largely due to the comprehensive and in-depth domain knowledge embedded within them. However, the extent of this knowledge can vary across different domains. Existing methods often assume that LLMs already possess such comprehensive and in-depth knowledge of their environment, overlooking potential gaps in their understanding of actual world dynamics. To address this gap, we introduce Discover, Verify, and Evolve (DiVE), a framework that discovers world dynamics from a small number of demonstrations, verifies the correctness of these dynamics, and evolves new, advanced dynamics tailored to the current situation. Through extensive evaluations, we analyze the impact of each component on performance and compare the automatically generated dynamics from DiVE with human-annotated world dynamics. Our results demonstrate that LLMs guided by DiVE can make better decisions, achieving rewards comparable to human players in the Crafter environment.

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.15686 [pdf, other]

Differentiable Convex Polyhedra Optimization from Multi-view Images

Authors: Daxuan Ren, Haiyi Mei, Hezi Shi, Jianmin Zheng, Jianfei Cai, Lei Yang

Abstract: This paper presents a novel approach for the differentiable rendering of convex polyhedra, addressing the limitations of recent methods that rely on implicit field supervision. Our technique introduces a strategy that combines non-differentiable computation of hyperplane intersection through duality transform with differentiable optimization for vertex positioning with three-plane intersection, enabling gradient-based optimization without the need for 3D implicit fields. This allows for efficient shape representation across a range of applications, from shape parsing to compact mesh reconstruction. This work not only overcomes the challenges of previous approaches but also sets a new standard for representing shapes with convex polyhedra.

Submitted 22 July, 2024; originally announced July 2024.

Comments: ECCV2024 https://github.com/kimren227/DiffConvex

arXiv:2407.13268 [pdf, other]

Mixture of Experts based Multi-task Supervise Learning from Crowds

Authors: Tao Han, Huaixuan Shi, Xinyi Ding, Xiao Ma, Huamao Gu, Yili Fang

Abstract: Existing truth inference methods in crowdsourcing aim to map redundant labels and items to the ground truth. They treat the ground truth as hidden variables and use statistical or deep learning-based worker behavior models to infer the ground truth. However, worker behavior models that rely on ground truth hidden variables overlook workers' behavior at the item feature level, leading to imprecise characterizations and negatively impacting the quality of truth inference. This paper proposes a new paradigm of multi-task supervised learning from crowds, which eliminates the need to model items' ground truth in worker behavior models. Within this paradigm, we propose a worker behavior model at the item feature level called Mixture of Experts based Multi-task Supervised Learning from Crowds (MMLC). Two truth inference strategies are proposed within MMLC. The first strategy, named MMLC-owf, utilizes clustering methods in the worker spectral space to identify the projection vector of the oracle worker. Subsequently, the labels generated based on this vector are considered as the inferred truth. The second strategy, called MMLC-df, employs the MMLC model to fill the crowdsourced data, which can enhance the effectiveness of existing truth inference methods. Experimental results demonstrate that MMLC-owf outperforms state-of-the-art methods and MMLC-df enhances the quality of existing truth inference methods.

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.12817 [pdf, other]

Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition

Authors: Yuchun Shu, Bo Hu, Yifeng He, Hao Shi, Longbiao Wang, Jianwu Dang

Abstract: Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them in a well-founded way is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to find the wrong word position. Besides, the acoustic feature from the ASR encoder is also used to provide the correct pronunciation references. N-best candidates from ASR are aligned using the edit path, to confirm each other and recover some missing character errors. Furthermore, the cross-attention mechanism fuses the information between error correction references and the ASR hypothesis. The experimental results show that both the acoustic and confidence references help with error correction. The proposed system reduces the error rate by 21% compared with the ASR model.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2407.09191 [pdf, other]

From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation

Authors: Hanrong Shi, Lin Li, Jun Xiao, Yueting Zhuang, Long Chen

Abstract: Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Accepted by IJCV

arXiv:2407.02933 [pdf, other]

Online Time-Informed Kinodynamic Motion Planning of Nonlinear Systems

Authors: Fei Meng, Jianbang Liu, Haojie Shi, Han Ma, Hongliang Ren, Max Q. -H. Meng

Abstract: Sampling-based kinodynamic motion planners (SKMPs) are powerful in finding collision-free trajectories for high-dimensional systems under differential constraints. Time-informed set (TIS) can provide the heuristic search domain to accelerate their convergence to the time-optimal solution. However, existing TIS approximation methods suffer from the curse of dimensionality, computational burden, and a limited scope of applicable systems, e.g., linear and polynomial nonlinear systems. To overcome these problems, we propose a method that leverages deep learning technology, Koopman operator theory, and random set theory. Specifically, we propose a Deep Invertible Koopman operator with control U model named DIKU to predict states forward and backward over a long horizon by modifying the auxiliary network with an invertible neural network. A sampling-based approach, ASKU, performing reachability analysis for the DIKU is developed to approximate the TIS of nonlinear control systems online. Furthermore, we design an online time-informed SKMP using a direct sampling technique to draw uniform random samples in the TIS. Simulation experiment results demonstrate that our method outperforms other existing works, approximating TIS in near real-time and achieving superior planning performance in several time-optimal kinodynamic motion planning problems.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02182 [pdf, other]

Occlusion-Aware Seamless Segmentation

Authors: Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, Kailun Yang

Abstract: Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all three of these challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, i.e., BlendPASS. Besides, we propose the first solution UnmaskFormer, aiming at unmasking the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching a remarkable mAPQ of 26.58% and mIoU of 43.66%. On public panoramic semantic segmentation datasets, i.e., SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code are available at https://github.com/yihong-97/OASS.

Submitted 17 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024. The fresh dataset and source code are available at https://github.com/yihong-97/OASS

arXiv:2407.01418 [pdf, other]

RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

Authors: Bo Ai, Stephen Tian, Haochen Shi, Yixuan Wang, Cheston Tan, Yunzhu Li, Jiajun Wu

Abstract: Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Robotics: Science and Systems (RSS), 2024. Project page: https://robo-pack.github.io/

ACM Class: I.2.9; I.2.6; I.2.10

arXiv:2406.18394 [pdf, other]

AlphaForge: A Framework to Mine and Dynamically Combine Formulaic Alpha Factors

Authors: Hao Shi, Weili Song, Xinting Zhang, Jiahe Shi, Cuicui Luo, Xiang Ao, Hamid Arian, Luis Seco

Abstract: The complexity of financial data, characterized by its variability and low signal-to-noise ratio, necessitates advanced methods in quantitative investment that prioritize both performance and interpretability. Transitioning from early manual extraction to genetic programming, the most advanced approach in the alpha factor mining domain currently employs reinforcement learning to mine a set of combination factors with fixed weights. However, the performance of resultant alpha factors exhibits inconsistency, and the inflexibility of fixed factor weights proves insufficient in adapting to the dynamic nature of financial markets. To address this issue, this paper proposes a two-stage formulaic alpha generating framework, AlphaForge, for alpha factor mining and factor combination. This framework employs a generative-predictive neural network to generate factors, leveraging the robust spatial exploration capabilities inherent in deep learning while concurrently preserving diversity. The combination model within the framework incorporates the temporal performance of factors for selection and dynamically adjusts the weights assigned to each component alpha factor. Experiments conducted on real-world datasets demonstrate that our proposed model outperforms contemporary benchmarks in formulaic alpha factor mining. Furthermore, our model exhibits a notable enhancement in portfolio returns within the realm of quantitative investment and real money investment.

Submitted 19 August, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.15765 [pdf, other]

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Authors: Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, Yingyan Celine Lin

Abstract: Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially regarding how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of attention sink in the initial token, which receives disproportionately large attention scores despite its lack of semantic importance, this work delves deeper into this phenomenon. We aim to provide a more profound understanding of the existence of attention sinks within LLMs and to uncover ways to enhance the achievable accuracy of LLMs by directly optimizing the attention distributions, without the need for weight finetuning. Specifically, this work begins with comprehensive visualizations of the attention distributions in LLMs during inference across various inputs and tasks. Based on these visualizations, to the best of our knowledge, we are the first to discover that (1) attention sinks occur not only at the start of sequences but also within later tokens of the input, and (2) not all attention sinks have a positive impact on the achievable accuracy of LLMs. Building upon our findings, we propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner. Extensive experiments validate that ACT consistently enhances the accuracy of various LLMs across different applications. Specifically, ACT achieves an average improvement of up to 7.30% in accuracy across different datasets when applied to Llama-30B. Our code is available at https://github.com/GATECH-EIC/ACT.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.14696 [pdf, other]

Physically Analyzable AI-Based Nonlinear Platoon Dynamics Modeling During Traffic Oscillation: A Koopman Approach

Authors: Kexin Tian, Haotian Shi, Yang Zhou, Sixu Li

Abstract: Given the complexity and nonlinearity inherent in traffic dynamics within vehicular platoons, there exists a critical need for a modeling methodology with high accuracy while concurrently achieving physical analyzability. Currently, there are two predominant approaches: the physics model-based approach and the Artificial Intelligence (AI)-based approach. Given that physics-based models usually lack sufficient modeling accuracy and may suffer from function mismatches, while pure AI-based methods lack analyzability, this paper proposes an AI-based Koopman approach to model the unknown nonlinear platoon dynamics, harnessing the power of AI while maintaining physical analyzability, with a particular focus on periods of traffic oscillation. Specifically, this research first employs a deep learning framework to generate the embedding function that lifts the original space into the embedding space. Given the embedding space descriptiveness, the platoon dynamics can be expressed as a linear dynamical system founded on Koopman theory. Based on that, the routine of linear dynamical system analysis can be conducted on the learned traffic linear dynamics in the embedding space. In this way, the physical interpretability and analyzability of model-based methods can be combined with the heightened precision inherent in data-driven approaches. Comparative experiments have been conducted with existing modeling approaches, which suggest our method's superiority in accuracy. Additionally, a phase plane analysis is performed, further evidencing our approach's effectiveness in replicating the complex dynamic patterns. Moreover, the proposed methodology is shown to be capable of analyzing stability, attesting to its physical analyzability.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.12846 [pdf, other]

DrVideo: Document Retrieval Based Long Video Understanding

Authors: Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai

Abstract: Existing methods for long video understanding primarily focus on videos lasting only tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document to initially retrieve key frames and augment the information of these frames, which is used as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information is gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods with +3.8 accuracy on EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).

Submitted 18 June, 2024; originally announced June 2024.

Comments: 11 pages

arXiv:2406.12550 [pdf, other]

Offline Imitation Learning with Model-based Reverse Augmentation

Authors: Jie-Jing Shao, Hao-Sen Shi, Lan-Zhe Guo, Yu-Feng Li

Abstract: In offline Imitation Learning (IL), one of the main challenges is the covariate shift between the expert observations and the actual distribution encountered by the agent, because it is difficult to determine what action an agent should take when outside the state distribution of the expert demonstrations. Recently, model-free solutions introduce supplementary data and identify the latent expert-similar samples to augment the reliable samples during learning. Model-based solutions build forward dynamic models with conservatism quantification and then generate additional trajectories in the neighborhood of expert demonstrations. However, without reward supervision, these methods are often over-conservative in the out-of-expert-support regions, because only in states close to expert-observed states can there be a preferred action enabling policy optimization. To encourage more exploration on expert-unobserved states, we propose a novel model-based framework, called offline Imitation Learning with Self-paced Reverse Augmentation (SRA). Specifically, we build a reverse dynamic model from the offline demonstrations, which can efficiently generate trajectories leading to the expert-observed states in a self-paced style. Then, we use the subsequent reinforcement learning method to learn from the augmented trajectories and transition from expert-unobserved states to expert-observed states. This framework not only explores the expert-unobserved states but also guides maximizing long-term returns on these states, ultimately enabling generalization beyond the expert data. Empirical results show that our proposal could effectively mitigate the covariate shift and achieve state-of-the-art performance on the offline imitation learning benchmarks. Project website: https://www.lamda.nju.edu.cn/shaojj/KDD24_SRA/

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted by KDD2024

arXiv:2406.12229 [pdf, other]

Spatially Resolved Gene Expression Prediction from Histology via Multi-view Graph Contrastive Learning with HSIC-bottleneck Regularization

Authors: Changxi Chi, Hang Shi, Qi Zhu, Daoqiang Zhang, Wei Shao

Abstract: The rapid development of spatial transcriptomics (ST) enables the measurement of gene expression at spatial resolution, making it possible to simultaneously profile the gene expression, spatial locations of spots, and the matched histopathological images. However, the cost of collecting ST data is much higher than that of acquiring histopathological images, and thus several studies attempt to predict the gene expression on ST by leveraging their corresponding histopathological images. Most of the existing image-based gene prediction models treat the prediction task on each spot of ST data independently, which ignores the spatial dependency among spots. In addition, while the histology images share phenotypic characteristics with the ST data, it is still a challenge to extract such common information to help align paired image and expression representations. To address the above issues, we propose a Multi-view Graph Contrastive Learning framework with HSIC-bottleneck Regularization (ST-GCHB), aiming at learning a shared representation to help impute the gene expression of the queried imaging spots by considering their spatial dependency.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11941 [pdf, other]

Crossfusor: A Cross-Attention Transformer Enhanced Conditional Diffusion Model for Car-Following Trajectory Prediction

Authors: Junwei You, Haotian Shi, Keshu Wu, Keke Long, Sicheng Fu, Sikai Chen, Bin Ran

Abstract: Vehicle trajectory prediction is crucial for advancing autonomous driving and advanced driver assistance systems (ADAS), enhancing road safety and traffic efficiency. While traditional methods have laid foundational work, modern deep learning techniques, particularly transformer-based models and generative approaches, have significantly improved prediction accuracy by capturing complex and non-linear patterns in vehicle motion and traffic interactions. However, these models often overlook the detailed car-following behaviors and inter-vehicle interactions essential for real-world driving scenarios. This study introduces a Cross-Attention Transformer Enhanced Conditional Diffusion Model (Crossfusor) specifically designed for car-following trajectory prediction. Crossfusor integrates detailed inter-vehicular interactions and car-following dynamics into a robust diffusion framework, improving both the accuracy and realism of predicted trajectories. The model leverages a novel temporal feature encoding framework combining GRU, location-based attention mechanisms, and Fourier embedding to capture historical vehicle dynamics. It employs noise scaled by these encoded historical features in the forward diffusion process, and uses a cross-attention transformer to model intricate inter-vehicle dependencies in the reverse denoising process. Experimental results on the NGSIM dataset demonstrate that Crossfusor outperforms state-of-the-art models, particularly in long-term predictions, showcasing its potential for enhancing the predictive capabilities of autonomous driving systems.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11917 [pdf, other]

doi 10.1016/j.aei.2024.102568

Interpretable modulated differentiable STFT and physics-informed balanced spectrum metric for freight train wheelset bearing cross-machine transfer fault diagnosis under speed fluctuations

Authors: Chao He, Hongmei Shi, Ruixin Li, Jianbo Li, ZuJun Yu

Abstract: The service condition of wheelset bearings, which are key components, has a direct impact on the safe operation of railway heavy haul freight trains. However, speed fluctuation of the trains and few fault samples are the two main problems that restrict the accuracy of bearing fault diagnosis. Therefore, a cross-machine transfer diagnosis (pyDSN) network coupled with interpretable modulated differentiable short-time Fourier transform (STFT) and physics-informed balanced spectrum quality metric is proposed to learn domain-invariant and discriminative features under time-varying speeds. Firstly, because fixed windows are insufficient for extracting the frequency components of time-varying speed signals, a modulated differentiable STFT (MDSTFT), which is interpretable with STFT-informed theoretical support, is proposed to extract the robust time-frequency spectrum (TFS). During the training process, multiple windows with different lengths change dynamically. Also, in addition to the classification metric and domain discrepancy metric, we introduce a third kind of metric, referred to as the physics-informed metric, to enhance transferable TFS. A physics-informed balanced spectrum quality (BSQ) regularization loss is devised to guide an optimization direction for MDSTFT and the model. With it, not only can the model acquire a high-quality TFS, but a physics-restricted domain adaptation network is also obtained, which learns real-world physics knowledge and ultimately diminishes the domain discrepancy across different datasets. The experiment is conducted in the scenario of migrating from the laboratory datasets to the freight train dataset, indicating that the hybrid-driven pyDSN outperforms existing methods and has practical value.

Submitted 16 June, 2024; originally announced June 2024.

Journal ref: Advanced Engineering Informatics, 2024

arXiv:2406.11675 [pdf, other]

BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models

Authors: Yibin Wang, Haizhou Shi, Ligong Han, Dimitris Metaxas, Hao Wang

Abstract: Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 27 pages, 3 figures, 9 tables; preprint, work in progress

arXiv:2406.11303 [pdf, other]

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang

Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 38 pages, 44 figures

arXiv:2406.11230 [pdf, other]

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Authors: Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

Abstract: Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10885 [pdf, other]

On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

Authors: Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, Yangqiu Song

Abstract: Entity- and event-level conceptualization, as fundamental elements of human cognition, plays a pivotal role in generalizable reasoning. This process involves abstracting specific instances into higher-level concepts and forming abstract knowledge that can be applied in unfamiliar or novel situations, which can enhance models' inferential capabilities and support the effective transfer of knowledge across various domains. Despite its significance, there is currently a lack of a systematic overview that comprehensively examines existing works in the definition, execution, and application of conceptualization to enhance reasoning tasks. In this paper, we address this gap by presenting the first comprehensive survey of 150+ papers, categorizing various definitions, resources, methods, and downstream applications related to conceptualization into a unified taxonomy, with a focus on the entity and event levels. Furthermore, we shed light on potential future directions in this field and hope to garner more attention from the community.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10701 [pdf, other]

MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

Authors: Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, Yangqiu Song

Abstract: Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 million intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Additional experiments reveal that our obtained intentions significantly enhance large language models in two intention comprehension tasks.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures

arXiv:2406.07528 [pdf, other]

QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

Authors: Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia

Abstract: The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the $\infty$-bench. In the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code can be found at https://github.com/dvlab-research/Q-LLM.

Submitted 22 August, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.05981 [pdf, other]

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Authors: Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

Abstract: Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.

Submitted 25 July, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.04295 [pdf, other]

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

Abstract: Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditional diffusion model, which is also trained on the source domain to transform target data into synthetic data as a source domain projection. This allows the source model to make predictions without weight adaptation. In this paper, we argue that the domains of the source model and the synthetic data in diffusion-driven TTA methods are not aligned. To adapt the source model to the synthetic domain of the unconditional diffusion model, we introduce a Synthetic-Domain Alignment (SDA) framework to fine-tune the source model with synthetic data. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This process mitigates the potential domain gap between the conditional and unconditional models. Extensive experiments across various models and benchmarks demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code is available at https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment.

Submitted 6 June, 2024; originally announced June 2024.

Comments: GitHub: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment

arXiv:2406.04032 [pdf, other]

Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Authors: Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Abstract: We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks, ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.01856 [pdf, ps, other]

On Approximation of Robust Max-Cut and Related Problems using Randomized Rounding Algorithms

Authors: Haoyan Shi, Sanjay Mehrotra

Abstract: Goemans and Williamson proposed a randomized rounding algorithm for the MAX-CUT problem with a 0.878 approximation bound in expectation. The 0.878 approximation bound remains the best-known approximation bound for this APX-hard problem. Their approach was subsequently applied to other related problems such as Max-DiCut, MAX-SAT, and Max-2SAT. We show that the randomized rounding algorithm can also be used to achieve a 0.878 approximation bound for the robust and distributionally robust counterparts of the max-cut problem. We also show that the approximation bounds for the other problems are maintained for their robust and distributionally robust counterparts if the randomization projection framework is used.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00805 [pdf]

Extrapolability Improvement of Machine Learning-Based Evapotranspiration Models via Domain-Adversarial Neural Networks

Authors: Haiyang Shi

Abstract: Machine learning-based hydrological prediction models, despite their high accuracy, face limitations in extrapolation capabilities when applied globally due to uneven data distribution. This study integrates Domain-Adversarial Neural Networks (DANN) to improve the geographical adaptability of evapotranspiration (ET) models. By employing DANN, we aim to mitigate distributional discrepancies between different sites, significantly enhancing the model's extrapolation capabilities. Our results show that DANN improves ET prediction accuracy with an average increase in the Kling-Gupta Efficiency (KGE) of 0.2 to 0.3 compared to the traditional Leave-One-Out (LOO) method. DANN is particularly effective for isolated sites and transition zones between biomes, reducing data distribution discrepancies and avoiding low-accuracy predictions. By leveraging information from data-rich areas, DANN enhances the reliability of global-scale ET products, especially in ungauged regions. This study highlights the potential of domain adaptation techniques to improve the extrapolation and generalization capabilities of machine learning models in hydrological studies.

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.19915 [pdf, other]

P$^2$-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer

Authors: Huihong Shi, Xin Cheng, Wendong Mao, Zhongfeng Wang

Abstract: Vision Transformers (ViTs) have excelled in computer vision tasks but are memory-consuming and computation-intensive, challenging their deployment on resource-constrained devices. To tackle this limitation, prior works have explored ViT-tailored quantization algorithms but retained floating-point scaling factors, which yield non-negligible re-quantization overhead, limiting ViTs' hardware efficiency and motivating more hardware-friendly solutions. To this end, we propose P$^2$-ViT, the first Power-of-Two (PoT) post-training quantization and acceleration framework to accelerate fully quantized ViTs. Specifically, as for quantization, we explore a dedicated quantization scheme to effectively quantize ViTs with PoT scaling factors, thus minimizing the re-quantization overhead. Furthermore, we propose coarse-to-fine automatic mixed-precision quantization to enable better accuracy-efficiency trade-offs. In terms of hardware, we develop a dedicated chunk-based accelerator featuring multiple tailored sub-processors to individually handle ViTs' different types of operations, alleviating reconfigurable overhead. Additionally, we design a tailored row-stationary dataflow to seize the pipeline processing opportunity introduced by our PoT scaling factors, thereby enhancing throughput. Extensive experiments consistently validate P$^2$-ViT's effectiveness. Particularly, we offer comparable or even superior quantization performance with PoT scaling factors when compared to the counterpart with floating-point scaling factors. Besides, we achieve up to $10.1\times$ speedup and $36.8\times$ energy saving over GPU's Turing Tensor Cores, and up to $1.84\times$ higher computation utilization efficiency against SOTA quantization-based ViT accelerators. Codes are available at https://github.com/shihuihong214/P2-ViT.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.18111 [pdf, other]

ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator

Authors: Junda Zhu, Lingyong Yan, Haibo Shi, Dawei Yin, Lei Sha

Abstract: Large language models (LLMs) are proven to benefit a lot from retrieval-augmented generation (RAG) in alleviating hallucinations when confronted with knowledge-intensive questions. RAG adopts information retrieval techniques to inject external knowledge from semantic-relevant documents as input contexts. However, because today's Internet is flooded with noisy and fabricated content, it is inevitable that RAG systems are vulnerable to this noise and prone to respond incorrectly. To this end, we propose to optimize the retrieval-augmented Generator with an Adversarial Tuning Multi-agent system (ATM). The ATM steers the Generator to have a robust perspective of useful documents for question answering with the help of an auxiliary Attacker agent. The Generator and the Attacker are tuned adversarially for several iterations. After rounds of multi-agent iterative tuning, the Generator can eventually better discriminate useful documents amongst fabrications. The experimental results verify the effectiveness of ATM and we also observe that the Generator can achieve better performance compared to state-of-the-art baselines.

Submitted 16 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 18 pages, 7 figures

arXiv:2405.17900 [pdf, other]

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Authors: Haoxiang Shi, Xulong Zhang, Ning Cheng, Yong Zhang, Jun Yu, Jing Xiao, Jianzong Wang

Abstract: The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared information between modalities was not processed to generate emotions, leading to an information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network mainly includes two stages: the multi-modal feature fusion stage based on connection vectors and the emotion classification stage based on fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, demonstrating excellent performance on the IEMOCAP and MELD datasets.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

arXiv:2405.17777 [pdf, other]

RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Authors: Jianzong Wang, Haoxiang Shi, Kaiyi Luo, Xulong Zhang, Ning Cheng, Jing Xiao

Abstract: Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be only partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors, which are computed by efficient linear reconstruction, are embedded into the latent subspace. The anchors are sampled from paired data, which improves the efficiency of hash learning. RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well-devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed method is tested with partially paired data to establish its superiority over several existing methods.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

arXiv:2405.17028 [pdf, other]

RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Authors: Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

Abstract: Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data

arXiv:2405.16847

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Authors: Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu

Abstract: Autoregressive next-token prediction is a standard pretraining method for large-scale language models, but its application to vision tasks is hindered by the non-sequential nature of image data, leading to cumulative errors. Most vision models employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce TokenUnify, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence demonstrating that TokenUnify mitigates cumulative errors in visual autoregression. In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution, ideal for creating spatially correlated long sequences. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date and providing a unified benchmark for experimental validation. Leveraging the Mamba network, which is inherently suited for long-sequence modeling, on this dataset, TokenUnify not only reduces the computational complexity but also leads to a significant 45% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://github.com/ydchen0806/TokenUnify.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16533

Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

Authors: Zhengliang Shi, Shen Gao, Xiuyi Chen, Yue Feng, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen, Suzan Verberne, Zhaochun Ren

Abstract: Augmenting large language models (LLMs) with external tools has emerged as a promising approach to extend their utility, empowering them to solve practical tasks. Existing work typically empowers LLMs as tool users with a manually designed workflow, where the LLM plans a series of tools in a step-by-step manner, and sequentially executes each tool to obtain intermediate results until deriving the final answer. However, these approaches suffer from two challenges in realistic scenarios: (1) the handcrafted control flow is often ad hoc and constrains the LLM to local planning; (2) the LLM is instructed to use only manually demonstrated tools or well-trained Python functions, which limits its generalization to new tools. In this work, we first propose Automatic Tool Chain (ATC), a framework that enables the LLM to act as a multi-tool user, which directly utilizes a chain of tools through programming. To scale up the scope of the tools, we next propose a black-box probing method. This further empowers the LLM as a tool learner that can actively discover and document tool usages, teaching itself to properly master new tools. For a comprehensive evaluation, we build a challenging benchmark named ToolFlow, which diverges from previous benchmarks by its long-term planning scenarios and complex toolset. Experiments on both existing datasets and ToolFlow illustrate the superiority of our framework. Analysis of different settings also validates the effectiveness and the utility of our black-box probing algorithm.

Submitted 26 May, 2024; originally announced May 2024.

Comments: Work in progress

Showing 1–50 of 488 results for author: Shi, H