-
ByteCheckpoint: A Unified Checkpointing System for LLM Development
Authors:
Borui Wan,
Mingji Han,
Yiyao Sheng,
Zhichao Lai,
Mofan Zhang,
Junda Zhang,
Yanghua Peng,
Haibin Lin,
Xin Liu,
Chuan Wu
Abstract:
The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, signif…
▽ More
The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, significantly diminishing training efficiency. Besides, when transferring checkpoints across tasks, checkpoint resharding, defined as loading checkpoints into parallel configurations differing from those used for saving, is often required according to the characteristics and resource quota of specific tasks. Previous checkpointing systems [16,3,33,6] assume consistent parallel configurations, failing to address the complexities of checkpoint transformation during resharding. Furthermore, in the industry platform, developers create checkpoints from different training frameworks[23,36,21,11], each with its own unique storage and I/O logic. This diversity complicates the implementation of unified checkpoint management and optimization. To address these challenges, we introduce ByteCheckpoint, a PyTorch-native multi-framework LLM checkpointing system that supports automatic online checkpoint resharding. ByteCheckpoint employs a data/metadata disaggregated storage architecture, decoupling checkpoint storage from the adopted parallelism strategies and training frameworks. We design an efficient asynchronous tensor merging technique to settle the irregular tensor sharding problem and propose several I/O performance optimizations to significantly enhance the efficiency of checkpoint saving and loading. Experimental results demonstrate ByteCheckpoint's substantial advantages in reducing checkpoint saving (by up to 529.22X) and loading (by up to 3.51X) costs, compared to baseline methods.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning
Authors:
Baoyuan Wu,
Hongrui Chen,
Mingda Zhang,
Zihao Zhu,
Shaokui Wei,
Danni Yuan,
Mingli Zhu,
Ruotong Wang,
Li Liu,
Chao Shen
Abstract:
As an emerging approach to explore the vulnerability of deep neural networks (DNNs), backdoor learning has attracted increasing interest in recent years, and many seminal backdoor attack and defense algorithms are being developed successively or concurrently, in the status of a rapid arms race. However, mainly due to the diverse settings, and the difficulties of implementation and reproducibility…
▽ More
As an emerging approach to explore the vulnerability of deep neural networks (DNNs), backdoor learning has attracted increasing interest in recent years, and many seminal backdoor attack and defense algorithms are being developed successively or concurrently, in the status of a rapid arms race. However, mainly due to the diverse settings, and the difficulties of implementation and reproducibility of existing works, there is a lack of a unified and standardized benchmark of backdoor learning, causing unfair comparisons or unreliable conclusions (e.g., misleading, biased or even false conclusions). Consequently, it is difficult to evaluate the current progress and design the future development roadmap of this literature. To alleviate this dilemma, we build a comprehensive benchmark of backdoor learning called BackdoorBench. Our benchmark makes three valuable contributions to the research community. 1) We provide an integrated implementation of state-of-the-art (SOTA) backdoor learning algorithms (currently including 20 attack and 32 defense algorithms), based on an extensible modular-based codebase. 2) We conduct comprehensive evaluations with 5 poisoning ratios, based on 4 models and 4 datasets, leading to 11,492 pairs of attack-against-defense evaluations in total. 3) Based on above evaluations, we present abundant analysis from 10 perspectives via 18 useful analysis tools, and provide several inspiring insights about backdoor learning. We hope that our efforts could build a solid foundation of backdoor learning to facilitate researchers to investigate existing algorithms, develop more innovative algorithms, and explore the intrinsic mechanism of backdoor learning. Finally, we have created a user-friendly website at http://backdoorbench.com, which collects all important information of BackdoorBench, including codebase, docs, leaderboard, and model Zoo.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Garment Animation NeRF with Color Editing
Authors:
Renke Wang,
Meng Zhang,
Jun Li,
Jian Yan
Abstract:
Generating high-fidelity garment animations through traditional workflows, from modeling to rendering, is both tedious and expensive. These workflows often require repetitive steps in response to updates in character motion, rendering viewpoint changes, or appearance edits. Although recent neural rendering offers an efficient solution for computationally intensive processes, it struggles with rend…
▽ More
Generating high-fidelity garment animations through traditional workflows, from modeling to rendering, is both tedious and expensive. These workflows often require repetitive steps in response to updates in character motion, rendering viewpoint changes, or appearance edits. Although recent neural rendering offers an efficient solution for computationally intensive processes, it struggles with rendering complex garment animations containing fine wrinkle details and realistic garment-and-body occlusions, while maintaining structural consistency across frames and dense view rendering. In this paper, we propose a novel approach to directly synthesize garment animations from body motion sequences without the need for an explicit garment proxy. Our approach infers garment dynamic features from body motion, providing a preliminary overview of garment structure. Simultaneously, we capture detailed features from synthesized reference images of the garment's front and back, generated by a pre-trained image model. These features are then used to construct a neural radiance field that renders the garment animation video. Additionally, our technique enables garment recoloring by decomposing its visual elements. We demonstrate the generalizability of our method across unseen body motions and camera views, ensuring detailed structural consistency. Furthermore, we showcase its applicability to color editing on both real and synthetic garment data. Compared to existing neural rendering techniques, our method exhibits qualitative and quantitative improvements in garment dynamics and wrinkle detail modeling. Code is available at \url{https://github.com/wrk226/GarmentAnimationNeRF}.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Authors:
Xin Zhang,
Yanzhao Zhang,
Dingkun Long,
Wen Xie,
Ziqi Dai,
Jialong Tang,
Huan Lin,
Baosong Yang,
Pengjun Xie,
Fei Huang,
Meishan Zhang,
Wenjie Li,
Min Zhang
Abstract:
We present systematic efforts in building long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than 512 of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive lea…
▽ More
We present systematic efforts in building long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a text encoder (base size) enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than 512 of previous multilingual encoders). Then we construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms the same-sized previous state-of-the-art XLM-R. Meanwhile, our TRM and reranker match the performance of large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrate that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit various researches and industrial applications.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
Take A Step Back: Rethinking the Two Stages in Visual Reasoning
Authors:
Mingyu Zhang,
Jiting Cai,
Mingyu Liu,
Yue Xu,
Cewu Lu,
Yong-Lu Li
Abstract:
Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning…
▽ More
Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets thus lacking generalization ability. Through rigorous evaluation of diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to data bias fitting. In this paper, we revisit visual reasoning with a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage is better at generalization than symbolization. Thus, it is more efficient to implement symbolization via separated encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks following the separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
\textsc{Perm}: A Parametric Representation for Multi-Style 3D Hair Modeling
Authors:
Chengan He,
Xin Sun,
Zhixin Shu,
Fujun Luan,
Sören Pirk,
Jorge Alejandro Amador Herrera,
Dominik L. Michels,
Tuanfeng Y. Wang,
Meng Zhang,
Holly Rushmeier,
Yi Zhou
Abstract:
We present \textsc{Perm}, a learned parametric model of human 3D hair designed to facilitate various hair-related applications. Unlike previous work that jointly models the global hair shape and local strand details, we propose to disentangle them using a PCA-based strand representation in the frequency domain, thereby allowing more precise editing and output control. Specifically, we leverage our…
▽ More
We present \textsc{Perm}, a learned parametric model of human 3D hair designed to facilitate various hair-related applications. Unlike previous work that jointly models the global hair shape and local strand details, we propose to disentangle them using a PCA-based strand representation in the frequency domain, thereby allowing more precise editing and output control. Specifically, we leverage our strand representation to fit and decompose hair geometry textures into low- to high-frequency hair structures. These decomposed textures are later parameterized with different generative models, emulating common stages in the hair modeling process. We conduct extensive experiments to validate the architecture design of \textsc{Perm}, and finally deploy the trained model as a generic prior to solve task-agnostic problems, further showcasing its flexibility and superiority in tasks such as 3D hair parameterization, hairstyle interpolation, single-view hair reconstruction, and hair-conditioned image generation. Our code and data will be available at: \url{https://github.com/c-he/perm}.
△ Less
Submitted 28 July, 2024;
originally announced July 2024.
-
DTFormer: A Transformer-Based Method for Discrete-Time Dynamic Graph Representation Learning
Authors:
Xi Chen,
Yun Xiong,
Siwei Zhang,
Jiawei Zhang,
Yao Zhang,
Shiyang Zhou,
Xixi Wu,
Mingyang Zhang,
Tengfei Liu,
Weiqiang Wang
Abstract:
Discrete-Time Dynamic Graphs (DTDGs), which are prevalent in real-world implementations and notable for their ease of data acquisition, have garnered considerable attention from both academic researchers and industry practitioners. The representation learning of DTDGs has been extensively applied to model the dynamics of temporally changing entities and their evolving connections. Currently, DTDG…
▽ More
Discrete-Time Dynamic Graphs (DTDGs), which are prevalent in real-world implementations and notable for their ease of data acquisition, have garnered considerable attention from both academic researchers and industry practitioners. The representation learning of DTDGs has been extensively applied to model the dynamics of temporally changing entities and their evolving connections. Currently, DTDG representation learning predominantly relies on GNN+RNN architectures, which manifest the inherent limitations of both Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs). GNNs suffer from the over-smoothing issue as the models architecture goes deeper, while RNNs struggle to capture long-term dependencies effectively. GNN+RNN architectures also grapple with scaling to large graph sizes and long sequences. Additionally, these methods often compute node representations separately and focus solely on individual node characteristics, thereby overlooking the behavior intersections between the two nodes whose link is being predicted, such as instances where the two nodes appear together in the same context or share common neighbors.
This paper introduces a novel representation learning method DTFormer for DTDGs, pivoting from the traditional GNN+RNN framework to a Transformer-based architecture. Our approach exploits the attention mechanism to concurrently process topological information within the graph at each timestamp and temporal dynamics of graphs along the timestamps, circumventing the aforementioned fundamental weakness of both GNNs and RNNs. Moreover, we enhance the model's expressive capability by incorporating the intersection relationships among nodes and integrating a multi-patching module. Extensive experiments conducted on six public dynamic graph benchmark datasets confirm our model's efficacy, achieving the SOTA performance.
△ Less
Submitted 26 July, 2024;
originally announced July 2024.
-
AMA-LSTM: Pioneering Robust and Fair Financial Audio Analysis for Stock Volatility Prediction
Authors:
Shengkun Wang,
Taoran Ji,
Jianfeng He,
Mariam Almutairi,
Dan Wang,
Linhan Wang,
Min Zhang,
Chang-Tien Lu
Abstract:
Stock volatility prediction is an important task in the financial industry. Recent advancements in multimodal methodologies, which integrate both textual and auditory data, have demonstrated significant improvements in this domain, such as earnings calls (Earnings calls are public available and often involve the management team of a public company and interested parties to discuss the company's ea…
▽ More
Stock volatility prediction is an important task in the financial industry. Recent advancements in multimodal methodologies, which integrate both textual and auditory data, have demonstrated significant improvements in this domain, such as earnings calls (Earnings calls are public available and often involve the management team of a public company and interested parties to discuss the company's earnings). However, these multimodal methods have faced two drawbacks. First, they often fail to yield reliable models and overfit the data due to their absorption of stochastic information from the stock market. Moreover, using multimodal models to predict stock volatility suffers from gender bias and lacks an efficient way to eliminate such bias. To address these aforementioned problems, we use adversarial training to generate perturbations that simulate the inherent stochasticity and bias, by creating areas resistant to random information around the input space to improve model robustness and fairness. Our comprehensive experiments on two real-world financial audio datasets reveal that this method exceeds the performance of current state-of-the-art solution. This confirms the value of adversarial training in reducing stochasticity and bias for stock volatility prediction tasks.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography
Authors:
Kailai Zhou,
Lijing Cai,
Yibo Wang,
Mengya Zhang,
Bihan Wen,
Qiu Shen,
Xun Cao
Abstract:
The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spect…
▽ More
The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spectral decomposition model guided enhancement framework, which consists of two steps: joint decomposition and prior-guided enhancement. Firstly, we leverage the complementarity between RGB and Low-resolution Multi-Spectral Images (Lr-MSI) to predict shading, reflectance, and material semantic priors. Subsequently, these priors are seamlessly integrated into the established HDRNet to promote dynamic range enhancement, color mapping, and grid expert learning, respectively. Additionally, we construct a high-quality Mobile-Spec dataset to support our research, and our experiments validate the effectiveness of Lr-MSI in the tone enhancement task. This work aims to establish a solid foundation for advancing spectral vision in mobile photography. The code is available at \url{https://github.com/CalayZhou/JDM-HDRNet}.
△ Less
Submitted 25 July, 2024;
originally announced July 2024.
-
Simultaneous Trajectory Optimization and Contact Selection for Contact-rich Manipulation with High-Fidelity Geometry
Authors:
Mengchao Zhang,
Devesh K. Jha,
Arvind U. Raghunathan,
Kris Hauser
Abstract:
Contact-implicit trajectory optimization (CITO) is an effective method to plan complex trajectories for various contact-rich systems including manipulation and locomotion. CITO formulates a mathematical program with complementarity constraints (MPCC) that enforces that contact forces must be zero when points are not in contact. However, MPCC solve times increase steeply with the number of allowabl…
▽ More
Contact-implicit trajectory optimization (CITO) is an effective method to plan complex trajectories for various contact-rich systems including manipulation and locomotion. CITO formulates a mathematical program with complementarity constraints (MPCC) that enforces that contact forces must be zero when points are not in contact. However, MPCC solve times increase steeply with the number of allowable points of contact, which limits CITO's applicability to problems in which only a few, simple geometries are allowed to make contact. This paper introduces simultaneous trajectory optimization and contact selection (STOCS), as an extension of CITO that overcomes this limitation. The innovation of STOCS is to identify salient contact points and times inside the iterative trajectory optimization process. This effectively reduces the number of variables and constraints in each MPCC invocation. The STOCS framework, instantiated with key contact identification subroutines, renders the optimization of manipulation trajectories computationally tractable even for high-fidelity geometries consisting of tens of thousands of vertices.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal
Authors:
Yeying Jin,
Xin Li,
Jiadong Wang,
Yan Zhang,
Malu Zhang
Abstract:
Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras with a focus on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lac…
▽ More
Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras with a focus on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop images and 9,744 nighttime raindrop images. Specifically, the 5,442 daytime images include 3,606 raindrop- and 1,836 background-focused images. While the 9,744 nighttime images contain 4,838 raindrop- and 4,906 background-focused images. Our dataset will enable the community to explore background-focused and raindrop-focused images, including challenges unique to daytime and nighttime conditions. Our data and code are available at: \url{https://github.com/jinyeying/RaindropClarity}
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
Authors:
Zhuowan Li,
Cheng Li,
Mingyang Zhang,
Qiaozhu Mei,
Michael Bendersky
Abstract:
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and L…
▽ More
Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG's significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining a comparable performance to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning
Authors:
Pin-Jie Lin,
Miaoran Zhang,
Marius Mosbach,
Dietrich Klakow
Abstract:
Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that the transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We comp…
▽ More
Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that the transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights can better estimate task transferability by improving task prediction scores from 2.59% to 3.96%. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is better predictive for predicting transferability compared to averaging weights.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
SeqMIA: Sequential-Metric Based Membership Inference Attack
Authors:
Hao Li,
Zheng Li,
Siyuan Wu,
Chengrui Hu,
Yutong Ye,
Min Zhang,
Dengguo Feng,
Yang Zhang
Abstract:
Most existing membership inference attacks (MIAs) utilize metrics (e.g., loss) calculated on the model's final state, while recent advanced attacks leverage metrics computed at various stages, including both intermediate and final stages, throughout the model training. Nevertheless, these attacks often process multiple intermediate states of the metric independently, ignoring their time-dependent…
▽ More
Most existing membership inference attacks (MIAs) utilize metrics (e.g., loss) calculated on the model's final state, while recent advanced attacks leverage metrics computed at various stages, including both intermediate and final stages, throughout the model training. Nevertheless, these attacks often process multiple intermediate states of the metric independently, ignoring their time-dependent patterns. Consequently, they struggle to effectively distinguish between members and non-members who exhibit similar metric values, particularly resulting in a high false-positive rate.
In this study, we delve deeper into the new membership signals in the black-box scenario. We identify a new, more integrated membership signal: the Pattern of Metric Sequence, derived from the various stages of model training. We contend that current signals provide only partial perspectives of this new signal: the new one encompasses both the model's multiple intermediate and final states, with a greater emphasis on temporal patterns among them. Building upon this signal, we introduce a novel attack method called Sequential-metric based Membership Inference Attack (SeqMIA). Specifically, we utilize knowledge distillation to obtain a set of distilled models representing various stages of the target model's training. We then assess multiple metrics on these distilled models in chronological order, creating distilled metric sequence. We finally integrate distilled multi-metric sequences as a sequential multiformat and employ an attention-based RNN attack model for inference. Empirical results show SeqMIA outperforms all baselines, especially can achieve an order of magnitude improvement in terms of TPR @ 0.1% FPR. Furthermore, we delve into the reasons why this signal contributes to SeqMIA's high attack performance, and assess various defense mechanisms against SeqMIA.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
Feeling the Grass Grow: Making Midair Haptic Parameters Visible, Touchable and Controllable
Authors:
Mingxin Zhang,
Qirong Zhu,
Yasutoshi Makino,
Hiroyuki Shinoda
Abstract:
In this paper, we present an ultrasound mid-air haptic interaction system that integrates a designed visualization of haptic parameters while maintaining ease of control. The design of corresponding haptic parameters for real-world tactile textures is a complex task. Furthermore, users often face difficulties in simultaneously controlling multi-dimensional haptic parameters to achieve the desired…
▽ More
In this paper, we present an ultrasound mid-air haptic interaction system that integrates a designed visualization of haptic parameters while maintaining ease of control. The design of corresponding haptic parameters for real-world tactile textures is a complex task. Furthermore, users often face difficulties in simultaneously controlling multi-dimensional haptic parameters to achieve the desired vibration feedback. To address these challenges, the SLS optimization method facilitates user control of these multi-dimensional parameters through a simple one-dimensional slider. Concurrently, our system employs the "Growing Grass" metaphor to visualize haptic parameter adjustments in real-time. This approach combining visual and haptic sensations can bring richer experiences and generate a realistic sensation of touching a grassy surface. Our objective is to enhance users' intuitive comprehension of haptic parameters through this innovative system.
△ Less
Submitted 21 July, 2024;
originally announced July 2024.
-
Self-supervised transformer-based pre-training method with General Plant Infection dataset
Authors:
Zhengle Wang,
Ruifeng Wang,
Minjuan Wang,
Tianyun Lai,
Man Zhang
Abstract:
Pest and disease classification is a challenging issue in agriculture. The performance of deep learning models is intricately linked to training data diversity and quantity, posing issues for plant pest and disease datasets that remain underdeveloped. This study addresses these challenges by constructing a comprehensive dataset and proposing an advanced network architecture that combines Contrasti…
▽ More
Pest and disease classification is a challenging issue in agriculture. The performance of deep learning models is intricately linked to training data diversity and quantity, posing issues for plant pest and disease datasets that remain underdeveloped. This study addresses these challenges by constructing a comprehensive dataset and proposing an advanced network architecture that combines Contrastive Learning and Masked Image Modeling (MIM). The dataset comprises diverse plant species and pest categories, making it one of the largest and most varied in the field. The proposed network architecture demonstrates effectiveness in addressing plant pest and disease recognition tasks, achieving notable detection accuracy. This approach offers a viable solution for rapid, efficient, and cost-effective plant pest and disease detection, thereby reducing agricultural production costs. Our code and dataset will be publicly available to advance research in plant pest and disease recognition the GitHub repository at https://github.com/WASSER2545/GPID-22
△ Less
Submitted 20 July, 2024;
originally announced July 2024.
-
Interior Object Geometry via Fitted Frames
Authors:
Stephen M. Pizer,
Zhiyuan Liu,
Junjie Zhao,
Nicholas Tapp-Hughes,
James Damon,
Miaomiao Zhang,
JS Marron,
Jared Vicory
Abstract:
We describe a representation targeted for anatomic objects which is designed to enable strong locational correspondence within object populations and thus to provide powerful object statistics. The method generates fitted frames on the boundary and in the interior of objects and produces alignment-free geometric features from them. It accomplishes this by understanding an object as the diffeomorph…
▽ More
We describe a representation targeted for anatomic objects which is designed to enable strong locational correspondence within object populations and thus to provide powerful object statistics. The method generates fitted frames on the boundary and in the interior of objects and produces alignment-free geometric features from them. It accomplishes this by understanding an object as the diffeomorphic deformation of an ellipsoid and using a skeletal representation fitted throughout the deformation to produce a model of the target object, where the object is provided initially in the form of a boundary mesh. Via classification performance on hippocampi shape between individuals with a disorder vs. others, we compare our method to two state-of-the-art methods for producing object representations that are intended to capture geometric correspondence across a population of objects and to yield geometric features useful for statistics, and we show improved classification performance by this new representation, which we call the evolutionary s-rep. The geometric features that are derived from each of the representations, especially via fitted frames, is discussed.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
LeKUBE: A Legal Knowledge Update BEnchmark
Authors:
Changyue Wang,
Weihang Su,
Hu Yiran,
Qingyao Ai,
Yueyue Wu,
Cheng Luo,
Yiqun Liu,
Min Zhang,
Shaoping Ma
Abstract:
Recent advances in Large Language Models (LLMs) have significantly shaped the applications of AI in multiple fields, including the studies of legal intelligence. Trained on extensive legal texts, including statutes and legal documents, the legal LLMs can capture important legal knowledge/concepts effectively and provide important support for downstream legal applications such as legal consultancy.…
▽ More
Recent advances in Large Language Models (LLMs) have significantly shaped the applications of AI in multiple fields, including the studies of legal intelligence. Trained on extensive legal texts, including statutes and legal documents, the legal LLMs can capture important legal knowledge/concepts effectively and provide important support for downstream legal applications such as legal consultancy. Yet, the dynamic nature of legal statutes and interpretations also poses new challenges to the use of LLMs in legal applications. Particularly, how to update the legal knowledge of LLMs effectively and efficiently has become an important research problem in practice. Existing benchmarks for evaluating knowledge update methods are mostly designed for the open domain and cannot address the specific challenges of the legal domain, such as the nuanced application of new legal knowledge, the complexity and lengthiness of legal regulations, and the intricate nature of legal reasoning. To address this gap, we introduce the Legal Knowledge Update BEnchmark, i.e. LeKUBE, which evaluates knowledge update methods for legal LLMs across five dimensions. Specifically, we categorize the needs of knowledge updates in the legal domain with the help of legal professionals, and then hire annotators from law schools to create synthetic updates to the Chinese Criminal and Civil Code as well as sets of questions of which the answers would change after the updates. Through a comprehensive evaluation of state-of-the-art knowledge update methods, we reveal a notable gap between existing knowledge update methods and the unique needs of the legal domain, emphasizing the need for further research and development of knowledge update mechanisms tailored for legal LLMs.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
TorchGT: A Holistic System for Large-scale Graph Transformer Training
Authors:
Meng Zhang,
Jie Sun,
Qinghao Hu,
Peng Sun,
Zeke Wang,
Yonggang Wen,
Tianwei Zhang
Abstract:
Graph Transformer is a new architecture that surpasses GNNs in graph learning. While there emerge inspiring algorithm advancements, their practical adoption is still limited, particularly on real-world graphs involving up to millions of nodes. We observe existing graph transformers fail on large-scale graphs mainly due to heavy computation, limited scalability and inferior model quality. Motivated…
▽ More
Graph Transformer is a new architecture that surpasses GNNs in graph learning. While there emerge inspiring algorithm advancements, their practical adoption is still limited, particularly on real-world graphs involving up to millions of nodes. We observe existing graph transformers fail on large-scale graphs mainly due to heavy computation, limited scalability and inferior model quality. Motivated by these observations, we propose TorchGT, the first efficient, scalable, and accurate graph transformer training system. TorchGT optimizes training at different levels. At algorithm level, by harnessing the graph sparsity, TorchGT introduces a Dual-interleaved Attention which is computation-efficient and accuracy-maintained. At runtime level, TorchGT scales training across workers with a communication-light Cluster-aware Graph Parallelism. At kernel level, an Elastic Computation Reformation further optimizes the computation by reducing memory access latency in a dynamic way. Extensive experiments demonstrate that TorchGT boosts training by up to 62.7x and supports graph sequence lengths of up to 1M.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
DisenSemi: Semi-supervised Graph Classification via Disentangled Representation Learning
Authors:
Yifan Wang,
Xiao Luo,
Chong Chen,
Xian-Sheng Hua,
Ming Zhang,
Wei Ju
Abstract:
Graph classification is a critical task in numerous multimedia applications, where graphs are employed to represent diverse types of multimedia data, including images, videos, and social networks. Nevertheless, in real-world scenarios, labeled graph data can be limited or scarce. To address this issue, we focus on the problem of semi-supervised graph classification, which involves both supervised…
▽ More
Graph classification is a critical task in numerous multimedia applications, where graphs are employed to represent diverse types of multimedia data, including images, videos, and social networks. Nevertheless, in real-world scenarios, labeled graph data can be limited or scarce. To address this issue, we focus on the problem of semi-supervised graph classification, which involves both supervised and unsupervised models learning from labeled and unlabeled data. In contrast to recent approaches that transfer the entire knowledge from the unsupervised model to the supervised one, we argue that an effective transfer should only retain the relevant semantics that align well with the supervised task. In this paper, we propose a novel framework named DisenSemi, which learns disentangled representation for semi-supervised graph classification. Specifically, a disentangled graph encoder is proposed to generate factor-wise graph representations for both supervised and unsupervised models. Then we train two models via supervised objective and mutual information (MI)-based constraints respectively. To ensure the meaningful transfer of knowledge from the unsupervised encoder to the supervised one, we further define an MI-based disentangled consistency regularization between two models and identify the corresponding rationale that aligns well with the current graph classification task. Experimental results on a range of publicly accessible datasets reveal the effectiveness of our DisenSemi.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
Rényi-infinity constrained sampling with $d^3$ membership queries
Authors:
Yunbum Kook,
Matthew S. Zhang
Abstract:
Uniform sampling over a convex body is a fundamental algorithmic problem, yet the convergence in KL or Rényi divergence of most samplers remains poorly understood. In this work, we propose a constrained proximal sampler, a principled and simple algorithm that possesses elegant convergence guarantees. Leveraging the uniform ergodicity of this sampler, we show that it converges in the Rényi-infinity…
▽ More
Uniform sampling over a convex body is a fundamental algorithmic problem, yet the convergence in KL or Rényi divergence of most samplers remains poorly understood. In this work, we propose a constrained proximal sampler, a principled and simple algorithm that possesses elegant convergence guarantees. Leveraging the uniform ergodicity of this sampler, we show that it converges in the Rényi-infinity divergence ($\mathcal R_\infty$) with no query complexity overhead when starting from a warm start. This is the strongest of commonly considered performance metrics, implying rates in $\{\mathcal R_q, \mathsf{KL}\}$ convergence as special cases.
By applying this sampler within an annealing scheme, we propose an algorithm which can approximately sample $\varepsilon$-close to the uniform distribution on convex bodies in $\mathcal R_\infty$-divergence with $\widetilde{\mathcal{O}}(d^3\, \text{polylog} \frac{1}{\varepsilon})$ query complexity. This improves on all prior results in $\{\mathcal R_q, \mathsf{KL}\}$-divergences, without resorting to any algorithmic modifications or post-processing of the sample. It also matches the prior best known complexity in total variation distance.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Authors:
Zhong-Zhi Li,
Ming-Liang Zhang,
Fei Yin,
Zhi-Long Ji,
Jin-Feng Bai,
Zhen-Ru Pan,
Fan-Hu Zeng,
Jian Xu,
Jia-Xin Zhang,
Cheng-Lin Liu
Abstract:
Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Despite the datasets like MathVista proposed benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 ed…
▽ More
Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Despite the datasets like MathVista proposed benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, contraining 23k multimodal K12 math related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions from elementary to high school levels, provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.
△ Less
Submitted 27 June, 2024;
originally announced July 2024.
-
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
Authors:
Hongrong Cheng,
Miao Zhang,
Javen Qinfeng Shi
Abstract:
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede…
▽ More
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove no-critical channels and multi-attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to gradient-free methods.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
TEXasGAN: Tactile Texture Exploration and Synthesis System Using Generative Adversarial Network
Authors:
Mingxin Zhang,
Shun Terui,
Yasutoshi Makino,
Hiroyuki Shinoda
Abstract:
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop…
▽ More
To create more realistic experiences in human-virtual object interactions, texture rendering has become a research hotspot in recent years. Different frequency components of designed vibrations can activate texture-related sensations due to similar receptors. However, designing specific vibrations for numerous real-world materials is impractical. Therefore, this study proposes a human-in-the-loop vibration generation model based on user preferences. To enable users to easily control the generation of vibration samples with large parameter spaces, we introduce an optimization model based on Differential Subspace Search (DSS) and Generative Adversarial Network (GAN). With DSS, users can use a one-dimensional slider to easily modify the high-dimensional latent space so that the GAN can generate desired vibrations. We trained the generative model using a open dataset of tactile vibration data and selected five types of vibrations as target samples for the generation experiment. Extensive user experiments were conducted using the generated and real samples. The results indicate that our system can generate distinguishable samples that match the target characteristics. Moreover, the results also reveal a correlation between subjects' ability to distinguish real samples and their ability to distinguish generated samples.
△ Less
Submitted 25 July, 2024; v1 submitted 16 July, 2024;
originally announced July 2024.
-
TLRN: Temporal Latent Residual Networks For Large Deformation Image Registration
Authors:
Nian Wu,
Jiarui Xing,
Miaomiao Zhang
Abstract:
This paper presents a novel approach, termed {\em Temporal Latent Residual Network (TLRN)}, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase…
▽ More
This paper presents a novel approach, termed {\em Temporal Latent Residual Network (TLRN)}, to predict a sequence of deformation fields in time-series image registration. The challenge of registering time-series images often lies in the occurrence of large motions, especially when images differ significantly from a reference (e.g., the start of a cardiac cycle compared to the peak stretching phase). To achieve accurate and robust registration results, we leverage the nature of motion continuity and exploit the temporal smoothness in consecutive image frames. Our proposed TLRN highlights a temporal residual network with residual blocks carefully designed in latent deformation spaces, which are parameterized by time-sequential initial velocity fields. We treat a sequence of residual blocks over time as a dynamic training system, where each block is designed to learn the residual function between desired deformation features and current input accumulated from previous time frames. We validate the effectivenss of TLRN on both synthetic data and real-world cine cardiac magnetic resonance (CMR) image videos. Our experimental results shows that TLRN is able to achieve substantially improved registration accuracy compared to the state-of-the-art. Our code is publicly available at https://github.com/nellie689/TLRN.
△ Less
Submitted 23 July, 2024; v1 submitted 15 July, 2024;
originally announced July 2024.
-
Protecting Data Buyer Privacy in Data Markets
Authors:
Minxing Zhang,
Jian Pei
Abstract:
Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third parties, neglecting the challenges associated with protecting the privacy of data buyers. In this article, we address this gap by modeling the intri…
▽ More
Data markets serve as crucial platforms facilitating data discovery, exchange, sharing, and integration among data users and providers. However, the paramount concern of privacy has predominantly centered on protecting privacy of data owners and third parties, neglecting the challenges associated with protecting the privacy of data buyers. In this article, we address this gap by modeling the intricacies of data buyer privacy protection and investigating the delicate balance between privacy and purchase cost. Through comprehensive experimentation, our results yield valuable insights, shedding light on the efficacy and efficiency of our proposed approaches.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
Authors:
Lecheng Kong,
Jiarui Feng,
Hao Liu,
Chengsong Huang,
Jiaxin Huang,
Yixin Chen,
Muhan Zhang
Abstract:
Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph…
▽ More
Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph data into a language format for LLM-based prediction or still train a GNN model with LLM as an assistant. The former can handle unlimited tasks, while the latter captures graph structure much better -- yet, no existing work can achieve both simultaneously. In this paper, we identify three key desirable properties of a GFM: self-supervised pretraining, fluidity in tasks, and graph awareness. To account for these properties, we extend the conventional language modeling to the graph domain and propose a novel generative graph language model GOFA to solve the problem. The model interleaves randomly initialized GNN layers into a frozen pre-trained LLM so that the semantic and structural modeling abilities are organically combined. GOFA is pre-trained on newly proposed graph-level next-word prediction, question-answering, and structural tasks to obtain the above GFM properties. The pre-trained model is further fine-tuned on downstream tasks to obtain task-solving ability. The fine-tuned model is evaluated on various downstream tasks, demonstrating a strong ability to solve structural and contextual problems in zero-shot scenarios. The code is available at https://github.com/JiaruiFeng/GOFA.
△ Less
Submitted 12 July, 2024;
originally announced July 2024.
-
SLoRD: Structural Low-Rank Descriptors for Shape Consistency in Vertebrae Segmentation
Authors:
Xin You,
Yixin Lou,
Minghui Zhang,
Chuyan Zhang,
Jie Yang,
Yun Gu
Abstract:
Automatic and precise segmentation of vertebrae from CT images is crucial for various clinical applications. However, due to a lack of explicit and strict constraints, existing methods especially for single-stage methods, still suffer from the challenge of intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. For multi-stage methods, ver…
▽ More
Automatic and precise segmentation of vertebrae from CT images is crucial for various clinical applications. However, due to a lack of explicit and strict constraints, existing methods especially for single-stage methods, still suffer from the challenge of intra-vertebrae segmentation inconsistency, which refers to multiple label predictions inside a singular vertebra. For multi-stage methods, vertebrae detection serving as the first step, is affected by the pathology and mental implants. Thus, incorrect detections cause biased patches before segmentation, then lead to inconsistent labeling and segmentation. In our work, motivated by the perspective of instance segmentation, we try to label individual and complete binary masks to address this limitation. Specifically, a contour-based network is proposed based on Structural Low-Rank Descriptors for shape consistency, termed SLoRD. These contour descriptors are acquired in a data-driven manner in advance. For a more precise representation of contour descriptors, we adopt the spherical coordinate system and devise the spherical centroid. Besides, the contour loss is designed to impose explicit consistency constraints, facilitating regressed contour points close to vertebral boundaries. Quantitative and qualitative evaluations on VerSe 2019 demonstrate the superior performance of our framework over other single-stage and multi-stage state-of-the-art (SOTA) methods.
△ Less
Submitted 11 July, 2024;
originally announced July 2024.
-
Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
Authors:
Zheng Wang,
Boxiao Jin,
Zhongzhi Yu,
Minjia Zhang
Abstract:
How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-c…
▽ More
How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios due to the information loss introduced by eviction. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward merging set identification algorithm to identify suitable KV states for merging. Our merging set identification algorithm stimulates the second observation that KV cache sparsity, from similarity perspective, is independent of the dataset and remains persistent at the model level. Subsequently, we propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroScroll benchmarks, we compare our method with other KV cache compression techniques, including H2O and CaM, showing that our method achieves superior performance across tasks with both 50% and 35% KV cache budgets.
△ Less
Submitted 20 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
Authors:
Mingjin Zhang,
Yuchun Wang,
Jie Guo,
Yunsong Li,
Xinbo Gao,
Jing Zhang
Abstract:
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared i…
▽ More
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared images. Unlike a visible light camera, a thermal imager reveals an object's temperature distribution by capturing infrared radiation. Small targets often show a subtle temperature transition at the object's boundaries. To address this issue, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects. Specifically, we design a Perona-Malik diffusion (PMD)-based block and incorporate it into multiple levels of SAM's encoder to help it capture essential structural features while suppressing noise. Additionally, we devise a Granularity-Aware Decoder (GAD) to fuse the multi-granularity feature from the encoder to capture structural information that may be lost in long-distance modeling. Extensive experiments on the public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, validate the design choice of IRSAM and its significant superiority over representative state-of-the-art methods. The source code are available at: github.com/IPIC-Lab/IRSAM.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
A deep graph model for the signed interaction prediction in biological network
Authors:
Shuyi Jin,
Mengji Zhang,
Meijie Wang,
Lun Yu
Abstract:
In pharmaceutical research, the strategy of drug repurposing accelerates the development of new therapies while reducing R&D costs. Network pharmacology lays the theoretical groundwork for identifying new drug indications, and deep graph models have become essential for their precision in mapping complex biological networks. Our study introduces an advanced graph model that utilizes graph convolut…
▽ More
In pharmaceutical research, the strategy of drug repurposing accelerates the development of new therapies while reducing R&D costs. Network pharmacology lays the theoretical groundwork for identifying new drug indications, and deep graph models have become essential for their precision in mapping complex biological networks. Our study introduces an advanced graph model that utilizes graph convolutional networks and tensor decomposition to effectively predict signed chemical-gene interactions. This model demonstrates superior predictive performance, especially in handling the polar relations in biological networks. Our research opens new avenues for drug discovery and repurposing, especially in understanding the mechanism of actions of drugs.
△ Less
Submitted 10 July, 2024;
originally announced July 2024.
-
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram
Authors:
Ming-Liang Zhang,
Zhong-Zhi Li,
Fei Yin,
Liang Lin,
Cheng-Lin Liu
Abstract:
Geometry problem solving (GPS) requires capacities of multi-modal understanding, multi-hop reasoning and theorem knowledge application. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express fine-grained st…
▽ More
Geometry problem solving (GPS) requires capacities of multi-modal understanding, multi-hop reasoning and theorem knowledge application. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express fine-grained structural and semantic content of geometry diagram, and fuse diagram with textual problem efficiently through structural-semantic pre-training. For reasoning, we design an explicable solution program to describe the geometric reasoning process, and employ a self-limited decoder to generate solution program autoregressively. To reduce solution errors, a multi-level theorem verifier is proposed to eliminate solutions that do not match geometric principles, alleviating the hallucination of the neural model. We also construct a large-scale geometry problem dataset called PGPS9K, containing fine-grained annotations of textual clauses, solution program and involved knowledge tuples. Extensive experiments on datasets Geometry3K and PGPS9K show that our PGPSNet solver outperforms existing symbolic and neural solvers in GPS performance, while maintaining good explainability and reliability, and the solver components (fusion, reasoning, verification) are all justified effective.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
Authors:
Mingfang Zhang,
Yifei Huang,
Ruicong Liu,
Yoichi Sato
Abstract:
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motio…
▽ More
Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while being robust to lighting variation and occlusion. While these characteristics are intuitively valuable to help egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video. Due to the scarcity of labeled multimodal data, we design an MAE-based self-supervised pretraining method, obtaining strong multi-modal representations via modeling the natural correlation between visual and motion signals. To model the complex relation of multiple IMU devices placed across the body, we exploit the collaborative dynamics in multiple IMU devices and propose to embed the relative motion features of human joints into a graph structure. Experiments show our method can achieve state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling are further validated by experiments in more challenging scenarios, including partially missing IMU devices and video quality corruption, promoting more flexible usages in the real world.
△ Less
Submitted 9 July, 2024;
originally announced July 2024.
-
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
Authors:
Zhangyang Qi,
Yunhan Yang,
Mengchen Zhang,
Long Xing,
Xiaoyang Wu,
Tong Wu,
Dahua Lin,
Xihui Liu,
Jiaqi Wang,
Hengshuang Zhao
Abstract:
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed edit and customization of 3D assets remains a long-standing challenge. Specifically, 3D Generation methods lack the ability to follow finely detailed instructions as precisely as their 2D image creation counterparts…
▽ More
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed edit and customization of 3D assets remains a long-standing challenge. Specifically, 3D Generation methods lack the ability to follow finely detailed instructions as precisely as their 2D image creation counterparts. Imagine you can get a toy through 3D AIGC but with undesired accessories and dressing. To tackle this challenge, we propose a novel pipeline called Tailor3D, which swiftly creates customized 3D assets from editable dual-side images. We aim to emulate a tailor's ability to locally change objects or perform overall style transfer. Unlike creating 3D assets from multiple views, using dual-side images eliminates conflicts on overlapping areas that occur when editing individual views. Specifically, it begins by editing the front view, then generates the back view of the object through multi-view diffusion. Afterward, it proceeds to edit the back views. Finally, a Dual-sided LRM is proposed to seamlessly stitch together the front and back 3D features, akin to a tailor sewing together the front and back of a garment. The Dual-sided LRM rectifies imperfect consistencies between the front and back views, enhancing editing capabilities and reducing memory burdens while seamlessly integrating them into a unified 3D representation with the LoRA Triplane Transformer. Experimental results demonstrate Tailor3D's effectiveness across various 3D generation and editing tasks, including 3D generative fill and style transfer. It provides a user-friendly, efficient solution for editing 3D assets, with each editing step taking only seconds to complete.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation
Authors:
Xinying Guo,
Mingyuan Zhang,
Haozhe Xie,
Chenyang Gu,
Ziwei Liu
Abstract:
Crowd Motion Generation is essential in entertainment industries such as animation and games as well as in strategic fields like urban simulation and planning. This new task requires an intricate integration of control and generation to realistically synthesize crowd dynamics under specific spatial and semantic constraints, whose challenges are yet to be fully explored. On the one hand, existing h…
▽ More
Crowd Motion Generation is essential in entertainment industries such as animation and games as well as in strategic fields like urban simulation and planning. This new task requires an intricate integration of control and generation to realistically synthesize crowd dynamics under specific spatial and semantic constraints, whose challenges are yet to be fully explored. On the one hand, existing human motion generation models typically focus on individual behaviors, neglecting the complexities of collective behaviors. On the other hand, recent methods for multi-person motion generation depend heavily on pre-defined scenarios and are limited to a fixed, small number of inter-person interactions, thus hampering their practicality. To overcome these challenges, we introduce CrowdMoGen, a zero-shot text-driven framework that harnesses the power of Large Language Model (LLM) to incorporate the collective intelligence into the motion generation framework as guidance, thereby enabling generalizable planning and generation of crowd motions without paired training data. Our framework consists of two key components: 1) Crowd Scene Planner that learns to coordinate motions and dynamics according to specific scene contexts or introduced perturbations, and 2) Collective Motion Generator that efficiently synthesizes the required collective motions based on the holistic plans. Extensive quantitative and qualitative experiments have validated the effectiveness of our framework, which not only fills a critical gap by providing scalable and generalizable solutions for Crowd Motion Generation task but also achieves high levels of realism and flexibility.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study
Authors:
Shihan Dou,
Haoxiang Jia,
Shenxi Wu,
Huiyuan Zheng,
Weikang Zhou,
Muling Wu,
Mingxu Chai,
Jessica Fan,
Caishuang Huang,
Yunbo Tao,
Yan Liu,
Enyu Zhou,
Ming Zhang,
Yuhao Zhou,
Yueming Wu,
Rui Zheng,
Ming Wen,
Rongxiang Weng,
Jingang Wang,
Xunliang Cai,
Tao Gui,
Xipeng Qiu,
Qi Zhang,
Xuanjing Huang
Abstract:
The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundar…
▽ More
The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of these existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. Furthermore, to better understand the performance of LLMs in real-world projects, we manually created a real-world benchmark comprising 140 code generation tasks. Our analysis highlights distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Experimental results demonstrate that our approach can significantly mitigate bugs and increase the passing rate by 29.2% after two iterations, indicating substantial potential for LLMs to handle more complex problems.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning
Authors:
Xiaojie Li,
Yibo Yang,
Jianlong Wu,
Bernard Ghanem,
Liqiang Nie,
Min Zhang
Abstract:
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples while preserving the knowledge of previously learned classes. Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially, prone to overfitting to the current session. Existing dynamic strateg…
▽ More
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples while preserving the knowledge of previously learned classes. Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially, prone to overfitting to the current session. Existing dynamic strategies require the expansion of the parameter space continually, leading to increased complexity. To address these challenges, we integrate the recently proposed selective state space model (SSM) into FSCIL. Concretely, we propose a dual selective SSM projector that dynamically adjusts the projection parameters based on the intermediate features for dynamic adaptation. The dual design enables the model to maintain the robust features of base classes, while adaptively learning distinctive feature shifts for novel classes. Additionally, we develop a class-sensitive selective scan mechanism to guide dynamic adaptation. It minimizes the disruption to base-class representations caused by training on novel data, and meanwhile, forces the selective scan to perform in distinct patterns between base and novel classes. Experiments on miniImageNet, CUB-200, and CIFAR-100 demonstrate that our framework outperforms the existing state-of-the-art methods. The code is available at https://github.com/xiaojieli0903/Mamba-FSCIL.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
Vision-Braille: An End-to-End Tool for Chinese Braille Image-to-Text Translation
Authors:
Alan Wu,
Ye Yuan,
Ming Zhang
Abstract:
Visually impaired people are a large group who can only use braille for reading and writing. However, the lack of special educational resources is the bottleneck for educating them. Educational equity is a reflection of the level of social civilization, cultural equality, and individual dignity. Facilitating and improving lifelong learning channels for the visually impaired is of great significanc…
▽ More
Visually impaired people are a large group who can only use braille for reading and writing. However, the lack of special educational resources is the bottleneck for educating them. Educational equity is a reflection of the level of social civilization, cultural equality, and individual dignity. Facilitating and improving lifelong learning channels for the visually impaired is of great significance. Their written braille homework or exam papers cannot be understood by sighted teachers, because of the lack of a highly accurate braille translation system, especially in Chinese which has tone marks. braille writers often omit tone marks to save space, leading to confusion when braille with the same consonants and vowels is translated into Chinese. Previous algorithms were insufficient in extracting contextual information, resulting in low accuracy of braille translations into Chinese. This project informatively fine-tuned the mT5 model with an Encoder-decoder architecture for braille to Chinese character conversion. This research created a training set of braille and corresponding Chinese text from the Leipzig Corpora. This project significantly reduced the confusion in braille, achieving $62.4$ and $62.3$ BLEU scores in the validation and test sets, with a curriculum learning fine-tuning method. By incorporating the braille recognition algorithm, this project is the first publicly available braille translation system and can benefit lots of visually impaired students and families who are preparing for the Chinese College Test and help to propel their college dreams in the future. There is a demo on our homepage\footnote{\url{https://vision-braille.com/}}.
△ Less
Submitted 8 July, 2024;
originally announced July 2024.
-
SCATTER: Algorithm-Circuit Co-Sparse Photonic Accelerator with Thermal-Tolerant, Power-Efficient In-situ Light Redistribution
Authors:
Ziang Yin,
Nicholas Gangi,
Meng Zhang,
Jeff Zhang,
Rena Huang,
Jiaqi Gu
Abstract:
Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads. However, limited reconfigurability, high electrical-optical conversion cost, and thermal sensitivity limit the deployment of current optical analog computing engines to support power-restricted, performance-sensitive AI workloads at scale. Sparsity provides a great…
▽ More
Photonic computing has emerged as a promising solution for accelerating computation-intensive artificial intelligence (AI) workloads. However, limited reconfigurability, high electrical-optical conversion cost, and thermal sensitivity limit the deployment of current optical analog computing engines to support power-restricted, performance-sensitive AI workloads at scale. Sparsity provides a great opportunity for hardware-efficient AI accelerators. However, current dense photonic accelerators fail to fully exploit the power-saving potential of algorithmic sparsity. It requires sparsity-aware hardware specialization with a fundamental re-design of photonic tensor core topology and cross-layer device-circuit-architecture-algorithm co-optimization aware of hardware non-ideality and power bottleneck. To trim down the redundant power consumption while maximizing robustness to thermal variations, we propose SCATTER, a novel algorithm-circuit co-sparse photonic accelerator featuring dynamically reconfigurable signal path via thermal-tolerant, power-efficient in-situ light redistribution and power gating. A power-optimized, crosstalk-aware dynamic sparse training framework is introduced to explore row-column structured sparsity and ensure marginal accuracy loss and maximum power efficiency. The extensive evaluation shows that our cross-stacked optimized accelerator SCATTER achieves a 511X area reduction and 12.4X power saving with superior crosstalk tolerance that enables unprecedented circuit layout compactness and on-chip power efficiency.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
Ternary Spike-based Neuromorphic Signal Processing System
Authors:
Shuai Wang,
Dehao Zhang,
Ammar Belatreche,
Yichen Xiao,
Hongyu Qing,
Wenjie We,
Malu Zhang,
Yang Yang
Abstract:
Deep Neural Networks (DNNs) have been successfully implemented across various signal processing fields, resulting in significant enhancements in performance. However, DNNs generally require substantial computational resources, leading to significant economic costs and posing challenges for their deployment on resource-constrained edge devices. In this study, we take advantage of spiking neural net…
▽ More
Deep Neural Networks (DNNs) have been successfully implemented across various signal processing fields, resulting in significant enhancements in performance. However, DNNs generally require substantial computational resources, leading to significant economic costs and posing challenges for their deployment on resource-constrained edge devices. In this study, we take advantage of spiking neural networks (SNNs) and quantization technologies to develop an energy-efficient and lightweight neuromorphic signal processing system. Our system is characterized by two principal innovations: a threshold-adaptive encoding (TAE) method and a quantized ternary SNN (QT-SNN). The TAE method can efficiently encode time-varying analog signals into sparse ternary spike trains, thereby reducing energy and memory demands for signal processing. QT-SNN, compatible with ternary spike trains from the TAE method, quantifies both membrane potentials and synaptic weights to reduce memory requirements while maintaining performance. Extensive experiments are conducted on two typical signal-processing tasks: speech and electroencephalogram recognition. The results demonstrate that our neuromorphic signal processing system achieves state-of-the-art (SOTA) performance with a 94% reduced memory requirement. Furthermore, through theoretical energy consumption analysis, our system shows 7.5x energy saving compared to other SNN works. The efficiency and efficacy of the proposed system highlight its potential as a promising avenue for energy-efficient signal processing.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
Authors:
Haozhe Zhao,
Xiaojian Ma,
Liang Chen,
Shuzheng Si,
Rujie Wu,
Kaikai An,
Peiyu Yu,
Minjia Zhang,
Qing Li,
Baobao Chang
Abstract:
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct a…
▽ More
This paper presents UltraEdit, a large-scale (approximately 4 million editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks in existing image editing datasets like InstructPix2Pix and MagicBrush, and provide a systematic approach to producing massive and high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and reduced bias compared to datasets solely generated by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data. The dataset, code, and models can be found in https://ultra-editing.github.io.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
Authors:
Min Zhang,
Jianye Hao,
Xian Fu,
Peilong Han,
Hao Zhang,
Lei Shi,
Hongyao Tang,
Yan Zheng
Abstract:
In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFM s on embodied task planning, aiming to shed lig…
▽ More
In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFM s on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework, which encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulties, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables the automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. The MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.
△ Less
Submitted 30 July, 2024; v1 submitted 6 July, 2024;
originally announced July 2024.
-
Incremental Multiview Point Cloud Registration
Authors:
Xiaoya Cheng,
Yu Liu,
Maojun Zhang,
Shen Yan
Abstract:
In this paper, we present a novel approach for multiview point cloud registration. Different from previous researches that typically employ a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse sc…
▽ More
In this paper, we present a novel approach for multiview point cloud registration. Different from previous researches that typically employ a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse scan graph with scan retrieval and geometric verification. Then, we perform incremental registration via initialization, next scan selection and registration, Track create and continue, and Bundle Adjustment. Additionally, for detector-free matchers, we incorporate a Track refinement process. This process primarily constructs a coarse multiview registration and refines the model by adjusting the positions of the keypoints on the Track. Experiments demonstrate that the proposed framework outperforms existing multiview registration methods on three benchmark datasets. The code is available at https://github.com/Choyaa/IncreMVR.
△ Less
Submitted 6 July, 2024;
originally announced July 2024.
-
Machine Learning for Complex Systems with Abnormal Pattern by Exception Maximization Outlier Detection Method
Authors:
Zhikun Zhang,
Yiting Duan,
Xiangjun Wang,
Mingyuan Zhang
Abstract:
This paper proposes a novel fast online methodology for outlier detection called the exception maximization outlier detection method(EMODM), which employs probabilistic models and statistical algorithms to detect abnormal patterns from the outputs of complex systems. The EMODM is based on a two-state Gaussian mixture model and demonstrates strong performance in probability anomaly detection workin…
▽ More
This paper proposes a novel fast online methodology for outlier detection called the exception maximization outlier detection method(EMODM), which employs probabilistic models and statistical algorithms to detect abnormal patterns from the outputs of complex systems. The EMODM is based on a two-state Gaussian mixture model and demonstrates strong performance in probability anomaly detection working on real-time raw data rather than using special prior distribution information. We confirm this using the synthetic data from two numerical cases. For the real-world data, we have detected the short circuit pattern of the circuit system using EMODM by the current and voltage output of a three-phase inverter. The EMODM also found an abnormal period due to COVID-19 in the insured unemployment data of 53 regions in the United States from 2000 to 2024. The application of EMODM to these two real-life datasets demonstrated the effectiveness and accuracy of our algorithm.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Planning with Large Language Models for Conversational Agents
Authors:
Zhigen Li,
Jianxiang Peng,
Yanmeng Wang,
Tianhao Shen,
Minghui Zhang,
Linxi Su,
Shang Wu,
Yihang Wu,
Yuqian Wang,
Ye Wang,
Wei Hu,
Jianfeng Li,
Shaojun Wang,
Jing Xiao,
Deyi Xiong
Abstract:
Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be un…
▽ More
Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be unified with controllability, proactivity, and low manual annotation. To bridge this gap, we propose a new framework for planning-based conversational agents (PCA) powered by large language models (LLMs), which only requires humans to define tasks and goals for the LLMs. Before conversation, LLM plans the core and necessary SOP for dialogue offline. During the conversation, LLM plans the best action path online referring to the SOP, and generates responses to achieve process controllability. Subsequently, we propose a semi-automatic dialogue data creation framework and curate a high-quality dialogue dataset (PCA-D). Meanwhile, we develop multiple variants and evaluation metrics for PCA, e.g., planning with Monte Carlo Tree Search (PCA-M), which searches for the optimal dialogue action while satisfying SOP constraints and achieving the proactive of the dialogue. Experiment results show that LLMs finetuned on PCA-D can significantly improve the performance and generalize to unseen domains. PCA-M outperforms other CoT and ToT baselines in terms of conversation controllability, proactivity, task success rate, and overall logical coherence, and is applicable in industry dialogue scenarios. The dataset and codes are available at XXXX.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Feature-Specific Coefficients of Determination in Tree Ensembles
Authors:
Zhongli Jiang,
Dabao Zhang,
Min Zhang
Abstract:
Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses,…
▽ More
Tree ensemble methods provide promising predictions with models difficult to interpret. Recent introduction of Shapley values for individualized feature contributions, accompanied with several fast computing algorithms for predicted values, shows intriguing results. However, individualizing coefficients of determination, aka $R^2$, for each feature is challenged by the underlying quadratic losses, although these coefficients allow us to comparatively assess single feature's contribution to tree ensembles. Here we propose an efficient algorithm, Q-SHAP, that reduces the computational complexity to polynomial time when calculating Shapley values related to quadratic losses. Our extensive simulation studies demonstrate that this approach not only enhances computational efficiency but also improves estimation accuracy of feature-specific coefficients of determination.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Foundations and Frontiers of Graph Learning Theory
Authors:
Yu Huang,
Min Zhou,
Menglin Yang,
Zhen Wang,
Muhan Zhang,
Jie Wang,
Hong Xie,
Hao Wang,
Defu Lian,
Enhong Chen
Abstract:
Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the…
▽ More
Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the theoretical analysis framework to distill the core concepts, helps understand the key principles that drive the functionality better and guide further development. Given this surge in interest, this article provides a comprehensive summary of the theoretical foundations and breakthroughs concerning the approximation and learning behaviors intrinsic to prevalent graph learning models. Encompassing discussions on fundamental aspects such as expressiveness power, generalization, optimization, and unique phenomena such as over-smoothing and over-squashing, this piece delves into the theoretical foundations and frontier driving the evolution of graph learning. In addition, this article also presents several challenges and further initiates discussions on possible solutions.
△ Less
Submitted 7 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation
Authors:
Zhibin Lan,
Liqiang Niu,
Fandong Meng,
Jie Zhou,
Min Zhang,
Jinsong Su
Abstract:
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has be…
▽ More
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose \textit{Translatotron-V(ision)}, an end-to-end IIMT model consisting of four modules. In addition to an image encoder, and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9\% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
AcuVR: Enhancing Acupuncture Training Workflow with Virtual Reality
Authors:
Menghe Zhang,
Chen Chen,
Matin Yarmand,
Anish Rajeshkumar,
Nadir Weibel
Abstract:
Acupuncture is a widely adopted medical practice that involves inserting thin needles into specific points on the body to alleviate pain and treat various health conditions. Current learning practices heavily rely on 2D atlases and practice on peers, which are notably less intuitive and pose risks, particularly in sensitive areas such as the eyes. To address these challenges, we introduce AcuVR, a…
▽ More
Acupuncture is a widely adopted medical practice that involves inserting thin needles into specific points on the body to alleviate pain and treat various health conditions. Current learning practices heavily rely on 2D atlases and practice on peers, which are notably less intuitive and pose risks, particularly in sensitive areas such as the eyes. To address these challenges, we introduce AcuVR, a Virtual Reality (VR) based system designed to add a layer of interactivity and realism. This innovation aims to reduce the risks associated with practicing acupuncture techniques while offering more effective learning strategies. Furthermore, AcuVR incorporates medical imaging and standardized anatomy models, enabling the simulation of customized acupuncture scenarios. This feature represents a significant advancement beyond the limitations of conventional resources such as atlases and textbooks, facilitating a more immersive and personalized learning experience. The evaluation study with eight acupuncture students and practitioners revealed high participant satisfaction and pointed to the effectiveness and potential of AcuVR as a valuable addition to acupuncture training.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
LaMoD: Latent Motion Diffusion Model For Myocardial Strain Generation
Authors:
Jiarui Xing,
Nivetha Jayakumar,
Nian Wu,
Yu Wang,
Frederick H. Epstein,
Miaomiao Zhang
Abstract:
Motion and deformation analysis of cardiac magnetic resonance (CMR) imaging videos is crucial for assessing myocardial strain of patients with abnormal heart functions. Recent advances in deep learning-based image registration algorithms have shown promising results in predicting motion fields from routinely acquired CMR sequences. However, their accuracy often diminishes in regions with subtle ap…
▽ More
Motion and deformation analysis of cardiac magnetic resonance (CMR) imaging videos is crucial for assessing myocardial strain of patients with abnormal heart functions. Recent advances in deep learning-based image registration algorithms have shown promising results in predicting motion fields from routinely acquired CMR sequences. However, their accuracy often diminishes in regions with subtle appearance change, with errors propagating over time. Advanced imaging techniques, such as displacement encoding with stimulated echoes (DENSE) CMR, offer highly accurate and reproducible motion data but require additional image acquisition, which poses challenges in busy clinical flows. In this paper, we introduce a novel Latent Motion Diffusion model (LaMoD) to predict highly accurate DENSE motions from standard CMR videos. More specifically, our method first employs an encoder from a pre-trained registration network that learns latent motion features (also considered as deformation-based shape features) from image sequences. Supervised by the ground-truth motion provided by DENSE, LaMoD then leverages a probabilistic latent diffusion model to reconstruct accurate motion from these extracted features. Experimental results demonstrate that our proposed method, LaMoD, significantly improves the accuracy of motion analysis in standard CMR images; hence improving myocardial strain analysis in clinical settings for cardiac patients. Our code will be publicly available on upon acceptance.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.