-
Generative Technology for Human Emotion Recognition: A Scope Review
Authors:
Fei Ma,
Yucheng Yuan,
Yifan Xie,
Hongwei Ren,
Ivan Liu,
Ying He,
Fuji Ren,
Fei Richard Yu,
Shiguang Ni
Abstract:
Affective computing stands at the forefront of artificial intelligence (AI), seeking to imbue machines with the ability to comprehend and respond to human emotions. Central to this field is emotion recognition, which endeavors to identify and interpret human emotional states from different modalities, such as speech, facial images, text, and physiological signals. In recent years, important progress has been made in generative models, including Autoencoder, Generative Adversarial Network, Diffusion Model, and Large Language Model. These models, with their powerful data generation capabilities, emerge as pivotal tools in advancing emotion recognition. However, up to now, there remains a paucity of systematic efforts that review generative technology for emotion recognition. This survey aims to bridge the gaps in the existing literature by conducting a comprehensive analysis of over 320 research papers until June 2024. Specifically, this survey will firstly introduce the mathematical principles of different generative models and the commonly used datasets. Subsequently, through a taxonomy, it will provide an in-depth analysis of how generative techniques address emotion recognition based on different modalities in several aspects, including data augmentation, feature extraction, semi-supervised learning, cross-domain, etc. Finally, the review will outline future research directions, emphasizing the potential of generative models to advance the field of emotion recognition and enhance the emotional intelligence of AI systems.
Submitted 4 July, 2024;
originally announced July 2024.
-
Augmenting LLMs to Repair Obsolete Test Cases with Static Collector and Neural Reranker
Authors:
Jun Liu,
Jiwei Yan,
Yuanyuan Xie,
Jun Yan,
Jian Zhang
Abstract:
During software evolution, it is advocated that test code should co-evolve with production code. In real development scenarios, test updating may lag behind production code changing, which may cause the project to fail to compile or bring other troubles. Existing techniques based on pre-trained language models can be adopted to repair obsolete tests caused by such unsynchronized code changes, especially syntactic-related ones. However, the lack of target-oriented contextual information affects repair accuracy on large-scale projects. Starting from an obsoleted test, the key challenging task is precisely identifying and constructing Test-Repair-Oriented Contexts (TROCtx) from the whole repository within a limited token size.
In this paper, we propose SynBCIATR (Syntactic-Breaking-Change-Induced Automated Test Repair), a novel approach to automatically repair obsolete test cases via precise and concise TROCtx construction. Inspired by developers' programming practices of the task, we design three types of TROCtx: class contexts, usage contexts, and environment contexts. For every type of TROCtx, SynBCIATR automatically collects the changed-token-related code information through static analysis techniques. Then it generates reranking queries to identify the most relevant TROCtxs, which will be taken as the repair-required key context and be input to the Large Language Model for the final test repair.
To evaluate the effectiveness of SynBCIATR, we construct a benchmark dataset that contains diverse syntactic breaking changes. The experimental results show that SynBCIATR outperforms baseline approaches both on textual- and intent-matching metrics. With the augmentation of TROCtx constructed by SynBCIATR, hallucinations are reduced by 57.1%.
Submitted 4 July, 2024;
originally announced July 2024.
-
Domain Generalizable Knowledge Tracing via Concept Aggregation and Relation-Based Attention
Authors:
Yuquan Xie,
Wanqi Yang,
Jinyu Wei,
Ming Yang,
Yang Gao
Abstract:
Knowledge Tracing (KT) is a critical task in online education systems, aiming to monitor students' knowledge states throughout a learning period. Common KT approaches involve predicting the probability of a student correctly answering the next question based on their exercise history. However, these methods often suffer from performance degradation when faced with the scarcity of student interactions in new education systems. To address this, we leverage student interactions from existing education systems to mitigate performance degradation caused by limited training data. Nevertheless, these interactions exhibit significant differences since they are derived from different education systems. To address this issue, we propose a domain generalization approach for knowledge tracing, where existing education systems are considered source domains, and new education systems with limited data are considered target domains. Additionally, we design a domain-generalizable knowledge tracing framework (DGKT) that can be applied to any KT model. Specifically, we present a concept aggregation approach designed to reduce conceptual disparities within sequences of student interactions from diverse domains. To further mitigate domain discrepancies, we introduce a novel normalization module called Sequence Instance Normalization (SeqIN). Moreover, to fully leverage exercise information, we propose a new knowledge tracing model tailored for the domain generalization KT task, named Domain-Generalizable Relation-based Knowledge Tracing (DGRKT). Extensive experiments across five benchmark datasets demonstrate that the proposed method performs well despite limited training data.
Submitted 2 July, 2024;
originally announced July 2024.
-
Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning
Authors:
Yixiao Wang,
Yifei Zhang,
Mingxiao Huo,
Ran Tian,
Xiang Zhang,
Yichen Xie,
Chenfeng Xu,
Pengliang Ji,
Wei Zhan,
Mingyu Ding,
Masayoshi Tomizuka
Abstract:
The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulations and real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and codes can be found in https://forrest-110.github.io/sparse_diffusion_policy/.
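The selective activation of experts described in this abstract can be illustrated with a minimal sketch of sparse top-k routing. This is a generic Mixture-of-Experts pattern, not SDP's actual implementation: the names `top_k_gate` and `moe_forward`, the value of `k`, and the softmax over the selected router logits are all assumptions made for illustration.

```python
import math


def top_k_gate(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected logits only (an assumed choice).
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only.

    `experts` is a list of callables mapping a vector (list of floats)
    to a vector of the same length; unselected experts are never called.
    """
    gate = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for i, w in gate.items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out
```

Because only the `k` selected experts are evaluated for a given input, the compute per input in this sketch depends on `k` rather than on the total number of experts.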
Submitted 1 July, 2024;
originally announced July 2024.
-
Heterogeneous Graph Contrastive Learning with Spectral Augmentation
Authors:
Jing Zhang,
Xiaoqian Jiang,
Yingjie Xie,
Cangqi Zhou
Abstract:
Heterogeneous graphs describe complex real-world entity relationships well. For example, online shopping networks contain multiple entity types, such as consumers and products, as well as multiple relationship types, such as purchasing and favoriting. This area is attracting growing attention because heterogeneous graph representation learning shows strong application potential in real-world scenarios. However, existing heterogeneous graph models use data augmentation techniques that capture graph structure information only from the spatial topology, ignoring the information carried in the spectral dimension of the graph structure. To address the failure of heterogeneous graph representation learning methods to model spectral information, this paper introduces a spectral-enhanced graph contrastive learning model (SHCL) and proposes a spectral augmentation algorithm for the first time in heterogeneous graph neural networks. The proposed model learns an adaptive topology augmentation scheme from the heterogeneous graph itself, perturbing the structural information of the heterogeneous graph in the spectral dimension and ultimately improving the learning effect of the model. Experimental results on multiple real-world datasets demonstrate substantial advantages of the proposed model.
Submitted 30 June, 2024;
originally announced July 2024.
-
Stochastic First-Order Methods with Non-smooth and Non-Euclidean Proximal Terms for Nonconvex High-Dimensional Stochastic Optimization
Authors:
Yue Xie,
Jiawen Bi,
Hongcheng Liu
Abstract:
When the nonconvex problem is complicated by stochasticity, the sample complexity of stochastic first-order methods may depend linearly on the problem dimension, which is undesirable for large-scale problems. In this work, we propose dimension-insensitive stochastic first-order methods (DISFOMs) to address nonconvex optimization with an expected-valued objective function. Our algorithms allow for non-Euclidean and non-smooth distance functions as the proximal terms. Under mild assumptions, we show that DISFOM using minibatches to estimate the gradient enjoys a sample complexity of $\mathcal{O}((\log d)/\epsilon^4)$ to obtain an $\epsilon$-stationary point. Furthermore, we prove that DISFOM employing variance reduction can sharpen this bound to $\mathcal{O}((\log d)^{2/3}/\epsilon^{10/3})$, which perhaps leads to the best-known sample complexity result in terms of $d$. We provide two choices of the non-smooth distance functions, both of which allow for closed-form solutions to the proximal step. Numerical experiments are conducted to illustrate the dimension-insensitive property of the proposed frameworks.
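To make the gap between the two bounds in this abstract concrete, the short calculation below divides the minibatch rate by the variance-reduced rate. It ignores the constants hidden by the $\mathcal{O}$ notation, so it should be read only as an indication of how the improvement scales, not as a statement from the paper.

```latex
\frac{(\log d)/\epsilon^{4}}{(\log d)^{2/3}/\epsilon^{10/3}}
  = (\log d)^{1 - 2/3}\,\epsilon^{-4 + 10/3}
  = (\log d)^{1/3}\,\epsilon^{-2/3}
```

Up to constants, the saving from variance reduction therefore grows as the target accuracy $\epsilon$ shrinks and, more slowly, as the dimension $d$ grows.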
Submitted 27 June, 2024;
originally announced June 2024.
-
SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
Authors:
Yuxin Xie,
Tao Zhou,
Yi Zhou,
Geng Chen
Abstract:
Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while maintaining segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and simultaneously studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.
Submitted 28 June, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis
Authors:
Vu Minh Hieu Phan,
Yutong Xie,
Bowen Zhang,
Yuankai Qi,
Zhibin Liao,
Antonios Perperidis,
Son Lam Phung,
Johan W. Verjans,
Minh-Son To
Abstract:
Unpaired medical image synthesis aims to provide complementary information for accurate clinical diagnosis and to address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in image translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image synthesis, particularly in synthesizing structural details. This paper empirically demonstrates that, lacking strong inductive biases, Transformers can converge to non-optimal solutions in the absence of paired data. To address this, we introduce UNet Structured Transformer (UNest), a novel architecture incorporating structural inductive biases for unpaired medical image synthesis. We leverage the foundational Segment-Anything Model to precisely extract the foreground structure and perform structural attention within the main anatomy. This guides the model to learn key anatomical regions, thus improving structural synthesis under the lack of supervision in unpaired training. Evaluated on two public datasets spanning three modalities (MR, CT, and PET), UNest improves on recent methods by up to 19.30% across six medical image synthesis tasks. Our code is released at https://github.com/HieuPhan33/MICCAI2024-UNest.
Submitted 27 June, 2024;
originally announced June 2024.
-
Pre-Trained Vision-Language Models as Partial Annotators
Authors:
Qian-Wei Wang,
Yuqiu Xie,
Letian Zhang,
Zimo Liu,
Shu-Tao Xia
Abstract:
Pre-trained vision-language models learn from massive data to model unified representations of images and natural languages, which can be widely applied to downstream machine learning tasks. In addition to zero-shot inference, in order to better adapt pre-trained models to the requirements of downstream tasks, people usually use methods such as few-shot or parameter-efficient fine-tuning and knowledge distillation. However, annotating samples is laborious, while a large number of unlabeled samples can be easily obtained. In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for pre-trained model application and experiment on image classification tasks. Specifically, based on CLIP, we annotate image samples with multiple prompt templates to obtain multiple candidate labels, forming a noisy partial-label dataset, and design a collaborative consistency regularization algorithm to solve this problem. Our method simultaneously trains two neural networks, which collaboratively purify training labels for each other and obtain pseudo-labels for self-training, while adopting prototypical similarity alignment and noisy supervised contrastive learning to optimize model representation. In experiments, our method achieves performance far beyond zero-shot inference without introducing additional label information, outperforms other weakly supervised learning and few-shot fine-tuning methods, and yields smaller deployed models. Our code is available at: https://anonymous.4open.science/r/Co-Reg-8CF9.
Submitted 23 May, 2024;
originally announced June 2024.
-
A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge
Authors:
Xiaopeng Wang,
Yi Lu,
Xin Qi,
Zhiyong Wang,
Yuankun Xie,
Shuchen Shi,
Ruibo Fu
Abstract:
This paper presents the development of a speech synthesis system for the LIMMITS'24 Challenge, focusing primarily on Track 2. The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers. The system was trained using challenge data and fine-tuned for few-shot voice cloning on target speakers. Evaluation included both mono-lingual and cross-lingual synthesis across all seven languages, with subjective tests assessing naturalness and speaker similarity. Our system uses the VITS2 architecture, augmented with a multi-lingual ID and a BERT model to enhance contextual language comprehension. In Track 1, where no additional data usage was permitted, our model achieved a Speaker Similarity score of 4.02. In Track 2, which allowed the use of extra data, it attained a Speaker Similarity score of 4.17.
Submitted 22 June, 2024;
originally announced June 2024.
-
WAVE: Weight Template for Adaptive Initialization of Variable-sized Models
Authors:
Fu Feng,
Yucheng Xie,
Jing Wang,
Xin Geng
Abstract:
The expansion of model parameters underscores the significance of pre-trained models; however, the constraints encountered during model deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address the initialization problem when target models are incompatible with pre-trained models. We tackle this issue from a multitasking perspective and introduce WAVE, which incorporates a set of shared Weight templates for Adaptive initialization of Variable-sizEd Models. During initialization, target models will initialize the corresponding weight scalers tailored to their model size, which are sufficient to learn the connection rules of weight templates based on the Kronecker product from a limited amount of data. For the construction of the weight templates, WAVE utilizes the Learngene framework, which structurally condenses common knowledge from ancestry models into weight templates as the learngenes through knowledge distillation. This process allows the integration of pre-trained models' knowledge into structured knowledge according to the rules of weight templates. We provide a comprehensive benchmark for the learngenes, and extensive experiments demonstrate the efficacy of WAVE. The results show that WAVE achieves state-of-the-art performance when initializing models with various depth and width, and even outperforms the direct pre-training of $n$ entire models, particularly for smaller models, saving approximately $n\times$ and $5\times$ in computational and storage resources, respectively. WAVE simultaneously achieves the most efficient knowledge transfer across a series of datasets, specifically achieving an average improvement of 1.8% and 1.2% on 7 downstream datasets.
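The idea of combining shared weight templates with size-specific weight scalers through the Kronecker product can be sketched as follows. The abstract does not give the exact formula, so the form used here, a sum of Kronecker products of each scaler with its template, is an assumption, and the names `kron` and `build_weight` are hypothetical. Matrices are plain lists of lists to keep the example dependency-free.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists.

    Block (i, j) of the result equals a[i][j] * b.
    """
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def build_weight(scalers, templates):
    """Assemble a weight matrix as sum_i kron(scalers[i], templates[i]).

    Assumed form for illustration only: the templates are shared across
    model sizes, and the shape of the scalers sets the target size.
    """
    total = None
    for s, t in zip(scalers, templates):
        block = kron(s, t)
        if total is None:
            total = block
        else:
            total = [[x + y for x, y in zip(r1, r2)]
                     for r1, r2 in zip(total, block)]
    return total
```

In this sketch, two 2x2 templates with 1x1 scalers produce a 2x2 weight, while the same templates with 2x1 scalers produce a 4x2 weight, which is how one set of templates could serve models of different widths.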
Submitted 25 June, 2024;
originally announced June 2024.
-
ObjectNLQ @ Ego4D Episodic Memory Challenge 2024
Authors:
Yisen Feng,
Haoyu Zhang,
Yuquan Xie,
Zaijing Li,
Meng Liu,
Liqiang Nie
Abstract:
In this report, we present our approach for the Natural Language Query track and Goal Step track of the Ego4D Episodic Memory Benchmark at CVPR 2024. Both challenges require the localization of actions within long video sequences using textual queries. To enhance localization accuracy, our method not only processes the temporal information of videos but also identifies fine-grained objects spatially within the frames. To this end, we introduce a novel approach, termed ObjectNLQ, which incorporates an object branch to augment the video representation with detailed object information, thereby improving grounding efficiency. ObjectNLQ achieves a mean R@1 of 23.15, ranking 2nd in the Natural Language Queries Challenge, and gains 33.00 in terms of the metric R@1, IoU=0.3, ranking 3rd in the Goal Step Challenge. Our code will be released at https://github.com/Yisen-Feng/ObjectNLQ.
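For readers unfamiliar with metrics of the form "R@1, IoU=0.3", the sketch below shows one common way to compute them for temporal grounding: a query counts as a hit when the top-ranked predicted interval overlaps the ground-truth interval with temporal IoU at or above the threshold. The function names and the (start, end) interval format are assumptions, and the challenge's official evaluation code may differ in details.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    top1_preds[i] and gts[i] are the predicted and ground-truth intervals
    for query i.
    """
    hits = sum(
        temporal_iou(p, g) >= iou_threshold
        for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) has an intersection of 5 and a union of 15, so its IoU is 1/3 and it counts as a hit at the 0.3 threshold.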
Submitted 22 June, 2024;
originally announced June 2024.
-
HCQA @ Ego4D EgoSchema Challenge 2024
Authors:
Haoyu Zhang,
Yuquan Xie,
Yisen Feng,
Zaijing Li,
Meng Liu,
Liqiang Nie
Abstract:
In this report, we present our champion solution for the Ego4D EgoSchema Challenge at CVPR 2024. To deeply integrate the powerful egocentric captioning model and question reasoning model, we propose a novel Hierarchical Comprehension scheme for egocentric video Question Answering, named HCQA. It consists of three stages: Fine-grained Caption Generation, Context-driven Summarization, and Inference-guided Answering. Given a long-form video, HCQA captures local detailed visual information and global summarised visual information via Fine-grained Caption Generation and Context-driven Summarization, respectively. Then, in Inference-guided Answering, HCQA utilizes this hierarchical information to reason about and answer the given question. On the EgoSchema blind test set, HCQA achieves 75% accuracy in answering over 5,000 human-curated multiple-choice questions. Our code will be released at https://github.com/Hyu-Zhang/HCQA.
Submitted 22 June, 2024;
originally announced June 2024.
-
Orangutan: A Multiscale Brain Emulation-Based Artificial Intelligence Framework for Dynamic Environments
Authors:
Yong Xie
Abstract:
Achieving General Artificial Intelligence (AGI) has long been a grand challenge in the field of AI, and brain-inspired computing is widely acknowledged as one of the most promising approaches to realize this goal. This paper introduces a novel brain-inspired AI framework, Orangutan. It simulates the structure and computational mechanisms of biological brains on multiple scales, encompassing multi-compartment neuron architectures, diverse synaptic connection modalities, neural microcircuits, cortical columns, and brain regions, as well as biochemical processes including facilitation, feedforward inhibition, short-term potentiation, and short-term depression, all grounded in solid neuroscience. Building upon these highly integrated brain-like mechanisms, I have developed a sensorimotor model that simulates human saccadic eye movements during object observation. The model's algorithmic efficacy was validated through testing with the observation of handwritten digit images.
Submitted 17 June, 2024;
originally announced June 2024.
-
Convolutional dynamical sampling and some new results
Authors:
Longxiu Huang,
A. Martina Neuman,
Sui Tang,
Yuying Xie
Abstract:
In this work, we explore the dynamical sampling problem on $\ell^2(\mathbb{Z})$ driven by a convolution operator defined by a convolution kernel. This problem is inspired by the need to recover a bandlimited heat diffusion field from space-time samples and its discrete analogue. In this book chapter, we review recent results in the finite-dimensional case and extend these findings to the infinite-dimensional case, focusing on the study of the density of space-time sampling sets.
Submitted 4 July, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
CodeRAG-Bench: Can Retrieval Augment Code Generation?
Authors:
Zora Zhiruo Wang,
Akari Asai,
Xinyan Velocity Yu,
Frank F. Xu,
Yiqing Xie,
Graham Neubig,
Daniel Fried
Abstract:
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
Submitted 20 June, 2024;
originally announced June 2024.
-
Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing
Authors:
Xinbo Zhao,
Yingxue Zhang,
Xin Zhang,
Yu Yang,
Yiqun Xie,
Yanhua Li,
Jun Luo
Abstract:
Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.
Submitted 20 June, 2024;
originally announced June 2024.
-
Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models
Authors:
Jiayi Lin,
Yutao Xie,
Yue Yu,
Yibiao Yang,
Lei Zhang
Abstract:
Recently, large code generation models trained in a self-supervised manner on extensive unlabeled programming language data have achieved remarkable success. While these models acquire vast amounts of code knowledge, they perform poorly on code understanding tasks, such as code search and clone detection, as they are specifically trained for generation. Pre-training a larger encoder-only architecture model from scratch on massive code data can improve understanding performance. However, this approach is costly and time-consuming, making it suboptimal. In this paper, we pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks, significantly reducing training costs. We examine effective strategies for enabling decoder-only models to acquire robust code representations. Furthermore, we introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance in understanding tasks such as code search and clone detection. Our analysis shows that our method effectively reduces the distance between semantically identical samples in the representation space. These findings suggest the potential for unifying code understanding and generation tasks using a decoder-only structured model.
Submitted 18 June, 2024;
originally announced June 2024.
-
A Scalable and Effective Alternative to Graph Transformers
Authors:
Kaan Sancak,
Zhigang Hua,
Jin Fang,
Yan Xie,
Andrey Malevich,
Bo Long,
Muhammed Fatih Balin,
Ümit V. Çatalyürek
Abstract:
Graph Neural Networks (GNNs) have shown impressive performance in graph representation learning, but they face challenges in capturing long-range dependencies due to their limited expressive power. To address this, Graph Transformers (GTs) were introduced, utilizing a self-attention mechanism to effectively model pairwise node relationships. Despite their advantages, GTs suffer from quadratic complexity w.r.t. the number of nodes in the graph, hindering their applicability to large graphs. In this work, we present Graph-Enhanced Contextual Operator (GECO), a scalable and effective alternative to GTs that leverages neighborhood propagation and global convolutions to effectively capture local and global dependencies in quasilinear time. Our study on synthetic datasets reveals that GECO reaches a 169x speedup on a graph with 2M nodes w.r.t. optimized attention. Further evaluations on a diverse range of benchmarks showcase that GECO scales to large graphs where traditional GTs often face memory and time limitations. Notably, GECO consistently achieves comparable or superior quality compared to baselines, improving the SOTA up to 4.5%, and offering a scalable and effective solution for large-scale graph learning.
Submitted 17 June, 2024;
originally announced June 2024.
-
R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
Authors:
Shangqing Tu,
Yuanchun Wang,
Jifan Yu,
Yuyang Xie,
Yaran Shi,
Xiaozhi Wang,
Jing Zhang,
Lei Hou,
Juanzi Li
Abstract:
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Recently, various Retrieval-Augmented Large Language Models (RALLMs) are proposed to address this shortcoming. However, existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. Our toolkit, which supports popular built-in RAG workflows and allows for the incorporation of customized testing data on the specific domain, is designed to be user-friendly, modular, and extensible. We conduct an evaluation of 21 RALLMs across three task levels and two representative domains, revealing significant variations in the effectiveness of RALLMs across different tasks and domains. Our analysis emphasizes the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination. We are committed to continuously maintaining our platform at https://github.com/THU-KEG/R-Eval to facilitate both the industry and the researchers.
Submitted 17 June, 2024;
originally announced June 2024.
-
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Authors:
Bowen Jiang,
Yangxinyu Xie,
Zhuoqun Hao,
Xiaomeng Wang,
Tanwi Mallick,
Weijie J. Su,
Camillo J. Taylor,
Dan Roth
Abstract:
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.
Submitted 16 June, 2024;
originally announced June 2024.
-
HAIChart: Human and AI Paired Visualization System
Authors:
Yupeng Xie,
Yuyu Luo,
Guoliang Li,
Nan Tang
Abstract:
The growing importance of data visualization in business intelligence and data science emphasizes the need for tools that can efficiently generate meaningful visualizations from large datasets. Existing tools fall into two main categories: human-powered tools (e.g., Tableau and PowerBI), which require intensive expert involvement, and AI-powered automated tools (e.g., Draco and Table2Charts), which often fall short of guessing specific user needs. In this paper, we aim to achieve the best of both worlds. Our key idea is to initially auto-generate a set of high-quality visualizations to minimize manual effort, then refine this process iteratively with user feedback to more closely align with their needs. To this end, we present HAIChart, a reinforcement learning-based framework designed to iteratively recommend good visualizations for a given dataset by incorporating user feedback. Specifically, we propose a Monte Carlo Graph Search-based visualization generation algorithm paired with a composite reward function to efficiently explore the visualization space and automatically generate good visualizations. We devise a visualization hints mechanism to actively incorporate user feedback, thus progressively refining the visualization generation module. We further prove that the top-k visualization hints selection problem is NP-hard and design an efficient algorithm. We conduct both quantitative evaluations and user studies, showing that HAIChart significantly outperforms state-of-the-art human-powered tools (21% better at Recall and 1.8 times faster) and AI-powered automatic tools (25.1% and 14.9% better in terms of Hit@3 and R10@30, respectively).
Submitted 16 June, 2024;
originally announced June 2024.
-
PIG: Prompt Images Guidance for Night-Time Scene Parsing
Authors:
Zhifeng Xie,
Rui Qiu,
Sen Wang,
Xin Tan,
Yuan Xie,
Lizhuang Ma
Abstract:
Night-time scene parsing aims to extract pixel-level semantic information in night images, aiding downstream tasks in understanding scene object distribution. Due to limited labeled night image datasets, unsupervised domain adaptation (UDA) has become the predominant method for studying night scenes. UDA typically relies on paired day-night image pairs to guide adaptation, but this approach hampers dataset construction and restricts generalization across night scenes in different datasets. Moreover, UDA, focusing on network architecture and training strategies, faces difficulties in handling classes with few domain similarities. In this paper, we leverage Prompt Images Guidance (PIG) to enhance UDA with supplementary night knowledge. We propose a Night-Focused Network (NFNet) to learn night-specific features from both target domain images and prompt images. To generate high-quality pseudo-labels, we propose Pseudo-label Fusion via Domain Similarity Guidance (FDSG). Classes with fewer domain similarities are predicted by NFNet, which excels in parsing night features, while classes with more domain similarities are predicted by UDA, which has rich labeled semantics. Additionally, we propose two data augmentation strategies: the Prompt Mixture Strategy (PMS) and the Alternate Mask Strategy (AMS), aimed at mitigating the overfitting of the NFNet to a few prompt images. We conduct extensive experiments on four night-time datasets: NightCity, NightCity+, Dark Zurich, and ACDC. The results indicate that utilizing PIG can enhance the parsing accuracy of UDA.
Submitted 15 June, 2024;
originally announced June 2024.
-
TACCO: Task-guided Co-clustering of Clinical Concepts and Patient Visits for Disease Subtyping based on EHR Data
Authors:
Ziyang Zhang,
Hejie Cui,
Ran Xu,
Yuzhang Xie,
Joyce C. Ho,
Carl Yang
Abstract:
The growing availability of well-organized Electronic Health Records (EHR) data has enabled the development of various machine learning models towards disease risk prediction. However, existing risk prediction methods overlook the heterogeneity of complex diseases, failing to model the potential disease subtypes regarding their corresponding patient visits and clinical concept subgroups. In this work, we introduce TACCO, a novel framework that jointly discovers clusters of clinical concepts and patient visits based on a hypergraph modeling of EHR data. Specifically, we develop a novel self-supervised co-clustering framework that can be guided by the risk prediction task of specific diseases. Furthermore, we enhance the hypergraph model of EHR data with textual embeddings and enforce the alignment between the clusters of clinical concepts and patient visits through a contrastive objective. Comprehensive experiments conducted on the public MIMIC-III dataset and Emory internal CRADLE dataset over the downstream clinical tasks of phenotype classification and cardiovascular risk prediction demonstrate an average 31.25% performance improvement compared to traditional ML baselines and a 5.26% improvement on top of the vanilla hypergraph model without our co-clustering mechanism. In-depth model analysis, clustering results analysis, and clinical case studies further validate the improved utilities and insightful interpretations delivered by TACCO. Code is available at https://github.com/PericlesHat/TACCO.
Submitted 14 June, 2024;
originally announced June 2024.
-
Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions
Authors:
Hua Shen,
Tiffany Knearem,
Reshmi Ghosh,
Kenan Alkiek,
Kundan Krishna,
Yachuan Liu,
Ziqiao Ma,
Savvas Petridis,
Yi-Hao Peng,
Li Qiwei,
Sushrita Rakshit,
Chenglei Si,
Yutong Xie,
Jeffrey P. Bigham,
Frank Bentley,
Joyce Chai,
Zachary Lipton,
Qiaozhu Mei,
Rada Mihalcea,
Michael Terry,
Diyi Yang,
Meredith Ringel Morris,
Paul Resnick,
David Jurgens
Abstract:
Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions.
Submitted 17 June, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
Authors:
Benjamin Biggs,
Arjun Seshadri,
Yang Zou,
Achin Jain,
Aditya Golatkar,
Yusheng Xie,
Alessandro Achille,
Ashwin Swaminathan,
Stefano Soatto
Abstract:
We present Diffusion Soup, a compartmentalization method for Text-to-Image Generation that averages the weights of diffusion models trained on sharded data. By construction, our approach enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging. We show that Diffusion Soup samples from a point in weight space that approximates the geometric mean of the distributions of constituent datasets, which offers anti-memorization guarantees and enables zero-shot style mixing. Empirically, Diffusion Soup outperforms a paragon model trained on the union of all data shards and achieves a 30% improvement in Image Reward (.34 → .44) on domain sharded data, and a 59% improvement in IR (.37 → .59) on aesthetic data. In both cases, souping also prevails in TIFA score (respectively, 85.5 → 86.5 and 85.6 → 86.8). We demonstrate robust unlearning -- removing any individual domain shard only lowers performance by 1% in IR (.45 → .44) -- and validate our theoretical insights on anti-memorization using real data. Finally, we showcase Diffusion Soup's ability to blend the distinct styles of models finetuned on different shards, resulting in the zero-shot generation of hybrid styles.
Submitted 12 June, 2024;
originally announced June 2024.
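The re-averaging idea described in the Diffusion Soup abstract can be illustrated with a minimal sketch. It assumes uniform averaging and represents each shard model's weights as a plain dictionary of float lists; the names `average_weights` and `shard_models` are hypothetical and are not taken from the paper, whose actual procedure may differ.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniformly average parameters across models that share keys and shapes."""
    if not models:
        raise ValueError("need at least one model to average")
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }


# Toy stand-ins for models trained on three separate data shards (assumption).
shard_models = {
    "shard_a": {"w": [1.0, 2.0], "b": [0.0]},
    "shard_b": {"w": [3.0, 4.0], "b": [1.0]},
    "shard_c": {"w": [5.0, 6.0], "b": [2.0]},
}

# The "soup" is the average over all shard models.
soup = average_weights(list(shard_models.values()))

# Removing a shard needs no retraining: re-average over the remaining models.
remaining = [w for name, w in shard_models.items() if name != "shard_c"]
soup_without_c = average_weights(remaining)
```

Because the merged model has the same shape as any single shard model, adding or removing a shard changes only which models enter the average, which is consistent with the abstract's claim of no additional memory or inference cost.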
-
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Authors:
Yi Lu,
Yuankun Xie,
Ruibo Fu,
Zhengqi Wen,
Jianhua Tao,
Zhiyong Wang,
Xin Qi,
Xuefei Liu,
Yongwei Li,
Yukun Liu,
Xiaopeng Wang,
Shuchen Shi
Abstract:
With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to-end generation process, skipping the final step of vocoder processing. This poses a significant challenge for current audio deepfake detection (ADD) models based on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform. We propose the Codecfake dataset, which is generated by seven representative neural codec methods. Experimental results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.
Submitted 12 June, 2024;
originally announced June 2024.
-
VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models
Authors:
Yu Liu,
Lang Gao,
Mingxin Yang,
Yu Xie,
Ping Chen,
Xiaojin Zhang,
Wei Chen
Abstract:
Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program vulnerabilities, a more specific task related to code, and evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.
Submitted 24 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
BvSP: Broad-view Soft Prompting for Few-Shot Aspect Sentiment Quad Prediction
Authors:
Yinhao Bai,
Yalan Xie,
Xiaoyi Liu,
Yuhua Zhao,
Zhixin Han,
Mengting Hu,
Hang Gao,
Renhong Cheng
Abstract:
Aspect sentiment quad prediction (ASQP) aims to predict four aspect-based elements, including aspect term, opinion term, aspect category, and sentiment polarity. In practice, unseen aspects, due to distinct data distribution, impose many challenges for a trained neural model. Motivated by this, this work formulates ASQP into the few-shot scenario, which aims for fast adaptation in real applications. Therefore, we first construct a few-shot ASQP dataset (FSQP) that contains richer categories and is more balanced for the few-shot study. Moreover, recent methods extract quads through a generation paradigm, which involves converting the input sentence into a templated target sequence. However, they primarily focus on the utilization of a single template or the consideration of different template orders, thereby overlooking the correlations among various templates. To tackle this issue, we further propose a Broad-view Soft Prompting (BvSP) method that aggregates multiple templates with a broader view by taking into account the correlation between the different templates. Specifically, BvSP uses the pre-trained language model to select the most relevant k templates with Jensen-Shannon divergence. BvSP further introduces soft prompts to guide the pre-trained language model using the selected templates. Then, we aggregate the results of multi-templates by a voting mechanism. Empirical results demonstrate that BvSP significantly outperforms the state-of-the-art methods under four few-shot settings and other public datasets. Our code and dataset are available at https://github.com/byinhao/BvSP.
Submitted 11 June, 2024;
originally announced June 2024.
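The BvSP abstract mentions selecting the k most relevant templates with Jensen-Shannon divergence but does not say which distributions are compared. The sketch below shows only the divergence itself and one possible selection rule, under the assumption that each template is summarized by a discrete probability distribution and that "most relevant" means smallest divergence from a reference distribution. The function names, the toy distributions, and the selection rule are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def select_top_k(
    templates: Dict[str, Sequence[float]],
    reference: Sequence[float],
    k: int,
) -> List[str]:
    """Return the k template names closest to the reference (assumed rule)."""
    ranked = sorted(templates, key=lambda name: js_divergence(templates[name], reference))
    return ranked[:k]


# Toy distributions standing in for per-template model outputs (assumption).
templates = {
    "t1": [0.5, 0.5],
    "t2": [0.9, 0.1],
    "t3": [0.6, 0.4],
}
chosen = select_top_k(templates, reference=[0.5, 0.5], k=2)
```

Unlike plain KL divergence, the Jensen-Shannon form is symmetric and stays finite when one distribution assigns zero probability to an outcome, which makes it convenient for ranking.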
-
Nonlinear time-series embedding by monotone variational inequality
Authors:
Jonathan Y. Zhou,
Yao Xie
Abstract:
In the wild, we often encounter collections of sequential data such as electrocardiograms, motion capture, genomes, and natural language, and sequences may be multichannel or symbolic with nonlinear dynamics. We introduce a new method that learns low-dimensional representations of nonlinear time series without supervision and that can have provable recovery guarantees. The learned representation can be used for downstream machine-learning tasks such as clustering and classification. The method is based on the assumption that the observed sequences arise from a common domain, but each sequence obeys its own autoregressive models that are related to each other through low-rank regularization. We cast the problem as a computationally efficient convex matrix parameter recovery problem using monotone Variational Inequality and encode the common domain assumption via a low-rank constraint across the learned representations, which can learn the geometry for the entire domain as well as faithful representations for the dynamics of each individual sequence using the domain information in totality. We show the competitive performance of our method against baselines on real-world time-series data and demonstrate its effectiveness for symbolic text modeling and RNA sequence clustering.
Submitted 10 June, 2024;
originally announced June 2024.
-
Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization
Authors:
Yi Gu,
Zhendong Wang,
Yueqin Yin,
Yujia Xie,
Mingyuan Zhou
Abstract:
Aligning large language models with human preferences has emerged as a critical focus in language modeling research. Yet, integrating preference learning into Text-to-Image (T2I) generative models is still relatively uncharted territory. The Diffusion-DPO technique made initial strides by employing pairwise preference learning in diffusion models tailored for specific text prompts. We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively. This approach leverages both prompt-image pairs with identical prompts and those with semantically related content across various modalities. Furthermore, we have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs, low reproducibility, and limited interpretability prevalent in current evaluations of human preference alignment. Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0, achieving superior results in both automated evaluations of human preferences and style alignment. Our code is available at https://github.com/yigu1008/Diffusion-RPO
Submitted 10 June, 2024;
originally announced June 2024.
-
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows
Authors:
Xingjian Zhang,
Yutong Xie,
Jin Huang,
Jinge Ma,
Zhaoying Pan,
Qijia Liu,
Ziyang Xiong,
Tolga Ergen,
Dongsub Shim,
Honglak Lee,
Qiaozhu Mei
Abstract:
Scientific innovation relies on detailed workflows, which include critical steps such as analyzing literature, generating ideas, validating these ideas, interpreting results, and inspiring follow-up research. However, scientific publications that document these workflows are extensive and unstructured. This makes it difficult for both human researchers and AI systems to effectively navigate and explore the space of scientific innovation. To address this issue, we introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications -- context, key idea, method, outcome, and projected impact -- which correspond to five key steps in the research workflow. These structured summaries facilitate a variety of downstream tasks and analyses. The quality of the LLM-extracted summaries is validated by comparing them with human annotations. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset, which make various types of predictions and recommendations along the scientific workflow. MASSW holds significant potential for researchers to create and benchmark new AI methods for optimizing scientific workflows and fostering scientific innovation in the field. Our dataset is openly available at https://github.com/xingjian-zhang/massw.
Submitted 10 June, 2024;
originally announced June 2024.
-
Async Learned User Embeddings for Ads Delivery Optimization
Authors:
Mingwei Tang,
Meng Liu,
Hong Li,
Junjie Yang,
Chenglin Wei,
Boyang Li,
Dai Li,
Rengan Xu,
Yifan Xu,
Zehua Zhang,
Xiangyu Wang,
Linfeng Liu,
Yuelei Xie,
Chengye Liu,
Labib Fawaz,
Li Li,
Hongnan Wang,
Bill Zhu,
Sri Reddy
Abstract:
In recommendation systems, high-quality user embeddings can capture subtle preferences, enable precise similarity calculations, and adapt to changing preferences over time to maintain relevance. The effectiveness of recommendation systems depends on the quality of user embedding. We propose to asynchronously learn high-fidelity user embeddings for billions of users each day from sequence-based multimodal user activities through a Transformer-like large-scale feature learning module. The async learned user representations embeddings (ALURE) are further converted to user similarity graphs through graph learning and then combined with user realtime activities to retrieve highly related ads candidates for the ads delivery system. Our method shows significant gains in both offline and online experiments.
Submitted 23 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Transformer Conformal Prediction for Time Series
Authors:
Junghwan Lee,
Chen Xu,
Yao Xie
Abstract:
We present a conformal prediction method for time series using the Transformer architecture to capture long-memory and long-range dependencies. Specifically, we use the Transformer decoder as a conditional quantile estimator to predict the quantiles of prediction residuals, which are used to estimate the prediction interval. We hypothesize that the Transformer decoder benefits the estimation of the prediction interval by learning temporal dependencies across past prediction residuals. Our comprehensive experiments using simulated and real data empirically demonstrate the superiority of the proposed method compared to the existing state-of-the-art conformal prediction methods.
Submitted 7 June, 2024;
originally announced June 2024.
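As a rough illustration of the interval construction described in the Transformer conformal prediction abstract above, the sketch below turns quantiles of past prediction residuals into a prediction interval. It is not the authors' method: plain empirical quantiles stand in for the Transformer decoder's conditional quantile estimator, and the names `empirical_quantile` and `residual_interval` are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Return the smallest sample value whose empirical CDF is at least q."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # ceil(q * n) is the 1-based rank; clamp it into [1, n].
    rank = min(max(math.ceil(q * len(ordered)), 1), len(ordered))
    return ordered[rank - 1]


def residual_interval(point_prediction, past_residuals, alpha=0.1):
    """Build a (lower, upper) interval from residual quantiles.

    Assumption: residuals are defined as (observed - predicted), and
    empirical quantiles replace the learned conditional quantile estimator.
    """
    lower = point_prediction + empirical_quantile(past_residuals, alpha / 2)
    upper = point_prediction + empirical_quantile(past_residuals, 1 - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    residuals = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]
    print(residual_interval(10.0, residuals, alpha=0.3))
```

A learned estimator would condition the two quantiles on the recent residual history instead of pooling all residuals equally, which is the role the abstract assigns to the Transformer decoder.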
-
Stochastic full waveform inversion with deep generative prior for uncertainty quantification
Authors:
Yuke Xie,
Hervé Chauris,
Nicolas Desassis
Abstract:
To obtain high-resolution images of subsurface structures from seismic data, seismic imaging techniques such as Full Waveform Inversion (FWI) serve as crucial tools. However, FWI involves solving a nonlinear and often non-unique inverse problem, presenting challenges such as local minima trapping and inadequate handling of inherent uncertainties. In addressing these challenges, we propose leveraging deep generative models as the prior distribution of geophysical parameters for stochastic Bayesian inversion. This approach integrates the adjoint state gradient for efficient back-propagation from the numerical solution of partial differential equations. Additionally, we introduce explicit and implicit variational Bayesian inference methods. The explicit method computes variational distribution density using a normalizing flow-based neural network, enabling computation of the Bayesian posterior of parameters. Conversely, the implicit method employs an inference network attached to a pretrained generative model to estimate density, incorporating an entropy estimator. Furthermore, we also experimented with the Stein Variational Gradient Descent (SVGD) method as another variational inference technique, using particles. We compare these variational Bayesian inference methods with conventional Markov chain Monte Carlo (McMC) sampling. Each method is able to quantify uncertainties and to generate seismic data-conditioned realizations of subsurface geophysical parameters. This framework provides insights into subsurface structures while accounting for inherent uncertainties.
Submitted 7 June, 2024;
originally announced June 2024.
-
Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
Authors:
Jianbo Dong,
Bin Luo,
Jun Zhang,
Pengcheng Zhang,
Fei Feng,
Yikai Zhu,
Ang Liu,
Zian Chen,
Yi Shi,
Hairong Jiao,
Gang Lu,
Yu Guan,
Ennan Zhai,
Wencong Xiao,
Hanyu Zhao,
Man Yuan,
Siran Yang,
Xiang Li,
Jiamang Wang,
Rui Men,
Jianwei Zhang,
Huang Zhong,
Dennis Cai,
Yuan Xie,
Binzhang Fu
Abstract:
The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase the waiting time for GPUs. To address these challenges, this paper introduces a communication-driven solution, namely C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.
Submitted 6 June, 2024;
originally announced June 2024.
-
FOX: Coverage-guided Fuzzing as Online Stochastic Control
Authors:
Dongdong She,
Adam Storek,
Yuchong Xie,
Seoyoung Kweon,
Prashast Srivastava,
Suman Jana
Abstract:
Fuzzing is an effective technique for discovering software vulnerabilities by generating random test inputs and executing them against the target program. However, fuzzing large and complex programs remains challenging due to difficulties in uncovering deeply hidden vulnerabilities. This paper addresses the limitations of existing coverage-guided fuzzers, focusing on the scheduler and mutator components. Existing schedulers suffer from information sparsity and the inability to handle fine-grained feedback metrics. The mutators are agnostic of target program branches, leading to wasted computation and slower coverage exploration. To overcome these issues, we propose an end-to-end online stochastic control formulation for coverage-guided fuzzing. Our approach incorporates a novel scheduler and custom mutator that can adapt to branch logic, maximizing aggregate edge coverage achieved over multiple stages. The scheduler utilizes fine-grained branch distance measures to identify frontier branches, where new coverage is likely to be achieved. The mutator leverages branch distance information to perform efficient and targeted seed mutations, leading to robust progress with minimal overhead. We present FOX, a proof-of-concept implementation of our control-theoretic approach, and compare it to industry-standard coverage-guided fuzzers. 6 CPU-years of extensive evaluations on the FuzzBench dataset and complex real-world programs (a total of 38 test programs) demonstrate that FOX outperforms existing state-of-the-art fuzzers, achieving average coverage improvements up to 26.45% in real-world standalone programs and 6.59% in FuzzBench programs over the state-of-the-art AFL++. In addition, it uncovers 20 unique bugs in popular real-world applications, including eight that were previously unknown, showcasing real-world security impact.
Submitted 6 June, 2024;
originally announced June 2024.
-
Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection
Authors:
Xiaopeng Wang,
Ruibo Fu,
Zhengqi Wen,
Zhiyong Wang,
Yuankun Xie,
Yukun Liu,
Jianhua Tao,
Xuefei Liu,
Yongwei Li,
Xin Qi,
Yi Lu,
Shuchen Shi
Abstract:
The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, aiming for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.
Submitted 9 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy
Authors:
Yuankun Xie,
Ruibo Fu,
Zhengqi Wen,
Zhiyong Wang,
Xiaopeng Wang,
Haonnan Cheng,
Long Ye,
Jianhua Tao
Abstract:
With the proliferation of deepfake audio, there is an urgent need to investigate its attribution. Current source tracing methods can effectively distinguish in-distribution (ID) categories. However, the rapid evolution of deepfake algorithms poses a critical challenge in the accurate identification of out-of-distribution (OOD) novel deepfake algorithms. In this paper, we propose the Real Emphasis and Fake Dispersion (REFD) strategy for audio deepfake algorithm recognition, demonstrating its effectiveness in discriminating ID samples while identifying OOD samples. For effective OOD detection, we first explore current post-hoc OOD methods and propose NSD, a novel OOD approach for identifying novel deepfake algorithms by considering similarity in both feature and logit scores. REFD achieves 86.83% F1-score as a single system in Audio Deepfake Detection Challenge 2023 Track3, showcasing its state-of-the-art performance.
Submitted 8 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Generalized Fake Audio Detection via Deep Stable Learning
Authors:
Zhiyong Wang,
Ruibo Fu,
Zhengqi Wen,
Yuankun Xie,
Yukun Liu,
Xiaopeng Wang,
Xuefei Liu,
Yongwei Li,
Jianhua Tao,
Yi Lu,
Xin Qi,
Shuchen Shi
Abstract:
Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.
Submitted 5 June, 2024;
originally announced June 2024.
-
FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping
Authors:
Yuzhou Ji,
He Zhu,
Junshu Tang,
Wuyi Liu,
Zhizhong Zhang,
Yuan Xie,
Lizhuang Ma,
Xin Tan
Abstract:
The semantically interactive radiance field has always been an appealing task for its potential to facilitate user-friendly and automated real-world 3D scene understanding applications. However, it is a challenging task to achieve high quality, efficiency and zero-shot ability at the same time with semantics in radiance fields. In this work, we present FastLGS, an approach that supports real-time open-vocabulary query within 3D Gaussian Splatting (3DGS) under high resolution. We propose the semantic feature grid to save multi-view CLIP features which are extracted based on Segment Anything Model (SAM) masks, and map the grids to low dimensional features for semantic field training through 3DGS. Once trained, we can restore pixel-aligned CLIP embeddings through feature grids from rendered features for open-vocabulary queries. Comparisons with other state-of-the-art methods prove that FastLGS can achieve the first place performance concerning both speed and accuracy, where FastLGS is 98x faster than LERF and 4x faster than LangSplat. Meanwhile, experiments show that FastLGS is adaptive and compatible with many downstream tasks, such as 3D segmentation and 3D object inpainting, which can be easily applied to other 3D manipulation systems.
Submitted 3 June, 2024;
originally announced June 2024.
-
Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models
Authors:
Wenzhuo Tang,
Haitao Mao,
Danial Dervovic,
Ivan Brugere,
Saumitra Mishra,
Yuying Xie,
Jiliang Tang
Abstract:
Models for natural language and images benefit from data scaling behavior: the more data fed into the model, the better they perform. This 'better with more' phenomenon enables the effectiveness of large-scale pre-training on vast amounts of data. However, current graph pre-training methods struggle to scale up data due to heterogeneity across graphs. To achieve effective data scaling, we aim to develop a general model that is able to capture diverse data patterns of graphs and can be utilized to adaptively help the downstream tasks. To this end, we propose UniAug, a universal graph structure augmentor built on a diffusion model. We first pre-train a discrete diffusion model on thousands of graphs across domains to learn the graph structural patterns. In the downstream phase, we provide adaptive enhancement by conducting graph structure augmentation with the help of the pre-trained diffusion model via guided generation. By leveraging the pre-trained diffusion model for structure augmentation, we consistently achieve performance improvements across various downstream tasks in a plug-and-play manner. To the best of our knowledge, this study represents the first demonstration of a data-scaling graph structure augmentor on graphs across domains.
Submitted 3 June, 2024;
originally announced June 2024.
-
Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey
Authors:
Bowen Jiang,
Yangxinyu Xie,
Xiaomeng Wang,
Weijie J. Su,
Camillo J. Taylor,
Tanwi Mallick
Abstract:
Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. This quality is essential for effective problem-solving, as it ensures that solutions are well-founded and systematically derived. Despite the advancements of large language models (LLMs) in generating human-like text with remarkable accuracy, they present biases inherited from the training data, inconsistency across different contexts, and difficulty understanding complex scenarios involving multiple layers of context. Therefore, recent research attempts to leverage the strength of multiple agents working collaboratively with various types of data and tools for enhanced consistency and reliability. To that end, this paper aims to understand whether multi-modal and multi-agent systems are advancing toward rationality by surveying the state-of-the-art works, identifying advancements over single-agent and single-modal systems in terms of rationality, and discussing open problems and future directions. We maintain an open repository at https://github.com/bowen-upenn/MMMA_Rationality.
Submitted 18 June, 2024; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment
Authors:
Yueqin Yin,
Zhendong Wang,
Yujia Xie,
Weizhu Chen,
Mingyuan Zhou
Abstract:
Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization, and outperforms offline self-play methods like SPIN. Our code is available at https://github.com/yinyueqin/SAPO
Submitted 31 May, 2024;
originally announced May 2024.
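The SAPO abstract above pairs an Exponential Moving Average (EMA) model with a replay buffer. The sketch below shows only those two generic building blocks, assuming model parameters can be treated as a flat list of floats; it is not the SAPO training loop, and `ema_update` and `ReplayBuffer` are hypothetical names chosen for illustration.

```python
import random
from collections import deque


def ema_update(ema_params, current_params, decay=0.99):
    """Blend current parameters into the EMA copy: decay*ema + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * c
            for e, c in zip(ema_params, current_params)]


class ReplayBuffer:
    """Fixed-capacity FIFO store; the oldest items are evicted first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, item):
        self._items.append(item)

    def sample(self, k, rng=random):
        # Sample without replacement, capped at the current buffer size.
        return rng.sample(list(self._items), min(k, len(self._items)))

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=2)
    for response in ["a", "b", "c"]:
        buffer.add(response)
    print(len(buffer), ema_update([1.0, 2.0], [0.0, 0.0], decay=0.5))
```

In an off-policy setup of this kind, the slowly moving EMA copy would typically generate the stored responses, so that sampled training pairs mix recent and older model behaviour.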
-
svds-C: A Multi-Thread C Code for Computing Truncated Singular Value Decomposition
Authors:
Xu Feng,
Wenjian Yu,
Yuyang Xie
Abstract:
This article presents svds-C, an open-source and high-performance C program for accurately and robustly computing truncated SVD, e.g. computing several largest singular values and corresponding singular vectors. We have re-implemented the algorithm of svds in Matlab in C based on MKL or OpenBLAS and multi-thread computing to obtain the parallel program named svds-C. svds-C running on a shared-memory computer consumes less time and memory than svds thanks to careful implementation of multi-thread parallelization and memory management. Numerical experiments on different test cases, which are synthetically generated or taken directly from real-world datasets, show that svds-C runs remarkably faster than svds, with an average 4.7X and at most 12X speedup for 16-thread parallel computing on a computer with an Intel CPU, while preserving the same accuracy and consuming about half the memory. Experimental results also demonstrate that svds-C has similar advantages over svds on a computer with an AMD CPU, and outperforms other state-of-the-art algorithms for truncated SVD in computing time and robustness.
Submitted 29 May, 2024;
originally announced May 2024.
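To make "truncated SVD" in the svds-C abstract above concrete, the sketch below estimates only the largest singular value of a small dense matrix. Power iteration on A^T A is swapped in purely for illustration; it is not the svds algorithm that the article re-implements, and the helper names are hypothetical.

```python
import math


def matvec(a, x):
    """Multiply matrix a (a list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def largest_singular_value(a, iterations=100):
    """Estimate the largest singular value of a via power iteration on A^T A.

    Assumption: the all-ones start vector is not orthogonal to the leading
    right singular vector; otherwise the iteration can miss it.
    """
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # For a unit vector v, ||A v|| approximates the largest singular value.
    return math.sqrt(sum(x * x for x in matvec(a, v)))


if __name__ == "__main__":
    print(largest_singular_value([[3.0, 0.0], [0.0, 1.0]]))
```

Production solvers compute several singular triplets at once and converge far faster than this one-vector iteration, which is why a dedicated implementation matters for large matrices.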
-
Recent Advances of Foundation Language Models-based Continual Learning: A Survey
Authors:
Yutao Yang,
Jie Zhou,
Xuanwen Ding,
Tianyu Huai,
Shunyu Liu,
Qin Chen,
Liang He,
Yuan Xie
Abstract:
Recently, foundation language models (LMs) have marked significant achievements in the domains of natural language processing (NLP) and computer vision (CV). Unlike traditional neural network models, foundation LMs obtain a great ability for transfer learning by acquiring rich commonsense knowledge through pre-training on extensive unsupervised datasets with a vast number of parameters. However, they still cannot emulate human-like continuous learning due to catastrophic forgetting. Consequently, various continual learning (CL)-based methodologies have been developed to refine LMs, enabling them to adapt to new tasks without forgetting previous knowledge. However, a systematic taxonomy of existing approaches and a comparison of their performance are still lacking, which is the gap that our survey aims to fill. We delve into a comprehensive review, summarization, and classification of the existing literature on CL-based approaches applied to foundation language models, such as pre-trained language models (PLMs), large language models (LLMs) and vision-language models (VLMs). We divide these studies into offline CL and online CL, which consist of traditional methods, parameter-efficient-based methods, instruction tuning-based methods and continual pre-training methods. Offline CL encompasses domain-incremental learning, task-incremental learning, and class-incremental learning, while online CL is subdivided into hard task boundary and blurry task boundary settings. Additionally, we outline the typical datasets and metrics employed in CL research and provide a detailed analysis of the challenges and future work for LMs-based continual learning.
Submitted 28 May, 2024;
originally announced May 2024.
-
Kernel-based optimally weighted conformal prediction intervals
Authors:
Jonghyeok Lee,
Chen Xu,
Yao Xie
Abstract:
Conformal prediction has been a popular distribution-free framework for uncertainty quantification. In this paper, we present a novel conformal prediction method for time-series, which we call Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI). Specifically, KOWCPI adapts the classic Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. Theoretically, we tackle the challenge of establishing a conditional coverage guarantee for non-exchangeable data under strong mixing conditions on the non-conformity scores. We demonstrate the superior performance of KOWCPI on real time-series against state-of-the-art methods, where KOWCPI achieves narrower confidence intervals without losing coverage.
Submitted 27 May, 2024;
originally announced May 2024.
-
MCGMapper: Light-Weight Incremental Structure from Motion and Visual Localization With Planar Markers and Camera Groups
Authors:
Yusen Xie,
Zhenmin Huang,
Kai Chen,
Lei Zhu,
Jun Ma
Abstract:
Structure from Motion (SfM) and visual localization in indoor texture-less scenes and industrial scenarios present prevalent yet challenging research topics. Existing SfM methods designed for natural scenes typically yield low accuracy or map-building failures due to insufficient robust feature extraction in such settings. Visual markers, with their artificially designed features, can effectively address these issues. Nonetheless, existing marker-assisted SfM methods encounter problems like slow running speed and difficulties in convergence; and also, they are governed by the strong assumption of unique marker size. In this paper, we propose a novel SfM framework that utilizes planar markers and multiple cameras with known extrinsics to capture the surrounding environment and reconstruct the marker map. In our algorithm, the initial poses of markers and cameras are calculated with Perspective-n-Points (PnP) in the front-end, while bundle adjustment methods customized for markers and camera groups are designed in the back-end to optimize the 6-DOF pose directly. Our algorithm facilitates the reconstruction of large scenes with different marker sizes, and its accuracy and speed of map building are shown to surpass existing methods. Our approach is suitable for a wide range of scenarios, including laboratories, basements, warehouses, and other industrial settings. Furthermore, we incorporate representative scenarios into simulations and also supply our datasets with pose labels to address the scarcity of quantitative ground-truth datasets in this research field. The datasets and source code are available on GitHub.
Submitted 26 May, 2024;
originally announced May 2024.
-
A Classifier-Free Incremental Learning Framework for Scalable Medical Image Segmentation
Authors:
Xiaoyang Chen,
Hao Zheng,
Yifang Xie,
Yuncong Ma,
Tengfei Li
Abstract:
Current methods for developing foundation models in medical image segmentation rely on two primary assumptions: a fixed set of classes and the immediate availability of a substantial and diverse training dataset. However, this can be impractical due to the evolving nature of imaging technology and patient demographics, as well as labor-intensive data curation, limiting their practical applicability and scalability. To address these challenges, we introduce a novel segmentation paradigm enabling the segmentation of a variable number of classes within a single classifier-free network, featuring an architecture independent of class number. This network is trained using contrastive learning and produces discriminative feature representations that facilitate straightforward interpretation. Additionally, we integrate this strategy into a knowledge distillation-based incremental learning framework, facilitating the gradual assimilation of new information from non-stationary data streams while avoiding catastrophic forgetting. Our approach provides a unified solution for tackling both class- and domain-incremental learning scenarios. We demonstrate the flexibility of our method in handling varying class numbers within a unified network and its capacity for incremental learning. Experimental results on an incompletely annotated, multi-modal, multi-source dataset for medical image segmentation underscore its superiority over state-of-the-art alternative approaches.
Submitted 25 May, 2024;
originally announced May 2024.
-
Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances
Authors:
Jie Wang,
March Boedihardjo,
Yao Xie
Abstract:
Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimension before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the SDP solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
Submitted 29 May, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
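The KMS abstract above relies on the Wasserstein distance being easy to compute once data are mapped to one dimension. The sketch below covers only that final step, assuming two samples of equal size with uniform weights, in which case the optimal coupling matches sorted values. Finding the optimal nonlinear mapping, the hard part of the paper, is not attempted, and `wasserstein_1d` is a hypothetical name.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """p-Wasserstein distance between two equal-size 1-D empirical samples.

    With uniform weights and equal sizes, the optimal transport plan pairs
    the i-th smallest value of xs with the i-th smallest value of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


if __name__ == "__main__":
    print(wasserstein_1d([0.0, 1.0], [2.0, 1.0], p=1))
    print(wasserstein_1d([3.0, 3.0], [0.0, 0.0], p=2))
```

A max-sliced variant would evaluate this quantity on projected samples and maximize it over the choice of projection.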