-
Generating grid maps via the snake model
Authors:
Zhiwei Wei,
Nai Yang,
Wenjia Xu,
Su Ding
Abstract:
The grid map, often referred to as the tile map, stands as a vital tool in geospatial visualization, possessing unique attributes that differentiate it from more commonly known techniques such as choropleths and cartograms. It transforms geographic regions into grids, which requires the displacement of both region centroids and boundary nodes to establish a coherent grid arrangement. However, exis…
▽ More
The grid map, often referred to as the tile map, stands as a vital tool in geospatial visualization, possessing unique attributes that differentiate it from more commonly known techniques such as choropleths and cartograms. It transforms geographic regions into grids, which requires the displacement of both region centroids and boundary nodes to establish a coherent grid arrangement. However, existing approaches typically displace region centroids and boundary nodes separately, potentially resulting in self-intersected boundaries and compromised relative orientation relations between regions. In this paper, we introduce a novel approach that leverages the Snake displacement algorithm from cartographic generalization to concurrently displace region centroids and boundary nodes. The revised Constrained Delaunay triangulation (CDT) is employed to represent the relations between regions and serves as a structural foundation for the Snake algorithm. Forces for displacing the region centroids into a grid-like pattern are then computed. These forces are iteratively applied within the Snake model until a satisfactory new boundary is achieved. Subsequently, the grid map is created by aligning the grids with the newly generated boundary, utilizing a one-to-one match algorithm to assign each region to a specific grid. Experimental results demonstrate that the proposed approach excels in maintaining the relative orientation and global shape of regions, albeit with a potential increase in local location deviations. We also present two strategies aligned with existing approaches to generate diverse grid maps for user preferences. Further details and resources are available on our project website: https://github.com/TrentonWei/DorlingMap.git.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
A Lung Nodule Dataset with Histopathology-based Cancer Type Annotation
Authors:
Muwei Jian,
Hongyu Chen,
Zaiyong Zhang,
Nan Yang,
Haorang Zhang,
Lifu Ma,
Wenjing Xu,
Huixiang Zhi
Abstract:
Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accuratel…
▽ More
Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accurately predicting multiple cancer types. This limitation can be attributed to the scarcity of publicly available datasets annotated with expert-level cancer type information. This research aims to bridge this gap by providing publicly accessible datasets and reliable tools for medical diagnosis, facilitating a finer categorization of different types of lung diseases so as to offer precise treatment recommendations. To achieve this objective, we curated a diverse dataset of lung Computed Tomography (CT) images, comprising 330 annotated nodules (nodules are labeled as bounding boxes) from 95 distinct patients. The quality of the dataset was evaluated using a variety of classical classification and detection models, and these promising results demonstrate that the dataset has a feasible application and further facilitate intelligent auxiliary diagnosis.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
EarDA: Towards Accurate and Data-Efficient Earable Activity Sensing
Authors:
Shengzhe Lyu,
Yongliang Chen,
Di Duan,
Renqi Jia,
Weitao Xu
Abstract:
In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from e…
▽ More
In the realm of smart sensing with the Internet of Things, earable devices are empowered with the capability of multi-modality sensing and intelligence of context-aware computing, leading to its wide usage in Human Activity Recognition (HAR). Nonetheless, unlike the movements captured by Inertial Measurement Unit (IMU) sensors placed on the upper or lower body, those motion signals obtained from earable devices show significant changes in amplitudes and patterns, especially in the presence of dynamic and unpredictable head movements, posing a significant challenge for activity classification. In this work, we present EarDA, an adversarial-based domain adaptation system to extract the domain-independent features across different sensor locations. Moreover, while most deep learning methods commonly rely on training with substantial amounts of labeled data to offer good accuracy, the proposed scheme can release the potential usage of publicly available smartphone-based IMU datasets. Furthermore, we explore the feasibility of applying a filter-based data processing method to mitigate the impact of head movement. EarDA, the proposed system, enables more data-efficient and accurate activity sensing. It achieves an accuracy of 88.8% under HAR task, demonstrating a significant 43% improvement over methods without domain adaptation. This clearly showcases its effectiveness in mitigating domain gaps.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Authors:
Yuhang Ma,
Wenting Xu,
Jiji Tang,
Qinfeng Jin,
Rongsheng Zhang,
Zeng Zhao,
Changjie Fan,
Zhipeng Hu
Abstract:
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. The…
▽ More
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Authors:
Deng Cai,
Huayang Li,
Tingchen Fu,
Siheng Li,
Weiwen Xu,
Shuaiyi Li,
Bowen Cao,
Zhisong Zhang,
Xinting Huang,
Leyang Cui,
Yan Wang,
Lemao Liu,
Taro Watanabe,
Shuming Shi
Abstract:
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation…
▽ More
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation directions, each of which facilitates a variety of applications. Our work offers a holistic view that unifies numerous existing studies and suggests potential research directions. We envision our work as a useful roadmap for future research on LLMs.
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Towards unlocking the mystery of adversarial fragility of neural networks
Authors:
Jingchao Gao,
Raghu Mudumbai,
Xiaodong Wu,
Jirong Yi,
Catherine Xu,
Hui Xie,
Weiyu Xu
Abstract:
In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural ne…
▽ More
In this paper, we study the adversarial robustness of deep neural networks for classification tasks. We look at the smallest magnitude of possible additive perturbations that can change the output of a classification algorithm. We provide a matrix-theoretic explanation of the adversarial fragility of deep neural network for classification. In particular, our theoretical results show that neural network's adversarial robustness can degrade as the input dimension $d$ increases. Analytically we show that neural networks' adversarial robustness can be only $1/\sqrt{d}$ of the best possible adversarial robustness. Our matrix-theoretic explanation is consistent with an earlier information-theoretic feature-compression-based explanation for the adversarial fragility of neural networks.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Design and Control of a Low-cost Non-backdrivable End-effector Upper Limb Rehabilitation Device
Authors:
Fulan Li,
Yunfei Guo,
Wenda Xu,
Weide Zhang,
Fangyun Zhao,
Baiyu Wang,
Huaguang Du,
Chengkun Zhang
Abstract:
This paper presents the development of an upper limb end-effector based rehabilitation device for stroke patients, offering assistance or resistance along any 2-dimensional trajectory during physical therapy. It employs a non-backdrivable ball-screw-driven mechanism for enhanced control accuracy. The control system features three novel algorithms: First, the Implicit Euler velocity control algorit…
▽ More
This paper presents the development of an upper limb end-effector based rehabilitation device for stroke patients, offering assistance or resistance along any 2-dimensional trajectory during physical therapy. It employs a non-backdrivable ball-screw-driven mechanism for enhanced control accuracy. The control system features three novel algorithms: First, the Implicit Euler velocity control algorithm (IEVC) highlighted for its state-of-the-art accuracy, stability, efficiency and generalizability in motion restriction control. Second, an Admittance Virtual Dynamics simulation algorithm that achieves a smooth and natural human interaction with the non-backdrivable end-effector. Third, a generalized impedance force calculation algorithm allowing efficient impedance control on any trajectory or area boundary. Experimental validation demonstrated the system's effectiveness in accurate end-effector position control across various trajectories and configurations. The proposed upper limb end-effector-based rehabilitation device, with its high performance and adaptability, holds significant promise for extensive clinical application, potentially improving rehabilitation outcomes for stroke patients.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking
Authors:
Can Jin,
Hongwu Peng,
Shiyu Zhao,
Zhenting Wang,
Wujiang Xu,
Ligong Han,
Jiahui Zhao,
Kai Zhong,
Sanguthevar Rajasekaran,
Dimitris N. Metaxas
Abstract:
Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly…
▽ More
Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs. Code is available at https://github.com/jincan333/APEER.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark
Authors:
Tao Han,
Song Guo,
Zhenghao Chen,
Wanghan Xu,
Lei Bai
Abstract:
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from signific…
▽ More
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models
Authors:
Xinrui Zhou,
Yuhao Huang,
Wufeng Xue,
Haoran Dou,
Jun Cheng,
Han Zhou,
Dong Ni
Abstract:
Echocardiography (ECHO) video is widely used for cardiac examination. In clinical, this procedure heavily relies on operator experience, which needs years of training and maybe the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, it is challenging since acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development…
▽ More
Echocardiography (ECHO) video is widely used for cardiac examination. In clinical, this procedure heavily relies on operator experience, which needs years of training and maybe the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, it is challenging since acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlight is three-fold. First, HeartBeat serves as a unified framework that enables perceiving multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies separately provided fine- and coarse-grained controls in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple the visual concepts and temporal dynamics learning using a two-stage training scheme for simplifying the model training. One more interesting thing is that HeartBeat can easily generalize to mask-guided cardiac MRI synthesis in a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Authors:
Wenda Xu,
Jiachen Li,
William Yang Wang,
Lei Li
Abstract:
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of onli…
▽ More
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment.
We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human reference text.
△ Less
Submitted 19 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Feasibility of Federated Learning from Client Databases with Different Brain Diseases and MRI Modalities
Authors:
Felix Wagner,
Wentian Xu,
Pramit Saha,
Ziyun Liang,
Daniel Whitehouse,
David Menon,
Natalie Voets,
J. Alison Noble,
Konstantinos Kamnitsas
Abstract:
Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous database…
▽ More
Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous databases that may contain scans and segmentation labels for different types of brain pathologies and diverse sets of MRI modalities. Is it feasible to use Federated Learning (FL) for training a single model on client databases that contain scans and labels of different brain pathologies and diverse sets of MRI modalities? We demonstrate promising results by combining appropriate, simple, and practical modifications to the model and training strategy: Designing a model with input channels that cover the whole set of modalities available across clients, training with random modality drop, and exploring the effects of feature normalization methods. Evaluation on 7 brain MRI databases with 5 different diseases shows that such FL framework can train a single model that is shown to be very promising in segmenting all disease types seen during training. Importantly, it is able to segment these diseases in new databases that contain sets of modalities different from those in training clients. These results demonstrate, for the first time, feasibility and effectiveness of using FL to train a single segmentation model on decentralised data with diverse brain diseases and MRI modalities, a necessary step towards leveraging heterogeneous real-world databases. Code will be made available at: https://github.com/FelixWag/FL-MultiDisease-MRI
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
ADSNet: Cross-Domain LTV Prediction with an Adaptive Siamese Network in Advertising
Authors:
Ruize Wang,
Hui Xu,
Ying Cheng,
Qi He,
Xing Zhou,
Rui Feng,
Wei Xu,
Lei Huang,
Jie Jiang
Abstract:
Advertising platforms have evolved in estimating Lifetime Value (LTV) to better align with advertisers' true performance metric. However, the sparsity of real-world LTV data presents a significant challenge to LTV predictive model(i.e., pLTV), severely limiting the their capabilities. Therefore, we propose to utilize external data, in addition to the internal data of advertising platform, to expan…
▽ More
Advertising platforms have evolved in estimating Lifetime Value (LTV) to better align with advertisers' true performance metric. However, the sparsity of real-world LTV data presents a significant challenge to LTV predictive model(i.e., pLTV), severely limiting the their capabilities. Therefore, we propose to utilize external data, in addition to the internal data of advertising platform, to expand the size of purchase samples and enhance the LTV prediction model of the advertising platform. To tackle the issue of data distribution shift between internal and external platforms, we introduce an Adaptive Difference Siamese Network (ADSNet), which employs cross-domain transfer learning to prevent negative transfer. Specifically, ADSNet is designed to learn information that is beneficial to the target domain. We introduce a gain evaluation strategy to calculate information gain, aiding the model in learning helpful information for the target domain and providing the ability to reject noisy samples, thus avoiding negative transfer. Additionally, we also design a Domain Adaptation Module as a bridge to connect different domains, reduce the distribution distance between them, and enhance the consistency of representation space distribution. We conduct extensive offline experiments and online A/B tests on a real advertising platform. Our proposed ADSNet method outperforms other methods, improving GINI by 2$\%$. The ablation study highlights the importance of the gain evaluation strategy in negative gain sample rejection and improving model performance. Additionally, ADSNet significantly improves long-tail prediction. The online A/B tests confirm ADSNet's efficacy, increasing online LTV by 3.47$\%$ and GMV by 3.89$\%$.
△ Less
Submitted 15 June, 2024;
originally announced June 2024.
-
CUDRT: Benchmarking the Detection of Human vs. Large Language Models Generated Texts
Authors:
Zhen Tao,
Zhiyu Li,
Dinghao Xi,
Wei Xu
Abstract:
The proliferation of large language models (LLMs) has significantly enhanced text generation capabilities across various industries. However, these models' ability to generate human-like text poses substantial challenges in discerning between human and AI authorship. Despite the effectiveness of existing AI-generated text detectors, their development is hindered by the lack of comprehensive, publi…
▽ More
The proliferation of large language models (LLMs) has significantly enhanced text generation capabilities across various industries. However, these models' ability to generate human-like text poses substantial challenges in discerning between human and AI authorship. Despite the effectiveness of existing AI-generated text detectors, their development is hindered by the lack of comprehensive, publicly available benchmarks. Current benchmarks are limited to specific scenarios, such as question answering and text polishing, and predominantly focus on English texts, failing to capture the diverse applications and linguistic nuances of LLMs. To address these limitations, this paper constructs a comprehensive bilingual benchmark in both Chinese and English to evaluate mainstream AI-generated text detectors. We categorize LLM text generation into five distinct operations: Create, Update, Delete, Rewrite, and Translate (CUDRT), encompassing all current LLMs activities. We also establish a robust benchmark evaluation framework to support scalable and reproducible experiments. For each CUDRT category, we have developed extensive datasets to thoroughly assess detector performance. By employing the latest mainstream LLMs specific to each language, our datasets provide a thorough evaluation environment. Extensive experimental results offer critical insights for optimizing AI-generated text detectors and suggest future research directions to improve detection accuracy and generalizability across various scenarios.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
Authors:
Xiaoshuai Song,
Muxi Diao,
Guanting Dong,
Zhengyang Wang,
Yujia Fu,
Runqi Qiao,
Zhexu Wang,
Dayuan Fu,
Huangxuan Wu,
Bin Liang,
Weihao Zeng,
Yejie Wang,
Zhuoma GongQue,
Jianing Yu,
Qiuna Tan,
Weiran Xu
Abstract:
Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer scie…
▽ More
Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scales. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvements, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs' capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performances in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues in assessing LLMs' diverse reasoning capabilities. The CS-Bench data and evaluation code are available at https://github.com/csbench/csbench.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
M-LRM: Multi-view Large Reconstruction Model
Authors:
Mengfei Li,
Xiaoxiao Long,
Yixun Liang,
Weiyu Li,
Yuan Liu,
Peng Li,
Xiaowei Chi,
Xingqun Qi,
Wei Xue,
Wenhan Luo,
Qifeng Liu,
Yike Guo
Abstract:
Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, as well as slower convergence speed than expected.
It is attributed to that, LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the…
▽ More
Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, as well as slower convergence speed than expected.
It is attributed to that, LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to efficiently reconstruct high-quality 3D shapes from multi-views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the tri-plane tokens. Compared to LRM, the proposed M-LRM can produce a tri-plane NeRF with $128 \times 128$ resolution and generate 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence than LRM. Project page: https://murphylmf.github.io/M-LRM/
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Erasing Radio Frequency Fingerprints via Active Adversarial Perturbation
Authors:
Zhaoyi Lu,
Wenchao Xu,
Ming Tu,
Xin Xie,
Cunqing Hua,
Nan Cheng
Abstract:
Radio Frequency (RF) fingerprinting is to identify a wireless device from its uniqueness of the analog circuitry or hardware imperfections. However, unlike the MAC address which can be modified, such hardware feature is inevitable for the signal emitted to air, which can possibly reveal device whereabouts, e.g., a sniffer can use a pre-trained model to identify a nearby device when receiving its s…
▽ More
Radio Frequency (RF) fingerprinting is to identify a wireless device from its uniqueness of the analog circuitry or hardware imperfections. However, unlike the MAC address which can be modified, such hardware feature is inevitable for the signal emitted to air, which can possibly reveal device whereabouts, e.g., a sniffer can use a pre-trained model to identify a nearby device when receiving its signal. Such fingerprint may expose critical private information, e.g., the associated upper-layer applications or the end-user. In this paper, we propose to erase such RF feature for wireless devices, which can prevent fingerprinting by actively perturbation from the signal perspective. Specifically, we consider a common RF fingerprinting scenario, where machine learning models are trained from pilot signal data for identification. A novel adversarial attack solution is designed to generate proper perturbations, whereby the perturbed pilot signal can hide the hardware feature and misclassify the model. We theoretically show that the perturbation would not affect the communication function within a tolerable perturbation threshold. We also implement the pilot signal fingerprinting and the proposed perturbation process in a practical LTE system. Extensive experiment results demonstrate that the RF fingerprints can be effectively erased to protect the user privacy.
△ Less
Submitted 12 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
TernaryLLM: Ternarized Large Language Model
Authors:
Tianqi Chen,
Zhe Li,
Weixiang Xu,
Zeyu Zhu,
Dong Li,
Lu Tian,
Emad Barsoum,
Peisong Wang,
Jian Cheng
Abstract:
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming fr…
▽ More
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents
Authors:
Wenjia Xu,
Zijian Yu,
Yixu Wang,
Jiuniu Wang,
Mugen Peng
Abstract:
An increasing number of models have achieved great performance in remote sensing tasks with the recent development of Large Language Models (LLMs) and Visual Language Models (VLMs). However, these models are constrained to basic vision and language instruction-tuning tasks, facing challenges in complex remote sensing applications. Additionally, these models lack specialized expertise in profession…
▽ More
An increasing number of models have achieved great performance in remote sensing tasks with the recent development of Large Language Models (LLMs) and Visual Language Models (VLMs). However, these models are constrained to basic vision and language instruction-tuning tasks, facing challenges in complex remote sensing applications. Additionally, these models lack specialized expertise in professional domains. To address these limitations, we propose a LLM-driven remote sensing intelligent agent named RS-Agent. Firstly, RS-Agent is powered by a large language model (LLM) that acts as its "Central Controller," enabling it to understand and respond to various problems intelligently. Secondly, our RS-Agent integrates many high-performance remote sensing image processing tools, facilitating multi-tool and multi-turn conversations. Thirdly, our RS-Agent can answer professional questions by leveraging robust knowledge documents. We conducted experiments using several datasets, e.g., RSSDIVCS, RSVQA, and DOTAv1. The experimental results demonstrate that our RS-Agent delivers outstanding performance in many tasks, i.e., scene classification, visual question answering, and object counting tasks.
△ Less
Submitted 11 June, 2024;
originally announced June 2024.
-
Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator
Authors:
Wenhao Xu,
Xuefeng Gao,
Xuedong He
Abstract:
Risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control. In this paper, we study online adaptive control of risk-sensitive linear quadratic regulator in the finite horizon episodic setting. We propose a simple least-squares greedy algorithm and show that it achieves $\widetilde{\mathcal{O}}(\log N)$ regret under a specific identifiability…
▽ More
Risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control. In this paper, we study online adaptive control of risk-sensitive linear quadratic regulator in the finite horizon episodic setting. We propose a simple least-squares greedy algorithm and show that it achieves $\widetilde{\mathcal{O}}(\log N)$ regret under a specific identifiability assumption, where $N$ is the total number of episodes. If the identifiability assumption is not satisfied, we propose incorporating exploration noise into the least-squares-based algorithm, resulting in an algorithm with $\widetilde{\mathcal{O}}(\sqrt{N})$ regret. To our best knowledge, this is the first set of regret bounds for episodic risk-sensitive linear quadratic regulator. Our proof relies on perturbation analysis of less-standard Riccati equations for risk-sensitive linear quadratic control, and a delicate analysis of the loss in the risk-sensitive performance criterion due to applying the suboptimal controller in the online learning process.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
MoralBench: Moral Evaluation of LLMs
Authors:
Jianchao Ji,
Yutong Chen,
Mingyu Jin,
Wujiang Xu,
Wenyue Hua,
Yongfeng Zhang
Abstract:
In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critica…
▽ More
In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities.
The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which accounts for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7 and also open-source the code of the project at https://github.com/agiresearch/MoralBench.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Authors:
Zeyue Tian,
Zhaoyang Liu,
Ruibin Yuan,
Jiahao Pan,
Xiaoqiang Huang,
Qifeng Liu,
Xu Tan,
Qifeng Chen,
Wei Xue,
Yike Guo
Abstract:
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music t…
▽ More
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
MSE-Based Training and Transmission Optimization for MIMO ISAC Systems
Authors:
Zhenyao He,
Wei Xu,
Hong Shen,
Yonina C. Eldar,
Xiaohu You
Abstract:
In this paper, we investigate a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system under typical block-fading channels. As a non-trivial extension to most existing works on ISAC, both the training and transmission signals sent by the ISAC transmitter are exploited for sensing. Specifically, we develop two training and transmission design schemes to minimize a…
▽ More
In this paper, we investigate a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system under typical block-fading channels. As a non-trivial extension to most existing works on ISAC, both the training and transmission signals sent by the ISAC transmitter are exploited for sensing. Specifically, we develop two training and transmission design schemes to minimize a weighted sum of the mean-squared errors (MSEs) of data transmission and radar target response matrix (TRM) estimation. For the former, we first optimize the training signal for simultaneous communication channel and radar TRM estimation. Then, based on the estimated instantaneous channel state information (CSI), we propose an efficient majorization-minimization (MM)-based robust ISAC transmission design, where a semi-closed form solution is obtained in each iteration. For the second scheme, the ISAC transmitter is assumed to have statistical CSI only for reducing the feedback overhead. With CSI statistics available, we integrate the training and transmission design into one single problem and propose an MM-based alternating algorithm to find a high-quality solution. In addition, we provide alternative structured and low-complexity solutions for both schemes under certain special cases. Finally, simulation results demonstrate that the radar performance is significantly improved compared to the existing scheme that integrates sensing into the transmission stage only. Moreover, it is verified that the investigated two schemes have advantages in terms of communication and sensing performances, respectively.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation
Authors:
Fanyou Wu,
Weijie Xu,
Chandan K. Reddy,
Srinivasan H. Sengamedu
Abstract:
In this study, we tackle the challenge of inadequate and costly training data that has hindered the development of conversational question answering (ConvQA) systems. Enterprises have a large corpus of diverse internal documents. Instead of relying on a searching engine, a more compelling approach for people to comprehend these documents is to create a dialogue system. In this paper, we propose a…
▽ More
In this study, we tackle the challenge of inadequate and costly training data that has hindered the development of conversational question answering (ConvQA) systems. Enterprises have a large corpus of diverse internal documents. Instead of relying on a searching engine, a more compelling approach for people to comprehend these documents is to create a dialogue system. In this paper, we propose a robust dialog synthesising method. We learn the segmentation of data for the dialog task instead of using segmenting at sentence boundaries. The synthetic dataset generated by our proposed method achieves superior quality when compared to WikiDialog, as assessed through machine and human evaluations. By employing our inpainted data for ConvQA retrieval system pre-training, we observed a notable improvement in performance across OR-QuAC benchmarks.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
Authors:
Qiang Chen,
Xiangbo Su,
Xinyu Zhang,
Jian Wang,
Jiahui Chen,
Yunpeng Shen,
Chuchu Han,
Ziliang Chen,
Weixiang Xu,
Fanrong Li,
Shan Zhang,
Kun Yao,
Errui Ding,
Gang Zhang,
Jingdong Wang
Abstract:
In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for r…
▽ More
In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (https://github.com/Atten4Vis/LW-DETR).
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
A deep-learning-based MAC for integrating channel access, rate adaptation and channel switch
Authors:
Jiantao Xin,
Wei Xu,
Bin Cao,
Taotao Wang,
Shengli Zhang
Abstract:
With increasing density and heterogeneity in unlicensed wireless networks, traditional MAC protocols, such as carrier-sense multiple access with collision avoidance (CSMA/CA) in Wi-Fi networks, are experiencing performance degradation. This is manifested in increased collisions and extended backoff times, leading to diminished spectrum efficiency and protocol coordination. Addressing these issues,…
▽ More
With increasing density and heterogeneity in unlicensed wireless networks, traditional MAC protocols, such as carrier-sense multiple access with collision avoidance (CSMA/CA) in Wi-Fi networks, are experiencing performance degradation. This is manifested in increased collisions and extended backoff times, leading to diminished spectrum efficiency and protocol coordination. Addressing these issues, this paper proposes a deep-learning-based MAC paradigm, dubbed DL-MAC, which leverages spectrum sensing data readily available from energy detection modules in wireless devices to achieve the MAC functionalities of channel access, rate adaptation and channel switch. First, we utilize DL-MAC to realize a joint design of channel access and rate adaptation. Subsequently, we integrate the capability of channel switch into DL-MAC, enhancing its functionality from single-channel to multi-channel operation. Specifically, the DL-MAC protocol incorporates a deep neural network (DNN) for channel selection and a recurrent neural network (RNN) for the joint design of channel access and rate adaptation. We conducted real-world data collection within the 2.4 GHz frequency band to validate the effectiveness of DL-MAC, and our experiments reveal that DL-MAC exhibits superior performance over traditional algorithms in both single and multi-channel environments and also outperforms single-function approaches in terms of overall performance. Additionally, the performance of DL-MAC remains robust, unaffected by channel switch overhead within the evaluated range.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Efficient Behavior Tree Planning with Commonsense Pruning and Heuristic
Authors:
Xinglin Chen,
Yishuai Cai,
Yunxin Mao,
Minglong Li,
Zhou Yang,
Wen Shanghua,
Wenjing Yang,
Weixia Xu,
Ji Wang
Abstract:
Behavior Tree (BT) planning is crucial for autonomous robot behavior control, yet its application in complex scenarios is hampered by long planning times. Pruning and heuristics are common techniques to accelerate planning, but it is difficult to design general pruning strategies and heuristic functions for BT planning problems. This paper proposes improving BT planning efficiency for everyday ser…
▽ More
Behavior Tree (BT) planning is crucial for autonomous robot behavior control, yet its application in complex scenarios is hampered by long planning times. Pruning and heuristics are common techniques to accelerate planning, but it is difficult to design general pruning strategies and heuristic functions for BT planning problems. This paper proposes improving BT planning efficiency for everyday service robots leveraging commonsense reasoning provided by Large Language Models (LLMs), leading to model-free pre-planning action space pruning and heuristic generation. This approach takes advantage of the modularity and interpretability of BT nodes, represented by predicate logic, to enable LLMs to predict the task-relevant action predicates and objects, and even the optimal path, without an explicit action model. We propose the Heuristic Optimal Behavior Tree Expansion Algorithm (HOBTEA) with two heuristic variants and provide a formal comparison and discussion of their efficiency and optimality. We introduce a learnable and transferable commonsense library to enhance the LLM's reasoning performance without fine-tuning. The action space expansion based on the commonsense library can further increase the success rate of planning. Experiments show the theoretical bounds of commonsense pruning and heuristic, and demonstrate the actual performance of LLM learning and reasoning with the commonsense library. Results in four datasets showcase the practical effectiveness of our approach in everyday service robot applications.
△ Less
Submitted 3 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning
Authors:
Yuwei Fu,
Haichao Zhang,
Di Wu,
Wei Xu,
Benoit Boulet
Abstract:
In this work, we investigate how to leverage pre-trained visual-language models (VLM) for online Reinforcement Learning (RL). In particular, we focus on sparse reward tasks with pre-defined textual task descriptions. We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks. To address this issue, we introduce a lightweight fine-tuning method, named Fuzzy VLM r…
▽ More
In this work, we investigate how to leverage pre-trained visual-language models (VLM) for online Reinforcement Learning (RL). In particular, we focus on sparse reward tasks with pre-defined textual task descriptions. We first identify the problem of reward misalignment when applying VLM as a reward in RL tasks. To address this issue, we introduce a lightweight fine-tuning method, named Fuzzy VLM reward-aided RL (FuRL), based on reward alignment and relay RL. Specifically, we enhance the performance of SAC/DrQ baseline agents on sparse reward tasks by fine-tuning VLM representations and using relay RL to avoid local minima. Extensive experiments on the Meta-world benchmark tasks demonstrate the efficacy of the proposed method. Code is available at: https://github.com/fuyw/FuRL.
△ Less
Submitted 4 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Preemptive Answer "Attacks" on Chain-of-Thought Reasoning
Authors:
Rongwu Xu,
Zehan Qi,
Wei Xu
Abstract:
Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought (CoT) prompting. However, the robustness of this approach warrants further investigation. In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning. This situation can arise inadvertently or induced by malicious users…
▽ More
Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought (CoT) prompting. However, the robustness of this approach warrants further investigation. In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning. This situation can arise inadvertently or induced by malicious users by prompt injection attacks. Experiments reveal that preemptive answers significantly impair the model's reasoning capability across various CoT methods and a broad spectrum of datasets. To bolster the robustness of reasoning, we propose two measures aimed at mitigating this issue to some extent.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning
Authors:
Xuxin Cheng,
Wanshi Xu,
Zhihong Zhu,
Hongxiang Li,
Yuexian Zou
Abstract:
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot filling. Although there are some SLU frameworks joint modeling the two subtasks and achieving high performance, most of them still overlook the inhere…
▽ More
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot filling. Although there are some SLU frameworks joint modeling the two subtasks and achieving high performance, most of them still overlook the inherent relationships between intents and slots and fail to achieve mutual guidance between the two subtasks. To solve the problem, we propose a multi-level multi-grained SLU framework MMCL to apply contrastive learning at three levels, including utterance level, slot level, and word level to enable intent and slot to mutually guide each other. For the utterance level, our framework implements coarse granularity contrastive learning and fine granularity contrastive learning simultaneously. Besides, we also apply the self-distillation method to improve the robustness of the model. Experimental results and further analysis demonstrate that our proposed model achieves new state-of-the-art results on two public multi-intent SLU datasets, obtaining a 2.6 overall accuracy improvement on the MixATIS dataset compared to previous best models.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Information Maximization via Variational Autoencoders for Cross-Domain Recommendation
Authors:
Xuying Ning,
Wujiang Xu,
Xiaolei Liu,
Mingming Ha,
Qiongxu Ma,
Youru Li,
Linxun Chen,
Yongfeng Zhang
Abstract:
Cross-Domain Sequential Recommendation (CDSR) methods aim to address the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR methods typically rely on overlapping users, designing complex cross-domain modules to capture users' latent interests that can propagate across different domains. However, their propagated informative information is…
▽ More
Cross-Domain Sequential Recommendation (CDSR) methods aim to address the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR methods typically rely on overlapping users, designing complex cross-domain modules to capture users' latent interests that can propagate across different domains. However, their propagated informative information is limited to the overlapping users and the users who have rich historical behavior records. As a result, these methods often underperform in real-world scenarios, where most users are non-overlapping (cold-start) and long-tailed. In this research, we introduce a new CDSR framework named Information Maximization Variational Autoencoder (\textbf{\texttt{IM-VAE}}). Here, we suggest using a Pseudo-Sequence Generator to enhance the user's interaction history input for downstream fine-grained CDSR models to alleviate the cold-start issues. We also propose a Generative Recommendation Framework combined with three regularizers inspired by the mutual information maximization (MIM) theory \cite{mcgill1954multivariate} to capture the semantic differences between a user's interests shared across domains and those specific to certain domains, as well as address the informational gap between a user's actual interaction sequences and the pseudo-sequences generated. To the best of our knowledge, this paper is the first CDSR work that considers the information disentanglement and denoising of pseudo-sequences in the open-world recommendation scenario. Empirical experiments illustrate that \texttt{IM-VAE} outperforms the state-of-the-art approaches on two real-world cross-domain datasets on all sorts of users, including cold-start and tailed users, demonstrating the effectiveness of \texttt{IM-VAE} in open-world recommendation.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
LLMs Meet Multimodal Generation and Editing: A Survey
Authors:
Yingqing He,
Zhaoyang Liu,
Jingye Chen,
Zeyue Tian,
Hongyu Liu,
Xiaowei Chi,
Runtao Liu,
Ruibin Yuan,
Yazhou Xing,
Wenhai Wang,
Jifeng Dai,
Yong Zhang,
Wei Xue,
Qifeng Liu,
Yike Guo,
Qifeng Chen
Abstract:
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable a…
▽ More
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
△ Less
Submitted 9 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Feasibility and benefits of joint learning from MRI databases with different brain diseases and modalities for segmentation
Authors:
Wentian Xu,
Matthew Moffat,
Thalia Seale,
Ziyun Liang,
Felix Wagner,
Daniel Whitehouse,
David Menon,
Virginia Newcombe,
Natalie Voets,
Abhirup Banerjee,
Konstantinos Kamnitsas
Abstract:
Models for segmentation of brain lesions in multi-modal MRI are commonly trained for a specific pathology using a single database with a predefined set of MRI modalities, determined by a protocol for the specific disease. This work explores the following open questions: Is it feasible to train a model using multiple databases that contain varying sets of MRI modalities and annotations for differen…
▽ More
Models for segmentation of brain lesions in multi-modal MRI are commonly trained for a specific pathology using a single database with a predefined set of MRI modalities, determined by a protocol for the specific disease. This work explores the following open questions: Is it feasible to train a model using multiple databases that contain varying sets of MRI modalities and annotations for different brain pathologies? Will this joint learning benefit performance on the sets of modalities and pathologies available during training? Will it enable analysis of new databases with different sets of modalities and pathologies? We develop and compare different methods and show that promising results can be achieved with appropriate, simple and practical alterations to the model and training framework. We experiment with 7 databases containing 5 types of brain pathologies and different sets of MRI modalities. Results demonstrate, for the first time, that joint training on multi-modal MRI databases with different brain pathologies and sets of modalities is feasible and offers practical benefits. It enables a single model to segment pathologies encountered during training in diverse sets of modalities, while facilitating segmentation of new types of pathologies such as via follow-up fine-tuning. The insights this study provides into the potential and limitations of this paradigm should prove useful for guiding future advances in the direction. Code and pretrained models: https://github.com/WenTXuL/MultiUnet
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
SLMRec: Empowering Small Language Models for Sequential Recommendation
Authors:
Wujiang Xu,
Zujie Liang,
Jiaojiao Han,
Xuying Ning,
Wenfang Lin,
Linxun Chen,
Feng Wei,
Yongfeng Zhang
Abstract:
The sequential Recommendation (SR) task involves predicting the next item a user is likely to interact with, given their past interactions. The SR models examine the sequence of a user's actions to discern more complex behavioral patterns and temporal dynamics. Recent research demonstrates the great impact of LLMs on sequential recommendation systems, either viewing sequential recommendation as la…
▽ More
The sequential Recommendation (SR) task involves predicting the next item a user is likely to interact with, given their past interactions. The SR models examine the sequence of a user's actions to discern more complex behavioral patterns and temporal dynamics. Recent research demonstrates the great impact of LLMs on sequential recommendation systems, either viewing sequential recommendation as language modeling or serving as the backbone for user representation. Although these methods deliver outstanding performance, there is scant evidence of the necessity of a large language model and how large the language model is needed, especially in the sequential recommendation scene. Meanwhile, due to the huge size of LLMs, it is inefficient and impractical to apply a LLM-based model in real-world platforms that often need to process billions of traffic logs daily. In this paper, we explore the influence of LLMs' depth by conducting extensive experiments on large-scale industry datasets. Surprisingly, we discover that most intermediate layers of LLMs are redundant. Motivated by this insight, we empower small language models for SR, namely SLMRec, which adopt a simple yet effective knowledge distillation method. Moreover, SLMRec is orthogonal to other post-training efficiency techniques, such as quantization and pruning, so that they can be leveraged in combination. Comprehensive experimental results illustrate that the proposed SLMRec model attains the best performance using only 13% of the parameters found in LLM-based recommendation models, while simultaneously achieving up to 6.6x and 8.0x speedups in training and inference time costs, respectively.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Wireless Federated Learning over Resource-Constrained Networks: Digital versus Analog Transmissions
Authors:
Jiacheng Yao,
Wei Xu,
Zhaohui Yang,
Xiaohu You,
Mehdi Bennis,
H. Vincent Poor
Abstract:
To enable wireless federated learning (FL) in communication resource-constrained networks, two communication schemes, i.e., digital and analog ones, are effective solutions. In this paper, we quantitatively compare these two techniques, highlighting their essential differences as well as respectively suitable scenarios. We first examine both digital and analog transmission schemes, together with a…
▽ More
To enable wireless federated learning (FL) in communication resource-constrained networks, two communication schemes, i.e., digital and analog ones, are effective solutions. In this paper, we quantitatively compare these two techniques, highlighting their essential differences as well as respectively suitable scenarios. We first examine both digital and analog transmission schemes, together with a unified and fair comparison framework under imbalanced device sampling, strict latency targets, and transmit power constraints. A universal convergence analysis under various imperfections is established for evaluating the performance of FL over wireless networks. These analytical results reveal that the fundamental difference between the digital and analog communications lies in whether communication and computation are jointly designed or not. The digital scheme decouples the communication design from FL computing tasks, making it difficult to support uplink transmission from massive devices with limited bandwidth and hence the performance is mainly communication-limited. In contrast, the analog communication allows over-the-air computation (AirComp) and achieves better spectrum utilization. However, the computation-oriented analog transmission reduces power efficiency, and its performance is sensitive to computation errors from imperfect channel state information (CSI). Furthermore, device sampling for both schemes are optimized and differences in sampling optimization are analyzed. Numerical results verify the theoretical analysis and affirm the superior performance of the sampling optimization.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Joint MIMO Transceiver and Reflector Design for Reconfigurable Intelligent Surface-Assisted Communication
Authors:
Yaqiong Zhao,
Jindan Xu,
Wei Xu,
Kezhi Wang,
Xinquan Ye,
Chau Yuen,
Xiaohu You
Abstract:
In this paper, we consider a reconfigurable intelligent surface (RIS)-assisted multiple-input multiple-output communication system with multiple antennas at both the base station (BS) and the user. We plan to maximize the achievable rate through jointly optimizing the transmit precoding matrix, the receive combining matrix, and the RIS reflection matrix under the constraints of the transmit power…
▽ More
In this paper, we consider a reconfigurable intelligent surface (RIS)-assisted multiple-input multiple-output communication system with multiple antennas at both the base station (BS) and the user. We plan to maximize the achievable rate through jointly optimizing the transmit precoding matrix, the receive combining matrix, and the RIS reflection matrix under the constraints of the transmit power at the BS and the unit-modulus reflection at the RIS. Regarding the non-trivial problem form, we initially reformulate it into an considerable problem to make it tractable by utilizing the relationship between the achievable rate and the weighted minimum mean squared error. Next, the transmit precoding matrix, the receive combining matrix, and the RIS reflection matrix are alternately optimized. In particular, the optimal transmit precoding matrix and receive combining matrix are obtained in closed forms. Furthermore, a pair of computationally efficient methods are proposed for the RIS reflection matrix, namely the semi-definite relaxation (SDR) method and the successive closed form (SCF) method. We theoretically prove that both methods are ensured to converge, and the SCF-based algorithm is able to converges to a Karush-Kuhn-Tucker point of the problem.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild
Authors:
Xingqun Qi,
Hengyuan Zhang,
Yatian Wang,
Jiahao Pan,
Chen Liu,
Peng Li,
Xiaowei Chi,
Mengfei Li,
Qixun Zhang,
Wei Xue,
Shanghang Zhang,
Wenhan Luo,
Qifeng Liu,
Yike Guo
Abstract:
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upo…
▽ More
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upon the custom-designed pretrain-fintune training paradigm. At the pretraining stage, we aim to formulate a large generalizable gesture diffusion model by learning the abundant postures manifold. Therefore, to alleviate the scarcity of 3D data, we first construct a large-scale co-speech 3D gesture dataset containing more than 40M meshed posture instances across 4.3K speakers, dubbed GES-X. Then, we scale up the large unconditional diffusion model to 1B parameters and pre-train it to be our gesture experts. At the finetune stage, we present the audio ControlNet that incorporates the human voice as condition prompts to guide the gesture generation. Here, we construct the audio ControlNet through a trainable copy of our pre-trained diffusion model. Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block to adaptively fuse the audio embedding from the human speech and the gesture features from the pre-trained gesture experts with a routing mechanism. Such an effective manner ensures audio embedding is temporal coordinated with motion features while preserving the vivid and diverse gesture generation. Extensive experiments demonstrate that our proposed CoCoGesture outperforms the state-of-the-art methods on the zero-shot speech-to-gesture generation. The dataset will be publicly available at: https://mattie-e.github.io/GES-X/
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
A hierarchical Bayesian model for syntactic priming
Authors:
Weijie Xu,
Richard Futrell
Abstract:
The effect of syntactic priming exhibits three well-documented empirical properties: the lexical boost, the inverse frequency effect, and the asymmetrical decay. We aim to show how these three empirical phenomena can be reconciled in a general learning framework, the hierarchical Bayesian model (HBM). The model represents syntactic knowledge in a hierarchical structure of syntactic statistics, whe…
▽ More
The effect of syntactic priming exhibits three well-documented empirical properties: the lexical boost, the inverse frequency effect, and the asymmetrical decay. We aim to show how these three empirical phenomena can be reconciled in a general learning framework, the hierarchical Bayesian model (HBM). The model represents syntactic knowledge in a hierarchical structure of syntactic statistics, where a lower level represents the verb-specific biases of syntactic decisions, and a higher level represents the abstract bias as an aggregation of verb-specific biases. This knowledge is updated in response to experience by Bayesian inference. In simulations, we show that the HBM captures the above-mentioned properties of syntactic priming. The results indicate that some properties of priming which are usually explained by a residual activation account can also be explained by an implicit learning account. We also discuss the model's implications for the lexical basis of syntactic priming.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Fine-Grained Dynamic Framework for Bias-Variance Joint Optimization on Data Missing Not at Random
Authors:
Mingming Ha,
Xuewen Tao,
Wenfang Lin,
Qionxu Ma,
Wujiang Xu,
Linxun Chen
Abstract:
In most practical applications such as recommendation systems, display advertising, and so forth, the collected data often contains missing values and those missing values are generally missing-not-at-random, which deteriorates the prediction performance of models. Some existing estimators and regularizers attempt to achieve unbiased estimation to improve the predictive performance. However, varia…
▽ More
In most practical applications such as recommendation systems, display advertising, and so forth, the collected data often contains missing values and those missing values are generally missing-not-at-random, which deteriorates the prediction performance of models. Some existing estimators and regularizers attempt to achieve unbiased estimation to improve the predictive performance. However, variances and generalization bound of these methods are generally unbounded when the propensity scores tend to zero, compromising their stability and robustness. In this paper, we first theoretically reveal that limitations of regularization techniques. Besides, we further illustrate that, for more general estimators, unbiasedness will inevitably lead to unbounded variance. These general laws inspire us that the estimator designs is not merely about eliminating bias, reducing variance, or simply achieve a bias-variance trade-off. Instead, it involves a quantitative joint optimization of bias and variance. Then, we develop a systematic fine-grained dynamic learning framework to jointly optimize bias and variance, which adaptively selects an appropriate estimator for each user-item pair according to the predefined objective function. With this operation, the generalization bounds and variances of models are reduced and bounded with theoretical guarantees. Extensive experiments are conducted to verify the theoretical results and the effectiveness of the proposed dynamic learning framework.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling
Authors:
Wanghan Xu,
Fenghua Ling,
Wenlong Zhang,
Tao Han,
Hao Chen,
Wanli Ouyang,
Lei Bai
Abstract:
Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mapping rather than fine-grained physical evolution in the time dimension. Consequently, the limitations in the temporal scale of datasets preven…
▽ More
Data-driven artificial intelligence (AI) models have made significant advancements in weather forecasting, particularly in medium-range and nowcasting. However, most data-driven weather forecasting models are black-box systems that focus on learning data mapping rather than fine-grained physical evolution in the time dimension. Consequently, the limitations in the temporal scale of datasets prevent these models from forecasting at finer time scales. This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) which Generalizes weather forecasts to Finer-grained Temporal scales beyond training dataset. Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale (e.g., 300 seconds) and use a parallel neural networks with a learnable router for bias correction. Furthermore, we introduce a lead time-aware training framework to promote the generalization of the model at different lead times. The weight analysis of physics-AI modules indicates that physics conducts major evolution while AI performs corrections adaptively. Extensive experiments show that WeatherGFT trained on an hourly dataset, achieves state-of-the-art performance across multiple lead times and exhibits the capability to generalize 30-minute forecasts.
△ Less
Submitted 29 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
VAE-Var: Variational-Autoencoder-Enhanced Variational Assimilation
Authors:
Yi Xiao,
Qilong Jia,
Wei Xue,
Lei Bai
Abstract:
Data assimilation refers to a set of algorithms designed to compute the optimal estimate of a system's state by refining the prior prediction (known as background states) using observed data. Variational assimilation methods rely on the maximum likelihood approach to formulate a variational cost, with the optimal state estimate derived by minimizing this cost. Although traditional variational meth…
▽ More
Data assimilation refers to a set of algorithms designed to compute the optimal estimate of a system's state by refining the prior prediction (known as background states) using observed data. Variational assimilation methods rely on the maximum likelihood approach to formulate a variational cost, with the optimal state estimate derived by minimizing this cost. Although traditional variational methods have achieved great success and have been widely used in many numerical weather prediction centers, they generally assume Gaussian errors in the background states, which limits the accuracy of these algorithms due to the inherent inaccuracies of this assumption. In this paper, we introduce VAE-Var, a novel variational algorithm that leverages a variational autoencoder (VAE) to model a non-Gaussian estimate of the background error distribution. We theoretically derive the variational cost under the VAE estimation and present the general formulation of VAE-Var; we implement VAE-Var on low-dimensional chaotic systems and demonstrate through experimental results that VAE-Var consistently outperforms traditional variational assimilation methods in terms of accuracy across various observational settings.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
Authors:
Govind Ramesh,
Yao Dou,
Wei Xu
Abstract:
Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by u…
▽ More
Research on jailbreaking has been valuable for testing and understanding the safety and security issues of large language models (LLMs). In this paper, we introduce Iterative Refinement Induced Self-Jailbreak (IRIS), a novel approach that leverages the reflective capabilities of LLMs for jailbreaking with only black-box access. Unlike previous methods, IRIS simplifies the jailbreaking process by using a single model as both the attacker and target. This method first iteratively refines adversarial prompts through self-explanation, which is crucial for ensuring that even well-aligned LLMs obey adversarial instructions. IRIS then rates and enhances the output given the refined prompt to increase its harmfulness. We find IRIS achieves jailbreak success rates of 98% on GPT-4 and 92% on GPT-4 Turbo in under 7 queries. It significantly outperforms prior approaches in automatic, black-box and interpretable jailbreaking, while requiring substantially fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
UAV-VisLoc: A Large-scale Dataset for UAV Visual Localization
Authors:
Wenjia Xu,
Yaxuan Yao,
Jiaqi Cao,
Zhiwei Wei,
Chunbo Liu,
Jiuniu Wang,
Mugen Peng
Abstract:
The application of unmanned aerial vehicles (UAV) has been widely extended recently. It is crucial to ensure accurate latitude and longitude coordinates for UAVs, especially when the global navigation satellite systems (GNSS) are disrupted and unreliable. Existing visual localization methods achieve autonomous visual localization without error accumulation by matching the ground-down view image of…
▽ More
The application of unmanned aerial vehicles (UAV) has been widely extended recently. It is crucial to ensure accurate latitude and longitude coordinates for UAVs, especially when the global navigation satellite systems (GNSS) are disrupted and unreliable. Existing visual localization methods achieve autonomous visual localization without error accumulation by matching the ground-down view image of UAV with the ortho satellite maps. However, collecting UAV ground-down view images across diverse locations is costly, leading to a scarcity of large-scale datasets for real-world scenarios. Existing datasets for UAV visual localization are often limited to small geographic areas or are focused only on urban regions with distinct textures. To address this, we define the UAV visual localization task by determining the UAV's real position coordinates on a large-scale satellite map based on the captured ground-down view. In this paper, we present a large-scale dataset, UAV-VisLoc, to facilitate the UAV visual localization task. This dataset comprises images from diverse drones across 11 locations in China, capturing a range of topographical features. The dataset features images from fixed-wing drones and multi-terrain drones, captured at different altitudes and orientations. Our dataset includes 6,742 drone images and 11 satellite maps, with metadata such as latitude, longitude, altitude, and capture date. Our dataset is tailored to support both the training and testing of models by providing a diverse and extensive data.
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
Liver Fat Quantification Network with Body Shape
Authors:
Qiyue Wang,
Wu Xue,
Xiaoke Zhang,
Fang Jin,
James Hahn
Abstract:
It is critically important to detect the content of liver fat as it is related to cardiac complications and cardiovascular disease mortality. However, existing methods are either associated with high cost and/or medical complications (e.g., liver biopsy, imaging technology) or only roughly estimate the grades of steatosis. In this paper, we propose a deep neural network to estimate the percentage…
▽ More
It is critically important to detect the content of liver fat as it is related to cardiac complications and cardiovascular disease mortality. However, existing methods are either associated with high cost and/or medical complications (e.g., liver biopsy, imaging technology) or only roughly estimate the grades of steatosis. In this paper, we propose a deep neural network to estimate the percentage of liver fat using only body shapes. The proposed is composed of a flexible baseline network and a lightweight Attention module. The attention module is trained to generate discriminative and diverse features which significant improve the performance. In order to validate the method, we perform extensive tests on the public medical dataset. The results verify that our proposed method yields state-of-the-art performance with Root mean squared error (RMSE) of 5.26% and R-Squared value over 0.8. It offers an accurate and more accessible assessment of hepatic steatosis.
△ Less
Submitted 30 May, 2024; v1 submitted 18 May, 2024;
originally announced May 2024.
-
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Authors:
Wanting Xu,
Yang Liu,
Langping He,
Xucheng Huang,
Ling Jiang
Abstract:
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, emp…
▽ More
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
△ Less
Submitted 20 June, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Investigating the Semantic Robustness of CLIP-based Zero-Shot Anomaly Segmentation
Authors:
Kevin Stangl,
Marius Arvinte,
Weilin Xu,
Cory Cornelius
Abstract:
Zero-shot anomaly segmentation using pre-trained foundation models is a promising approach that enables effective algorithms without expensive, domain-specific training or fine-tuning. Ensuring that these methods work across various environmental conditions and are robust to distribution shifts is an open problem. We investigate the performance of WinCLIP [14] zero-shot anomaly segmentation algori…
▽ More
Zero-shot anomaly segmentation using pre-trained foundation models is a promising approach that enables effective algorithms without expensive, domain-specific training or fine-tuning. Ensuring that these methods work across various environmental conditions and are robust to distribution shifts is an open problem. We investigate the performance of WinCLIP [14] zero-shot anomaly segmentation algorithm by perturbing test data using three semantic transformations: bounded angular rotations, bounded saturation shifts, and hue shifts. We empirically measure a lower performance bound by aggregating across per-sample worst-case perturbations and find that average performance drops by up to 20% in area under the ROC curve and 40% in area under the per-region overlap curve. We find that performance is consistently lowered on three CLIP backbones, regardless of model architecture or learning objective, demonstrating a need for careful performance evaluation.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Authentic Hand Avatar from a Phone Scan via Universal Hand Model
Authors:
Gyeongsik Moon,
Weipeng Xu,
Rohan Joshi,
Chenglei Wu,
Takaaki Shiratori
Abstract:
The authentic 3D hand avatar with every identifiable information, such as hand shapes and textures, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. For effec…
▽ More
The authentic 3D hand avatar with every identifiable information, such as hand shapes and textures, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. For effective universal hand modeling, we perform tracking and modeling at the same time, while previous 3D hand models perform them separately. The conventional separate pipeline suffers from the accumulated errors from the tracking stage, which cannot be recovered in the modeling stage. On the other hand, ours does not suffer from the accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address a skin sliding during the tracking and modeling, while existing works have not focused on it much. Finally, using learned priors from our UHM, we effectively adapt our UHM to each person's short phone scan for the authentic hand avatar.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation
Authors:
Jianyi Chen,
Wei Xue,
Xu Tan,
Zhen Ye,
Qifeng Liu,
Yike Guo
Abstract:
Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG, however, this method is extremely slow as it generates semantic and acoustic tokens recursively, and this makes it impossible for real-…
▽ More
Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG, however, this method is extremely slow as it generates semantic and acoustic tokens recursively, and this makes it impossible for real-time applications. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. A non-AR diffusion-based framework is developed, which by carefully designing the conditions inferred from the vocal signals, generates the Mel spectrogram of the target accompaniment directly. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework, and largely accelerates the generation. We also design semantic projection, prior projection blocks as well as a set of loss functions, to ensure the generated accompaniment has semantic and rhythm coherence with the vocal signal. By intensive experimental studies, we demonstrate that the proposed method can generate better samples than SingSong, and accelerate the generation by at least 30 times. Audio samples and code are available at https://fastsag.github.io/.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions
Authors:
Xinglin Chen,
Yishuai Cai,
Yunxin Mao,
Minglong Li,
Wenjing Yang,
Weixia Xu,
Ji Wang
Abstract:
Robots executing tasks following human instructions in domestic or industrial environments essentially require both adaptability and reliability. Behavior Tree (BT) emerges as an appropriate control architecture for these scenarios due to its modularity and reactivity. Existing BT generation methods, however, either do not involve interpreting natural language or cannot theoretically guarantee the…
▽ More
Robots executing tasks following human instructions in domestic or industrial environments essentially require both adaptability and reliability. Behavior Tree (BT) emerges as an appropriate control architecture for these scenarios due to its modularity and reactivity. Existing BT generation methods, however, either do not involve interpreting natural language or cannot theoretically guarantee the BTs' success. This paper proposes a two-stage framework for BT generation, which first employs large language models (LLMs) to interpret goals from high-level instructions, then constructs an efficient goal-specific BT through the Optimal Behavior Tree Expansion Algorithm (OBTEA). We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning. Experiments in the service robot validate the proficiency of LLMs in producing grammatically correct and accurately interpreted goals, demonstrate OBTEA's superiority over the baseline BT Expansion algorithm in various metrics, and finally confirm the practical deployability of our framework. The project website is https://dids-ei.github.io/Project/LLM-OBTEA/.
△ Less
Submitted 27 June, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
A framework of text-dependent speaker verification for chinese numerical string corpus
Authors:
Litong Zheng,
Feng Hong,
Weijie Xu,
Wan Zheng
Abstract:
The Chinese numerical string corpus, serves as a valuable resource for speaker verification, particularly in financial transactions. Researches indicate that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, that can be negatively impa…
▽ More
The Chinese numerical string corpus, serves as a valuable resource for speaker verification, particularly in financial transactions. Researches indicate that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, that can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss including text classification loss, connectionist temporal classification (CTC) loss and decoder loss; while in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
△ Less
Submitted 21 May, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.