-
Galaxy Mass Modelling from Multi-Wavelength JWST Strong Lens Analysis: Dark Matter Substructure, Angular Mass Complexity, or Both?
Authors:
Samuel C. Lange,
Aristeidis Amvrosiadis,
James W. Nightingale,
Qiuhan He,
Carlos S. Frenk,
Andrew Robertson,
Shaun Cole,
Richard Massey,
Xiaoyue Cao,
Ran Li,
Kaihao Wang
Abstract:
We analyze two galaxy-scale strong gravitational lenses, SPT0418-47 and SPT2147-50, using JWST NIRCam imaging across multiple filters. To account for angular complexity in the lens mass distribution, we introduce multipole perturbations with orders $m=1, 3, 4$. Our results show strong evidence for angular mass complexity in SPT2147, with multipole strengths of 0.3-1.7% for $m=3, 4$ and 2.4-9.5% for $m=1$, while SPT0418 shows no such preference. We also test lens models that include a dark matter substructure, finding a strong preference for a substructure in SPT2147-50 with a Bayes factor (log-evidence change) of $\sim 60$ when multipoles are not included. Including multipoles reduces the Bayes factor to $\sim 11$, still corresponding to a $5\sigma$ detection of a subhalo with an NFW mass of $\log_{10}(M_{200}/M_{\odot}) = 10.87^{+0.53}_{-0.71}$. While SPT2147-50 may represent the fourth detection of a dark matter substructure in a strong lens, further analysis is needed to confirm that the signal is not due to systematics associated with the lens mass model.
Submitted 16 October, 2024;
originally announced October 2024.
-
Preference Optimization with Multi-Sample Comparisons
Authors:
Chaoqi Wang,
Zhuokai Zhao,
Chen Zhu,
Karthik Abinav Sankararaman,
Michal Valko,
Xuefei Cao,
Zhaorun Chen,
Madian Khabsa,
Yuxin Chen,
Hao Ma,
Sinong Wang
Abstract:
Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics (e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
Submitted 15 October, 2024;
originally announced October 2024.
-
SpeGCL: Self-supervised Graph Spectrum Contrastive Learning without Positive Samples
Authors:
Yuntao Shou,
Xiangyong Cao,
Deyu Meng
Abstract:
Graph Contrastive Learning (GCL) excels at managing noise and fluctuations in input data, making it popular in various fields (e.g., social networks and knowledge graphs). Our study finds that the difference in high-frequency information between augmented graphs is greater than that in low-frequency information. However, most existing GCL methods focus mainly on the time domain (low-frequency information) for node feature representations and cannot make good use of high-frequency information to speed up model convergence. Furthermore, existing GCL paradigms optimize graph embedding representations by pulling positive sample pairs closer together and pushing positive and negative sample pairs farther apart, but our theoretical analysis shows that graph contrastive learning benefits from pushing negative pairs farther away rather than pulling positive pairs closer. To solve the above-mentioned problems, we propose a novel spectral GCL framework without positive samples, named SpeGCL. Specifically, to solve the problem that existing GCL methods cannot utilize high-frequency information, SpeGCL uses a Fourier transform to extract high-frequency and low-frequency information of node features, and constructs a contrastive learning mechanism in a Fourier space to obtain better node feature representation. Furthermore, SpeGCL relies entirely on negative samples to refine the graph embedding. We also provide a theoretical justification for the efficacy of using only negative samples in SpeGCL. Extensive experiments on unsupervised learning, transfer learning, and semi-supervised learning have validated the superiority of our SpeGCL framework over the state-of-the-art GCL methods.
Submitted 14 October, 2024;
originally announced October 2024.
-
Chain-of-Restoration: Multi-Task Image Restoration Models are Zero-Shot Step-by-Step Universal Image Restorers
Authors:
Jin Cao,
Deyu Meng,
Xiangyong Cao
Abstract:
Despite previous works typically targeting isolated degradation types, recent research has increasingly focused on addressing composite degradations which involve a complex interplay of multiple different isolated degradations. Recognizing the challenges posed by the exponential number of possible degradation combinations, we propose Universal Image Restoration (UIR), a new task setting that requires models to be trained on a set of degradation bases and then remove any degradation that these bases can potentially compose in a zero-shot manner. Inspired by the Chain-of-Thought which prompts LLMs to address problems step-by-step, we propose the Chain-of-Restoration (CoR), which instructs models to step-by-step remove unknown composite degradations. By integrating a simple Degradation Discriminator into pre-trained multi-task models, CoR facilitates the process where models remove one degradation basis per step, continuing this process until the image is fully restored from the unknown composite degradation. Extensive experiments show that CoR significantly improves model performance in removing composite degradations, achieving results comparable to or surpassing those of State-of-The-Art (SoTA) methods trained on all degradations. The code will be released at https://github.com/toummHus/Chain-of-Restoration.
Submitted 11 October, 2024;
originally announced October 2024.
-
Boosting the Performance of Decentralized Federated Learning via Catalyst Acceleration
Authors:
Qinglun Li,
Miao Zhang,
Yingqi Liu,
Quanjun Yin,
Li Shen,
Xiaochun Cao
Abstract:
Decentralized Federated Learning has emerged as an alternative to centralized architectures due to its faster training, privacy preservation, and reduced communication overhead. In decentralized communication, the server aggregation phase in Centralized Federated Learning shifts to the client side, which means that clients connect with each other in a peer-to-peer manner. However, compared to the centralized mode, data heterogeneity in Decentralized Federated Learning will cause larger variances between aggregated models, which leads to slow convergence in training and poor generalization performance in tests. To address these issues, we introduce Catalyst Acceleration and propose an acceleration Decentralized Federated Learning algorithm called DFedCata. It consists of two main components: the Moreau envelope function, which primarily addresses parameter inconsistencies among clients caused by data heterogeneity, and Nesterov's extrapolation step, which accelerates the aggregation phase. Theoretically, we prove the optimization error bound and generalization error bound of the algorithm, providing a further understanding of the nature of the algorithm and the theoretical perspectives on the hyperparameter choice. Empirically, we demonstrate the advantages of the proposed algorithm in both convergence speed and generalization performance on CIFAR10/100 with various non-iid data distributions. Furthermore, we also experimentally verify the theoretical properties of DFedCata.
Submitted 9 October, 2024;
originally announced October 2024.
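The DFedCata abstract above mentions a Nesterov extrapolation step. As a rough illustration only, the textbook form of that step can be sketched as follows. The function name, the `beta` momentum value, and the use of flat parameter lists are assumptions made for this sketch and are not taken from the paper, whose actual update rule may differ.

```python
def nesterov_extrapolate(current, previous, beta=0.9):
    """Textbook Nesterov extrapolation: y = x_t + beta * (x_t - x_{t-1}).

    `current` and `previous` are flat lists of model parameters from two
    consecutive rounds. `beta` is a hypothetical momentum coefficient.
    """
    if len(current) != len(previous):
        raise ValueError("parameter vectors must have the same length")
    return [x + beta * (x - x_prev) for x, x_prev in zip(current, previous)]


if __name__ == "__main__":
    # Parameters moved from [1.0, 2.0] to [2.0, 4.0]; extrapolate further
    # along the same direction before the next aggregation round.
    print(nesterov_extrapolate([2.0, 4.0], [1.0, 2.0], beta=0.5))
```

The extrapolated point continues along the direction of the most recent change, which is the general mechanism by which Nesterov-style methods speed up convergence.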
-
Exponents for Shared Randomness-Assisted Channel Simulation
Authors:
Aadil Oufkir,
Michael X. Cao,
Hao-Chung Cheng,
Mario Berta
Abstract:
We determine the exact error and strong converse exponents of shared randomness-assisted channel simulation in worst-case total-variation distance. Namely, we find that these exponents can be written as simple optimizations over the Rényi channel mutual information. Strikingly, and in stark contrast to channel coding, there are no critical rates, allowing a tight characterization for arbitrary rates below and above the simulation capacity. We derive our results by asymptotically expanding the meta-converse for channel simulation [Cao et al., IEEE Trans. Inf. Theory (2024)], which corresponds to non-signaling assisted codes. We prove this to be asymptotically tight by employing the approximation algorithms from [Berta et al., Proc. IEEE ISIT (2024)], which show how to round any non-signaling assisted strategy to a strategy that only uses shared randomness. Notably, this implies that any additional quantum entanglement-assistance does not change the error or the strong converse exponents.
Submitted 9 October, 2024;
originally announced October 2024.
-
Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
Authors:
Benyuan Meng,
Qianqian Xu,
Zitai Wang,
Zhiyong Yang,
Xiaochun Cao,
Qingming Huang
Abstract:
Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.
Submitted 10 October, 2024; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Experimental coherent-state quantum secret sharing with finite pulses
Authors:
Yuan-Zhuo Wang,
Xiao-Ran Sun,
Xiao-Yu Cao,
Hua-Lei Yin,
Zeng-Bing Chen
Abstract:
Quantum secret sharing (QSS) plays a significant role in multiparty quantum communication and is a crucial component of future quantum multiparty computing networks. Therefore, it is highly valuable to develop a QSS protocol that offers both information-theoretic security and validation in real optical systems under a finite-key regime. In this work, we propose a three-user QSS protocol based on phase-encoding technology. By adopting symmetric procedures for the two players, our protocol resolves the security loopholes introduced by asymmetric basis choice without prior knowledge of the identity of the malicious player. Kato's concentration inequality is exploited to provide security against coherent attacks with the finite-key effect. Moreover, the practicality of our protocol has been validated under a 30-dB channel loss with a transmission distance of 5-km fiber. Our protocol achieves secure key rates ranging from 432 to 192 bps by choosing different pulse intensities and basis selection probabilities. Offering enhanced security and practicality, our protocol stands as an essential element for the realization of quantum multiparty computing networks.
Submitted 8 October, 2024;
originally announced October 2024.
-
Path Planning and Robust Path Tracking Control of an Automated Parallel Parking Maneuver
Authors:
Xincheng Cao,
Levent Guvenc
Abstract:
Self-driving vehicles should be able to perform parallel parking or a similar maneuver successfully. With this motivation, the S-shaped maneuverability test of the Ohio driver license examination is chosen here for automatic execution by a self-driving vehicle with drive-by-wire capability and longitudinal and lateral controls. The Ohio maneuverability test requires the driver to start within an area enclosed by four pylons and the driver is asked to go to the left of the fifth pylon directly in front of the vehicle in a smooth and continuous manner while ending in a parallel direction to the initial one. The driver is then asked to go backwards to the starting location of the vehicle without stopping the vehicle or hitting the pylons. As a self-driving vehicle should do a much better job repeatably than a driver, a high-order polynomial path model is built along with speed profiling to start and stop smoothly at the ends of the path without large longitudinal and lateral accelerations. In contrast to the long-horizon, higher-speed path planning and path tracking control applications in the literature, this paper treats low-speed and very short-horizon path planning and path tracking control with stopping and direction reversal. The path is constructed using a segmented polynomial fit optimization routine that guarantees path curvature smoothness. A linear path tracking model is utilized as the basis of the designed control system consisting of a disturbance-observer-based curvature rejection filter and a speed-scheduled, parameter-space robust PID controller. Simulation studies indicate that the designed control system performs better than other common control systems such as a standalone PID controller and combined PID and feedforward control.
Submitted 6 October, 2024;
originally announced October 2024.
-
Vehicle-in-Virtual-Environment Method for ADAS and Connected and Automated Driving Function Development/Demonstration/Evaluation
Authors:
Xincheng Cao,
Haochong Chen,
Bilin Aksun-Guvenc,
Levent Guvenc
Abstract:
The current approach for new Advanced Driver Assistance System (ADAS) and Connected and Automated Driving (CAD) function development involves a significant amount of public road testing, which is inefficient due to the number of miles that need to be driven for rare and extreme events to take place, and is therefore very costly. It is also unsafe, as the rest of the road users become involuntary test subjects. A new method for safe, efficient, and repeatable development, demonstration, and evaluation of ADAS and CAD functions, called Vehicle-in-Virtual-Environment (VVE), was recently introduced as a solution to this problem. The vehicle is operated in a large, empty, and flat area during VVE while its localization and perception sensor data is fed from the virtual environment, with other traffic and rare and extreme events being generated as needed. The virtual environment can be easily configured and modified to construct different testing scenarios on demand. This paper focuses on the VVE approach and introduces the coordinate transformations needed to sync pose (location and orientation) in the virtual and physical worlds and the handling of localization and perception sensor data, using the highly realistic 3D simulation model of a recent autonomous shuttle deployment site in Columbus, Ohio as the virtual world. As a further example that uses multiple actors, the use of VVE for Vehicle-to-VRU communication based Vulnerable Road User (VRU) safety is presented in the paper using VVE experiments and real pedestrian(s) in a safe and repeatable manner. VVE experiments are used to demonstrate the efficacy of the method.
Submitted 5 October, 2024;
originally announced October 2024.
-
Pareto Control Barrier Function for Inner Safe Set Maximization Under Input Constraints
Authors:
Xiaoyang Cao,
Zhe Fu,
Alexandre M. Bayen
Abstract:
This article introduces the Pareto Control Barrier Function (PCBF) algorithm to maximize the inner safe set of dynamical systems under input constraints. Traditional Control Barrier Functions (CBFs) ensure safety by maintaining system trajectories within a safe set but often fail to account for realistic input constraints. To address this problem, we leverage the Pareto multi-task learning framework to balance competing objectives of safety and safe set volume. The PCBF algorithm is applicable to high-dimensional systems and is computationally efficient. We validate its effectiveness through comparison with Hamilton-Jacobi reachability for an inverted pendulum and through simulations on a 12-dimensional quadrotor system. Results show that the PCBF consistently outperforms existing methods, yielding larger safe sets and ensuring safety under input constraints.
Submitted 5 October, 2024;
originally announced October 2024.
-
Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features
Authors:
Benyuan Meng,
Qianqian Xu,
Zitai Wang,
Xiaochun Cao,
Qingming Huang
Abstract:
Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full-scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.
Submitted 10 October, 2024; v1 submitted 4 October, 2024;
originally announced October 2024.
-
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
Authors:
Kaiyu Li,
Ruixun Liu,
Xiangyong Cao,
Deyu Meng,
Zhi Wang
Abstract:
Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To address this, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, due to the sensitivity of remote sensing images to low-resolution features, distorted target shapes and ill-fitting boundaries are exhibited in the prediction mask. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose to execute a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves an average of 5.8%, 8.2%, 4%, and 15.3% improvement over state-of-the-art methods on 4 tasks. All code is released at https://earth-insights.github.io/SegEarth-OV
Submitted 2 October, 2024;
originally announced October 2024.
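The SegEarth-OV abstract above describes subtracting the [CLS] token's contribution from patch tokens to reduce global bias. A minimal sketch of one way such a subtraction could look is given here. The function name, the `weight` scaling factor, and the plain-list token representation are all assumptions made for illustration. The abstract does not specify the exact form of the operation.

```python
def debias_patch_tokens(patch_tokens, cls_token, weight=0.3):
    """Subtract a scaled copy of the [CLS] token from every patch token.

    `patch_tokens` is a list of feature vectors (one per image patch) and
    `cls_token` is a single feature vector of the same dimension.
    `weight` is a hypothetical scaling factor, not a value from the paper.
    """
    dim = len(cls_token)
    for token in patch_tokens:
        if len(token) != dim:
            raise ValueError("every patch token must match the [CLS] dimension")
    return [
        [p - weight * c for p, c in zip(token, cls_token)]
        for token in patch_tokens
    ]


if __name__ == "__main__":
    patches = [[1.0, 2.0], [3.0, 4.0]]
    cls = [1.0, 2.0]
    print(debias_patch_tokens(patches, cls, weight=0.5))
```

Each patch token keeps its local information while the component shared with the global [CLS] representation is reduced in proportion to `weight`.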
-
Towards Native Generative Model for 3D Head Avatar
Authors:
Yiyu Zhuang,
Yuxiao He,
Jiawei Zhang,
Yanwen Wang,
Jiahe Zhu,
Yao Yao,
Siyu Zhu,
Xun Cao,
Hao Zhu
Abstract:
Creating 3D head avatars is a significant yet challenging task for many application scenarios. Previous studies have set out to learn 3D human head generative models using massive 2D image data. Although these models are highly generalizable for human appearance, their resulting models are not 360$^\circ$-renderable, and the predicted 3D geometry is unreliable. Therefore, such results cannot be used in VR, game modeling, and other scenarios that require 360$^\circ$-renderable 3D head models. An intuitive idea is that 3D head models that are limited in number but high in 3D accuracy are more reliable training data for a high-quality 3D generative model. In this vein, we delve into how to learn a native generative model for 360$^\circ$ full heads from a limited 3D head dataset. Specifically, three major problems are studied: 1) how to effectively utilize various representations for generating the 360$^\circ$-renderable human head; 2) how to disentangle the appearance, shape, and motion of human faces to generate a 3D head model that can be edited by appearance and driven by motion; and 3) how to extend the generalization capability of the generative model to support downstream tasks. Comprehensive experiments are conducted to verify the effectiveness of the proposed model. We hope the proposed models and artist-designed dataset can inspire future research on learning native generative 3D head models from limited 3D datasets.
Submitted 2 October, 2024;
originally announced October 2024.
-
E-Healthcare Systems: Integrated Sensing, Computing, and Semantic Communication with Physical Layer Security
Authors:
Yinchao Yang,
Zhaohui Yang,
Weijie Yuan,
Fan Liu,
Xiaowen Cao,
Chongwen Huang,
Zhaoyang Zhang,
Mohammad Shikh-Bahaei
Abstract:
This paper introduces an integrated sensing, computing, and semantic communication (ISCSC) framework tailored for smart healthcare systems. The framework is evaluated in the context of smart healthcare, optimising the transmit beamforming matrix and semantic extraction ratio for improved data rates, sensing accuracy, and general data protection regulation (GDPR) compliance, while considering IoRT device computing capabilities. Semantic metrics such as semantic transmission rate and semantic secrecy rate are derived to evaluate data rate performance and GDPR risk, respectively, while the Cramér-Rao Bound (CRB) assesses sensing performance. Simulation results demonstrate the framework's effectiveness in ensuring reliable sensing, high data rates, and secure communication.
Submitted 30 September, 2024;
originally announced September 2024.
-
SemiDDM-Weather: A Semi-supervised Learning Framework for All-in-one Adverse Weather Removal
Authors:
Fang Long,
Wenkang Su,
Zixuan Li,
Lei Cai,
Mingjie Li,
Yuan-Gen Wang,
Xiaochun Cao
Abstract:
Adverse weather removal aims to restore clear vision under adverse weather conditions. Existing methods are mostly tailored for specific weather types and rely heavily on extensive labeled data. In dealing with these two limitations, this paper presents a pioneering semi-supervised all-in-one adverse weather removal framework built on the teacher-student network with a Denoising Diffusion Model (DDM) as the backbone, termed SemiDDM-Weather. As for the design of DDM backbone in our SemiDDM-Weather, we adopt the SOTA Wavelet Diffusion Model-Wavediff with customized inputs and loss functions, devoted to facilitating the learning of many-to-one mapping distributions for efficient all-in-one adverse weather removal with limited label data. To mitigate the risk of misleading model training due to potentially inaccurate pseudo-labels generated by the teacher network in semi-supervised learning, we introduce quality assessment and content consistency constraints to screen the "optimal" outputs from the teacher network as the pseudo-labels, thus more effectively guiding the student network training with unlabeled data. Experimental results show that on both synthetic and real-world datasets, our SemiDDM-Weather consistently delivers high visual quality and superior adverse weather removal, even when compared to fully supervised competitors. Our code and pre-trained model are available at this repository.
Submitted 29 September, 2024;
originally announced September 2024.
-
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats
Authors:
Kuanrong Liu,
Siyuan Liang,
Jiawei Liang,
Pengwen Dai,
Xiaochun Cao
Abstract:
Multimodal contrastive learning uses various data modalities to create high-quality features, but its reliance on extensive data sources on the Internet makes it vulnerable to backdoor attacks. These attacks insert malicious behaviors during training, which are activated by specific triggers during inference, posing significant security risks. Despite existing countermeasures through fine-tuning that reduce the malicious impacts of such attacks, these defenses frequently necessitate extensive training time and degrade clean accuracy. In this study, we propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning. This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities, known as Unlearn Backdoor Threats (UBT). We specifically use overfit training to improve backdoor shortcuts and accurately detect suspicious samples in the potential poisoning data set. Then, we select fewer unlearned samples from suspicious samples for rapid forgetting in order to eliminate the backdoor effect and thus improve backdoor defense efficiency. In the backdoor unlearning process, we present a novel token-based portion unlearning training regime. This technique focuses on the model's compromised elements, dissociating backdoor correlations while maintaining the model's overall integrity. Extensive experimental results show that our method effectively defends against various backdoor attack methods in the CLIP model. Compared to SoTA backdoor defense methods, UBT achieves the lowest attack success rate while maintaining a high clean accuracy of the model (attack success rate decreases by 19% compared to SOTA, while clean accuracy increases by 2.57%).
Submitted 28 September, 2024;
originally announced September 2024.
-
Finite-time blow-up in fully parabolic quasilinear Keller-Segel systems with supercritical exponents
Authors:
Xinru Cao,
Mario Fuest
Abstract:
We examine the possibility of finite-time blow-up of solutions to the fully parabolic quasilinear Keller--Segel model \begin{align}\tag{$\star$}\label{prob:star}
\begin{cases}
u_t = \nabla \cdot ((u+1)^{m-1}\nabla u - u(u+1)^{q-1}\nabla v) & \text{in $\Omega\times (0, T)$}, \\
v_t = \Delta v - v + u & \text{in $\Omega\times (0, T)$}
\end{cases} \end{align} in a ball $\Omega\subset \mathbb R^n$ with $n\geq 2$. Previous results show that unbounded solutions exist for all $m, q \in \mathbb R$ with $m-q<\frac{n-2}{n}$, which, however, are necessarily global in time if $q \leq 0$. It is expected that finite-time blow-up is possible whenever $q > 0$ but in the fully parabolic setting this has so far only been shown when $\max\{m, q\} \geq 1$.
In the present paper, we substantially extend these findings. Our main results for the two- and three-dimensional settings state that \eqref{prob:star} admits solutions blowing up in finite time if \begin{align*}
m-q<\frac{n-2}{n}
\quad \text{and} \quad
\begin{cases}
q < 2m & \text{if } n = 2, \\
q < 2m - \frac23 \text{ or } m > \frac23 & \text{if } n = 3,
\end{cases} \end{align*} that is, also for certain $m, q$ with $\max\{m, q\} < 1$. As a key new ingredient in our proof, we make use of (singular) pointwise upper estimates for $u$.
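The inequalities quoted above for $n = 2$ and $n = 3$ can be checked mechanically for a given pair $(m, q)$. The following is a minimal sketch and is not taken from the paper: the function name `blowup_condition_holds` is hypothetical, and it encodes only the condition exactly as stated in this abstract. Since that condition is sufficient rather than necessary, a result of `False` says nothing about global existence.

```python
from fractions import Fraction


def blowup_condition_holds(m, q, n):
    """Check the sufficient blow-up condition quoted in the abstract.

    Hypothetical helper: m and q are the exponents of the quasilinear
    Keller-Segel system, n is the space dimension (only 2 or 3 is covered).
    """
    if n not in (2, 3):
        raise ValueError("the quoted condition covers n = 2 and n = 3 only")
    m, q = Fraction(m), Fraction(q)

    # First requirement: m - q < (n - 2) / n.
    supercritical = m - q < Fraction(n - 2, n)

    # Second requirement depends on the dimension.
    if n == 2:
        extra = q < 2 * m
    else:
        extra = q < 2 * m - Fraction(2, 3) or m > Fraction(2, 3)

    return supercritical and extra
```

For instance, `blowup_condition_holds(Fraction(1, 2), Fraction(3, 4), 2)` evaluates a pair with $\max\{m, q\} < 1$, the regime the abstract highlights as new. Exact rationals are used so that boundary cases such as $q = 2m$ are not blurred by floating-point rounding.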
Submitted 28 September, 2024;
originally announced September 2024.
-
Probing mental health information in speech foundation models
Authors:
Marc de Gennes,
Adrien Lesage,
Martin Denais,
Xuan-Nga Cao,
Simon Chang,
Pierre Van Remoortere,
Cyrille Dakhlia,
Rachid Riad
Abstract:
Non-invasive methods for diagnosing mental health conditions, such as speech analysis, offer promising potential in modern medicine. Recent advancements in machine learning, particularly speech foundation models, have shown significant promise in detecting mental health states by capturing diverse features. This study investigates which pretext tasks in these models best transfer to mental health detection and examines how different model layers encode features relevant to mental health conditions. We also probed the optimal length of audio segments and the best pooling strategies to improve detection accuracy. Using the Callyope-GP and Androids datasets, we evaluated the models' effectiveness across different languages and speech tasks, aiming to enhance the generalizability of speech-based mental health diagnostics. Our approach achieved SOTA scores in depression detection on the Androids dataset.
Submitted 27 September, 2024;
originally announced September 2024.
-
Computation Pre-Offloading for MEC-Enabled Vehicular Networks via Trajectory Prediction
Authors:
Ting Zhang,
Bo Yang,
Zhiwen Yu,
Xuelin Cao,
George C. Alexandropoulos,
Yan Zhang,
Chau Yuen
Abstract:
Task offloading is of paramount importance to efficiently orchestrate vehicular wireless networks, necessitating the availability of information regarding the current network status and computational resources. However, due to the mobility of the vehicles and the limited computational resources for performing task offloading in near-real-time, such schemes may incur high latency and thus even become infeasible. To address this issue, we present a Trajectory Prediction-based Pre-offloading Decision (TPPD) algorithm that analyzes the historical trajectories of vehicles to predict their future coordinates, thereby allowing computational resources to be allocated in advance. We first utilize the Long Short-Term Memory (LSTM) network model to predict each vehicle's movement trajectory. Then, based on the task requirements and the predicted trajectories, we devise a dynamic resource allocation algorithm using a Double Deep Q-Network (DDQN) that enables the edge server to minimize task processing delay while ensuring effective utilization of the available computational resources. Our simulation results verify the effectiveness of the proposed approach, showing that, compared with traditional real-time task offloading strategies, the proposed TPPD algorithm significantly reduces task processing delay while improving resource utilization.
Submitted 26 September, 2024;
originally announced September 2024.
-
P4Q: Learning to Prompt for Quantization in Visual-language Models
Authors:
Huixin Sun,
Runqi Wang,
Yanjing Li,
Xianbin Cao,
Xiaolong Jiang,
Yao Hu,
Baochang Zhang
Abstract:
Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs and are therefore urgently needed. There are two prevailing paradigms in quantization: Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incurs a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization, named "Prompt for Quantization" (P4Q), in which we design a lightweight architecture that leverages contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior art, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4$\times$ while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.
Submitted 26 September, 2024;
originally announced September 2024.
-
TA-Cleaner: A Fine-grained Text Alignment Backdoor Defense Strategy for Multimodal Contrastive Learning
Authors:
Yuan Xun,
Siyuan Liang,
Xiaojun Jia,
Xinwei Liu,
Xiaochun Cao
Abstract:
Pre-trained large models for multimodal contrastive learning, such as CLIP, have been widely recognized in the industry as highly susceptible to data-poisoned backdoor attacks. This poses significant risks to downstream model training. In response to such potential threats, fine-tuning offers a simpler and more efficient defense choice compared to retraining large models with augmented data. In the supervised learning domain, fine-tuning defense strategies can achieve excellent defense performance. However, in the unsupervised and semi-supervised domain, we find that when CLIP faces some complex attack techniques, the existing fine-tuning defense strategy, CleanCLIP, has some limitations on defense performance. The synonym substitution of its text-augmentation is insufficient to enhance the text feature space. To compensate for this weakness, we improve it by proposing a fine-grained Text Alignment Cleaner (TA-Cleaner) to cut off feature connections of backdoor triggers. We randomly select a few samples for positive and negative subtext generation at each epoch of CleanCLIP, and align the subtexts to the images to strengthen the text self-supervision. We evaluate the effectiveness of our TA-Cleaner against six attack algorithms and conduct comprehensive zero-shot classification tests on ImageNet1K. Our experimental results demonstrate that TA-Cleaner achieves state-of-the-art defensiveness among fine-tuning-based defense techniques. Even when faced with the novel attack technique BadCLIP, our TA-Cleaner outperforms CleanCLIP by reducing the ASR of Top-1 and Top-10 by 52.02% and 63.88%, respectively.
Submitted 7 October, 2024; v1 submitted 26 September, 2024;
originally announced September 2024.
-
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors
Authors:
Aiping Zhang,
Zongsheng Yue,
Renjing Pei,
Wenqi Ren,
Xiaochun Cao
Abstract:
Diffusion-based image super-resolution (SR) methods have achieved remarkable success by leveraging large pre-trained text-to-image diffusion models as priors. However, these methods still face two challenges: the requirement for dozens of sampling steps to achieve satisfactory results, which limits efficiency in real scenarios, and the neglect of degradation models, which are critical auxiliary information in solving the SR problem. In this work, we introduce a novel one-step SR model, which significantly addresses the efficiency issue of diffusion-based SR methods. Unlike existing fine-tuning strategies, we design a degradation-guided Low-Rank Adaptation (LoRA) module specifically for SR, which corrects the model parameters based on the pre-estimated degradation information from low-resolution images. This module not only facilitates a powerful data-dependent or degradation-dependent SR model but also preserves the generative prior of the pre-trained diffusion model as much as possible. Furthermore, we tailor a novel training pipeline by introducing an online negative sample generation strategy. Combined with the classifier-free guidance strategy during inference, it largely improves the perceptual quality of the super-resolution results. Extensive experiments demonstrate the superior efficiency and effectiveness of the proposed model compared to recent state-of-the-art methods.
Submitted 25 September, 2024;
originally announced September 2024.
-
Adversarial Backdoor Defense in CLIP
Authors:
Junhao Kuang,
Siyuan Liang,
Jiawei Liang,
Kuanrong Liu,
Xiaochun Cao
Abstract:
Multimodal contrastive pretraining, exemplified by models like CLIP, has been found to be vulnerable to backdoor attacks. While current backdoor defense methods primarily employ conventional data augmentation to create augmented samples aimed at feature alignment, these methods fail to capture the distinct features of backdoor samples, resulting in suboptimal defense performance. Observations reveal that adversarial examples and backdoor samples exhibit similarities in the feature space within the compromised models. Building on this insight, we propose Adversarial Backdoor Defense (ABD), a novel data augmentation strategy that aligns features with meticulously crafted adversarial examples. This approach effectively disrupts the backdoor association. Our experiments demonstrate that ABD provides robust defense against both traditional uni-modal and multimodal backdoor attacks targeting CLIP. Compared to the current state-of-the-art defense method, CleanCLIP, ABD reduces the attack success rate by 8.66% for BadNet, 10.52% for Blended, and 53.64% for BadCLIP, while maintaining a minimal average decrease of just 1.73% in clean accuracy.
Submitted 24 September, 2024;
originally announced September 2024.
-
Towards Social AI: A Survey on Understanding Social Interactions
Authors:
Sangmin Lee,
Minzhi Li,
Bolin Lai,
Wenqi Jia,
Fiona Ryan,
Xu Cao,
Ozgur Kara,
Bikram Boote,
Weiyan Shi,
Diyi Yang,
James M. Rehg
Abstract:
Social interactions form the foundation of human societies. Artificial intelligence has made significant progress in certain areas, but enabling machines to seamlessly understand social interactions remains an open challenge. It is important to address this gap by endowing machines with social capabilities. We identify three key capabilities needed for effective social understanding: 1) understanding multimodal social cues, 2) understanding multi-party dynamics, and 3) understanding beliefs. Building upon these foundations, we classify and review existing machine learning works on social understanding from the perspectives of verbal, non-verbal, and multimodal social cues. The verbal branch focuses on understanding linguistic signals such as speaker intent, dialogue sentiment, and commonsense reasoning. The non-verbal branch addresses techniques for perceiving social meaning from visual behaviors such as body gestures, gaze patterns, and facial expressions. The multimodal branch covers approaches that integrate verbal and non-verbal multimodal cues to holistically interpret social interactions such as recognizing emotions, conversational dynamics, and social situations. By reviewing the scope and limitations of current approaches and benchmarks, we aim to clarify the development trajectory and illuminate the path towards more comprehensive intelligence for social understanding. We hope this survey will spur further research interest and insights into this area.
Submitted 30 September, 2024; v1 submitted 5 September, 2024;
originally announced September 2024.
-
ID-Guard: A Universal Framework for Combating Facial Manipulation via Breaking Identification
Authors:
Zuomin Qu,
Wei Lu,
Xiangyang Luo,
Qian Wang,
Xiaochun Cao
Abstract:
The misuse of deep learning-based facial manipulation poses a potential threat to civil rights. To prevent this fraud at its source, proactive defense technology was proposed to disrupt the manipulation process by adding invisible adversarial perturbations into images, making the forged output unconvincing to the observer. However, the non-directional disruption produced by these methods may leave the identity information of the person in the image intact, leading to stigmatization of the individual. In this paper, we propose a novel universal framework for combating facial manipulation, called ID-Guard. Specifically, this framework requires only a single forward pass of an encoder-decoder network to generate a cross-model universal adversarial perturbation corresponding to a specific facial image. To ensure anonymity in manipulated facial images, a novel Identity Destruction Module (IDM) is introduced to destroy the identifiable information in forged faces in a targeted manner. Additionally, we optimize the perturbations by treating the disruption of different facial manipulations as a multi-task learning problem and design a dynamic weights strategy to improve cross-model performance. The proposed framework reports impressive results in defending against multiple widely used facial manipulations, effectively distorting the identifiable regions in the manipulated facial images. In addition, our experiments reveal ID-Guard's ability to enable disrupted images to evade face inpainting and open-source image recognition systems.
Submitted 20 September, 2024;
originally announced September 2024.
-
JoyHallo: Digital human model for Mandarin
Authors:
Sheng Shi,
Xuyang Cao,
Jun Zhao,
Guoxin Wang
Abstract:
In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its strong ability to generate English videos, demonstrating excellent cross-language generation capabilities. The code and models are available at https://jdh-algo.github.io/JoyHallo.
Submitted 20 September, 2024;
originally announced September 2024.
-
Geometry and Analysis of Gradient Ricci Solitons in Dimension Four
Authors:
Xiaodong Cao,
Hung Tran
Abstract:
[Dedicated to Richard S. Hamilton on forty years of Ricci flow] Gradient Ricci solitons have garnered significant attention both as self-similar solutions and singularity models of the Ricci flow. This survey article starts with a list of examples; it also provides some geometric aspects of gradient Ricci solitons, including various asymptotic behaviors; finally, it discusses some recent results on classification and rigidity. In particular, this survey focuses on dimension four.
Submitted 19 September, 2024;
originally announced September 2024.
-
HSIGene: A Foundation Model For Hyperspectral Image Generation
Authors:
Li Pang,
Datao Tang,
Shuang Xu,
Deyu Meng,
Xiangyong Cao
Abstract:
Hyperspectral image (HSI) plays a vital role in various fields such as agriculture and environmental monitoring. However, due to the expensive acquisition cost, the number of hyperspectral images is limited, degenerating the performance of downstream tasks. Although some recent studies have attempted to employ diffusion models to synthesize HSIs, they still struggle with the scarcity of HSIs, affecting the reliability and diversity of the generated images. Some studies propose to incorporate multi-modal data to enhance spatial diversity, but the spectral fidelity cannot be ensured. In addition, existing HSI synthesis models are typically uncontrollable or only support single-condition control, limiting their ability to generate accurate and reliable HSIs. To alleviate these issues, we propose HSIGene, a novel HSI generation foundation model which is based on latent diffusion and supports multi-condition control, allowing for more precise and reliable HSI generation. To enhance the spatial diversity of the training data while preserving spectral fidelity, we propose a new data augmentation method based on spatial super-resolution, in which HSIs are upscaled first, and thus abundant training patches could be obtained by cropping the high-resolution HSIs. In addition, to improve the perceptual quality of the augmented data, we introduce a novel two-stage HSI super-resolution framework, which first applies RGB bands super-resolution and then utilizes our proposed Rectangular Guided Attention Network (RGAN) for guided HSI super-resolution. Experiments demonstrate that the proposed model is capable of generating a vast quantity of realistic HSIs for downstream tasks such as denoising and super-resolution. The code and models are available at https://github.com/LiPang/HSIGene.
Submitted 19 September, 2024;
originally announced September 2024.
-
Infrared Small Target Detection in Satellite Videos: A New Dataset and A Novel Recurrent Feature Refinement Framework
Authors:
Xinyi Ying,
Li Liu,
Zaipin Lin,
Yangsi Shi,
Yingqian Wang,
Ruojing Li,
Xu Cao,
Boyang Li,
Shilin Zhou
Abstract:
Multi-frame infrared small target (MIRST) detection in satellite videos is a long-standing, fundamental yet challenging task, and the challenges can be summarized as follows. First, extremely small target size, highly complex clutter and noise, and various satellite motions result in limited feature representation, high false alarms, and difficult motion analyses. Second, the lack of a large-scale publicly available MIRST dataset in satellite videos greatly hinders algorithm development. To address these challenges, we first build a large-scale dataset for MIRST detection in satellite videos (namely IRSatVideo-LEO), and then develop a recurrent feature refinement (RFR) framework as the baseline method. Specifically, IRSatVideo-LEO is a semi-simulated dataset with synthesized satellite motion, target appearance, trajectory and intensity, which can provide a standard toolbox for satellite video generation and a reliable evaluation platform to facilitate algorithm development. For the baseline method, RFR is proposed to equip existing powerful CNN-based methods with long-term temporal dependency exploitation and integrated motion compensation and MIRST detection. Specifically, a pyramid deformable alignment (PDA) module and a temporal-spatial-frequency modulation (TSFM) module are proposed to achieve effective and efficient feature alignment, propagation, aggregation and refinement. Extensive experiments have been conducted to demonstrate the effectiveness and superiority of our scheme. The comparative results show that ResUNet equipped with RFR outperforms the state-of-the-art MIRST detection methods. Dataset and code are released at https://github.com/XinyiYing/RFR.
Submitted 4 October, 2024; v1 submitted 18 September, 2024;
originally announced September 2024.
-
A cytokine-enhanced viral infection model with CTL immune response, distributed delay and saturation incidence
Authors:
Xiaodong Cao,
Songbo Hou,
Xiaoqing Kong
Abstract:
In this paper, we propose a delayed cytokine-enhanced viral infection model incorporating saturation incidence and immune response. We compute the basic reproduction numbers and introduce a convex cone to discuss the impact of non-negative initial data on solutions. By defining appropriate Lyapunov functionals and employing LaSalle's invariance principle, we investigate the stability of three equilibria: the disease-free equilibrium, the immunity-inactivated equilibrium, and the immunity-activated equilibrium. We establish conditions under which these equilibria are globally asymptotically stable. Numerical analyses not only corroborate the theoretical results but also reveal that intervention in virus infection can be achieved by extending the delay period.
Submitted 16 September, 2024;
originally announced September 2024.
-
Length Desensitization in Directed Preference Optimization
Authors:
Wei Liu,
Yang Bai,
Chengcheng Han,
Rongxiang Weng,
Jun Xu,
Xuezhi Cao,
Jingang Wang,
Xunliang Cai
Abstract:
Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO's optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We use two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conduct in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with real human preferences.
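The link between implicit reward and response length mentioned above can be illustrated with the standard DPO objective, in which the implicit reward is a scaled difference of sequence log-probabilities and therefore a sum over tokens. The sketch below is an assumption-laden illustration of plain DPO only; it does not implement LD-DPO, whose modification is not specified in this abstract. The function names and the toy per-token log-probabilities are hypothetical.

```python
import math


def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """Standard DPO implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Each sequence log-probability is the sum of per-token log-probabilities,
    so the reward accumulates one term per token.
    """
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))


def dpo_loss(chosen_reward, rejected_reward):
    """Standard DPO loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(x)) == log(1 + exp(-x)); fine for the small margins used here.
    return math.log1p(math.exp(-margin))


# Toy data (hypothetical): same per-token log-ratio, different lengths.
short_reward = implicit_reward([-1.0] * 10, [-1.2] * 10)
long_reward = implicit_reward([-1.0] * 20, [-1.2] * 20)
```

In this toy setting the per-token log-ratio is identical for both responses, yet the longer response receives roughly twice the implicit reward, which is one simple way to see how length alone can influence the preference margin.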
Submitted 10 September, 2024;
originally announced September 2024.
-
SDF-Net: A Hybrid Detection Network for Mediastinal Lymph Node Detection on Contrast CT Images
Authors:
Jiuli Xiong,
Lanzhuju Mei,
Jiameng Liu,
Dinggang Shen,
Zhong Xue,
Xiaohuan Cao
Abstract:
Accurate lymph node detection and quantification are crucial for cancer diagnosis and staging on contrast-enhanced CT images, as they impact treatment planning and prognosis. However, detecting lymph nodes in the mediastinal area poses challenges due to their low contrast, irregular shapes and dispersed distribution. In this paper, we propose a Swin-Det Fusion Network (SDF-Net) to effectively detect lymph nodes. SDF-Net integrates features from both segmentation and detection to enhance the detection capability of lymph nodes with various shapes and sizes. Specifically, an auto-fusion module is designed to merge the feature maps of segmentation and detection networks at different levels. To facilitate effective learning without mask annotations, we introduce a shape-adaptive Gaussian kernel to represent lymph nodes in the training stage and provide more anatomical information for effective learning. Comparative results demonstrate promising performance in addressing the complex lymph node detection problem.
Submitted 10 September, 2024;
originally announced September 2024.
-
LAMP: Learnable Meta-Path Guided Adversarial Contrastive Learning for Heterogeneous Graphs
Authors:
Siqing Li,
Jin-Duk Park,
Wei Huang,
Xin Cao,
Won-Yong Shin,
Zhiqiang Xu
Abstract:
Heterogeneous graph neural networks (HGNNs) have significantly propelled the information retrieval (IR) field. Still, the effectiveness of HGNNs heavily relies on high-quality labels, which are often expensive to acquire. This challenge has shifted attention towards Heterogeneous Graph Contrastive Learning (HGCL), which usually requires pre-defined meta-paths. However, our findings reveal that meta-path combinations significantly affect performance in unsupervised settings, an aspect often overlooked in current literature. Existing HGCL methods have considerable variability in outcomes across different meta-path combinations, thereby challenging the optimization process to achieve consistent and high performance. In response, we introduce \textsf{LAMP} (\underline{\textbf{L}}earn\underline{\textbf{A}}ble \underline{\textbf{M}}eta-\underline{\textbf{P}}ath), a novel adversarial contrastive learning approach that integrates various meta-path sub-graphs into a unified and stable structure, leveraging the overlap among these sub-graphs. To address the denseness of this integrated sub-graph, we propose an adversarial training strategy for edge pruning, maintaining sparsity to enhance model performance and robustness. \textsf{LAMP} aims to maximize the difference between meta-path and network schema views for guiding contrastive learning to capture the most meaningful information. Our extensive experimental study conducted on four diverse datasets from the Heterogeneous Graph Benchmark (HGB) demonstrates that \textsf{LAMP} significantly outperforms existing state-of-the-art unsupervised models in terms of accuracy and robustness.
Submitted 10 September, 2024;
originally announced September 2024.
-
Dissipative Nonlinear Thouless Pumping of Temporal Solitons
Authors:
Xuzhen Cao,
Chunyu Jia,
Ying Hu,
Zhaoxin Liang
Abstract:
The interplay between topology and solitons is a central topic in nonlinear topological physics. So far, most studies have been confined to conservative settings. Here, we explore Thouless pumping of dissipative temporal solitons in a nonconservative one-dimensional optical system with gain and spectral filtering, described by the paradigmatic complex Ginzburg-Landau equation. Two dissipatively induced nonlinear topological phase transitions are identified. First, when varying dissipative parameters across a threshold, the soliton transitions from being trapped in time to quantized drifting. This quantized temporal drift remains robust, even as the system evolves from a single-soliton state into a multi-soliton state. Second, a dynamically emergent phase transition is found: the soliton is arrested until a critical point of its evolution, where a transition to topological drift occurs. Both phenomena uniquely arise from the dynamical interplay of dissipation, nonlinearity and topology.
Submitted 5 September, 2024;
originally announced September 2024.
-
Improved Diversity-Promoting Collaborative Metric Learning for Recommendation
Authors:
Shilong Bao,
Qianqian Xu,
Zhiyong Yang,
Yuan He,
Xiaochun Cao,
Qingming Huang
Abstract:
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems (RS), closing the gap between metric learning and collaborative filtering. Following the convention of RS, existing practices exploit unique user representation in their model design. This paper focuses on a challenging scenario where a user has multiple categories of interests. Under this setting, the unique user representation might induce preference bias, especially when the item category distribution is imbalanced. To address this issue, we propose a novel method called \textit{Diversity-Promoting Collaborative Metric Learning} (DPCML), with the hope of considering the commonly ignored minority interest of the user. The key idea behind DPCML is to introduce a set of multiple representations for each user in the system where users' preference toward an item is aggregated by taking the minimum item-user distance among their embedding set. Specifically, we instantiate two effective assignment strategies to explore a proper quantity of vectors for each user. Meanwhile, a \textit{Diversity Control Regularization Scheme} (DCRS) is developed to accommodate the multi-vector representation strategy better. Theoretically, we show that DPCML could induce a smaller generalization error than traditional CML. Furthermore, we notice that CML-based approaches usually require \textit{negative sampling} to reduce the heavy computational burden caused by the pairwise objective therein. In this paper, we reveal the fundamental limitation of the widely adopted hard-aware sampling from the One-Way Partial AUC (OPAUC) perspective and then develop an effective sampling alternative for the CML-based paradigm. Finally, comprehensive experiments over a range of benchmark datasets speak to the efficacy of DPCML. Code is available at \url{https://github.com/statusrank/LibCML}.
Submitted 2 September, 2024;
originally announced September 2024.
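The aggregation rule stated in the DPCML abstract above (a user's preference for an item is taken as the minimum item-user distance over that user's set of embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vectors, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math


def min_distance_score(user_vectors, item_vector):
    """Return the smallest distance between the item and any of the user's vectors.

    Euclidean distance is an assumption here; the abstract only says "distance".
    """
    return min(math.dist(u, item_vector) for u in user_vectors)


def rank_items(user_vectors, items):
    """Sort item ids so that the item closest to any user vector comes first."""
    return sorted(items, key=lambda item_id: min_distance_score(user_vectors, items[item_id]))


# Hypothetical toy data: one user with two interest vectors, three items.
user_vectors = [(0.0, 0.0), (10.0, 10.0)]
items = {"a": (0.0, 1.0), "b": (8.0, 10.0), "c": (5.0, 5.0)}

ranking = rank_items(user_vectors, items)
```

With a single user vector, item "b" would be far from a user centred at the origin; with two vectors it ranks near the top because it lies close to the second interest, which is the minority-interest effect the abstract describes.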
-
A Hybrid Transformer-Mamba Network for Single Image Deraining
Authors:
Shangquan Sun,
Wenqi Ren,
Juxiang Zhou,
Jianhou Gan,
Rui Wang,
Xiaochun Cao
Abstract:
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions, limiting the exploitation of non-local receptive fields. In response to this issue, we introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies. Based on the prior of distinct spectral-domain features of rain degradation and background, we design spectral-banded Transformer blocks on the first branch. Self-attention is executed within the combination of the spectral-domain channel dimension to improve the ability to model long-range dependencies. To enhance frequency-specific information, we present a spectral enhanced feed-forward module that aggregates features in the spectral domain. In the second branch, Mamba layers are equipped with cascaded bidirectional state space model modules to additionally capture the modeling of both local and global information. At each stage of both the encoder and decoder, we perform channel-wise concatenation of dual-branch features and achieve feature fusion through channel reduction, enabling more effective integration of the multi-scale information from the Transformer and Mamba branches. To better reconstruct innate signal-level relations within clean images, we also develop a spectral coherence loss. Extensive experiments on diverse datasets and real-world images demonstrate the superiority of our method compared against the state-of-the-art approaches.
Submitted 31 August, 2024;
originally announced September 2024.
-
Subspace Diffusion Posterior Sampling for Travel-Time Tomography
Authors:
Xiang Cao,
Xiaoqun Zhang
Abstract:
Diffusion models have been widely studied as effective generative tools for solving inverse problems. The main ideas focus on performing the reverse sampling process conditioned on noisy measurements, using well-established numerical solvers for gradient updates. Although diffusion-based sampling methods can produce high-quality reconstructions, challenges persist in nonlinear PDE-based inverse problems and sampling speed. In this work, we explore solving PDE-based travel-time tomography based on subspace diffusion generative models. Our main contributions are twofold: First, we propose a posterior sampling process for PDE-based inverse problems by solving the associated adjoint-state equation. Second, we resorted to the subspace-based dimension reduction technique for conditional sampling acceleration, enabling solving the PDE-based inverse problems from coarse to refined grids. Our numerical experiments showed satisfactory advancements in improving the travel-time imaging quality and reducing the sampling time for reconstruction.
Submitted 30 August, 2024;
originally announced August 2024.
-
A comprehensive study of axion photoproduction off the nucleon in chiral effective field theory
Authors:
Xiong-Hui Cao,
Zhi-Hui Guo
Abstract:
We calculate the amplitudes of the axion photoproduction off the nucleon, i.e., $γN \to a N$, within the framework of chiral effective field theory. Several different types of contributions are simultaneously included in our calculation, namely the nucleon exchanges up to next-to-leading order, the $aγγ$ vertex and the vector meson exchanges in the $t$-channel. We utilize the existing hadronic inputs as much as possible to fix the unknown couplings. A comprehensive study of the phenomenological discussions is then provided in this work. Different mechanisms in the $γN \to a N$ processes manifest distinct behaviors in the total and differential cross sections, which could provide useful quantities to distinguish different axion models.
Submitted 28 August, 2024;
originally announced August 2024.
-
Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection
Authors:
Siyuan Yao,
Hao Sun,
Tian-Zhu Xiang,
Xiao Wang,
Xiaochun Cao
Abstract:
Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at https://github.com/Garyson1204/HGINet.
Submitted 21 September, 2024; v1 submitted 27 August, 2024;
originally announced August 2024.
-
Bubble $^{36}$Ar and Its New Breathing Modes
Authors:
Ge Ren,
Chun-Wang Ma,
Xi-Guang Cao,
Yu-Gang Ma
Abstract:
The bubble nuclei are important components of exotic nuclear structures characterized by special depletions of central densities. Focusing on bubble structures of $^{36}$Ar, the characterizations of bubble nuclei were explored with the framework of the extended quantum molecular dynamics model. Three density distribution modes were uncovered for the first time, i.e. micro-bubble, bubble, and cluster resonances, which show unique spectral signatures compared to the monopole resonance spectrum as the excitation intensity was increased in bubbles. Of pivotal importance is the revelation that the bubble mode's oscillation frequency closely resembles macroscopic bubble dynamics, building a connection between classical macroscopic phenomena and the quantum complexity of the nuclear structure. The discovery marks a crucial step forward in deciphering the relationship between classical and quantum domains within the enigmatic world of atomic nuclei.
Submitted 25 August, 2024;
originally announced August 2024.
-
Optical Inversion Using Plasmonic Contrast Agents
Authors:
Xinlin Cao,
Ahcene Ghandriche,
Mourad Sini
Abstract:
We describe a new method to reconstruct the permittivity distribution of an object to be imaged from the remotely measured electromagnetic field. We propose to use the remote fields measured before and after locally injecting plasmonic nano-particles into the medium. Such a technique is known in the framework of imaging using contrast agents where, in optical imaging, the nano-particles play the role of these contrast agents. The plasmonic nano-particles are known to exhibit resonant effects, such as enhancing the applied incident field, when excited at certain particular frequencies called plasmonic resonances. These resonant frequencies encode the values of the unknown permittivity at the location of the injected nano-particles. The imaging methods we propose mainly use this resonant effect. We show that the imaging functional built up from contrasting the fields before and after injecting the nano-particles, measured at one single back-scattered direction and in an explicit band of incident frequencies, reaches its maximum values, in terms of the incident frequency, precisely at the mentioned plasmonic resonances. Such a behavior allows us to recover these plasmonic resonances, from which we recover the point-wise values of the permittivity distribution.
In this work, we describe the method and provide the mathematical justification of this resonant effect and its use for the optical inversion using plasmonic nano-particles as contrast agents.
Submitted 25 August, 2024;
originally announced August 2024.
-
Vision Transformer Neural Quantum States for Impurity Models
Authors:
Xiaodong Cao,
Zhicheng Zhong,
Yi Lu
Abstract:
Transformer neural networks, known for their ability to recognize complex patterns in high-dimensional data, offer a promising framework for capturing many-body correlations in quantum systems. We employ an adapted Vision Transformer (ViT) architecture to model quantum impurity models, optimizing it with a subspace expansion scheme that surpasses conventional variational Monte Carlo in both accuracy and efficiency. Benchmarks against matrix product states in single- and three-orbital Anderson impurity models show that these ViT-based neural quantum states achieve comparable or superior accuracy with significantly fewer variational parameters. We further extend our approach to compute dynamical quantities by constructing a restricted excitation space that effectively captures relevant physical processes, yielding accurate core-level X-ray absorption spectra. These findings highlight the potential of ViT-based neural quantum states for accurate and efficient modeling of quantum impurity models.
Submitted 23 August, 2024;
originally announced August 2024.
-
Gravitational form factor $D$ of charmonium from shear stress
Authors:
Tianyang Hu,
Xianghui Cao,
Siqi Xu,
Yang Li,
Xingbo Zhao,
James P. Vary
Abstract:
Based on our recent analysis of the hadronic matrix element of the stress-energy tensor in covariant light front dynamics, we extract the charmonium gravitational form factor $D(Q^2)$ from shear stress $T^{12}$. This is in contrast to our recent work using the (light-front) energy density $T^{+-}$. Indeed, by comparing these two currents, we identify terms that are responsible for the violation of the current conservation. Numerical results based on basis light-front quantization show that the violation effects are small and the $D$-terms extracted from the two currents are close to each other, hence validating our previous work using $T^{+-}$.
Submitted 18 August, 2024;
originally announced August 2024.
-
Dissecting a strongly coupled scalar nucleon
Authors:
Xianghui Cao,
Yang Li,
James P. Vary
Abstract:
We continue our investigation of the stress within a strongly coupled scalar nucleon, and now dissect the gravitational form factors into contributions from its constituents, the (mock) nucleon and the (mock) pion. The computation is based on a non-perturbative solution of the scalar Yukawa model in the light-front Hamiltonian formalism with a Fock sector expansion including up to one nucleon and two pions. By employing the ``good currents'' $T^{++}_i$, $T^{+-}_i$ and $T^{12}_i$, we extract the full set of gravitational form factors $A_i$, $D_i$, $\bar c_i$ without the contamination of the spurious form factors, and free of uncanceled UV divergences. With these results, we decompose the mass of the system into its constituents and compute the matter and mechanical radii, gaining insights into the strongly coupled system.
Submitted 18 August, 2024;
originally announced August 2024.
-
SSNeRF: Sparse View Semi-supervised Neural Radiance Fields with Augmentation
Authors:
Xiao Cao,
Beibei Lin,
Bo Wang,
Zhiyong Huang,
Robby T. Tan
Abstract:
Sparse view NeRF is challenging because limited input images lead to an under-constrained optimization problem for volume rendering. Existing methods address this issue by relying on supplementary information, such as depth maps. However, generating this supplementary information accurately remains problematic and often leads to NeRF producing images with undesired artifacts. To address these artifacts and enhance robustness, we propose SSNeRF, a sparse view semi-supervised NeRF method based on a teacher-student framework. Our key idea is to challenge the NeRF module with progressively severe sparse view degradation while providing high confidence pseudo labels. This approach helps the NeRF model become aware of noise and incomplete information associated with sparse views, thus improving its robustness. The novelty of SSNeRF lies in its sparse view specific augmentations and semi-supervised learning mechanism. In this approach, the teacher NeRF generates novel views along with confidence scores, while the student NeRF, perturbed by the augmented input, learns from the high confidence pseudo labels. Our sparse view degradation augmentation progressively injects noise into volume rendering weights, perturbs feature maps in vulnerable layers, and simulates sparse view blurriness. These augmentation strategies force the student NeRF to recognize degradation and produce clearer rendered views. By transferring the student's parameters to the teacher, the teacher gains increased robustness in subsequent training iterations. Extensive experiments demonstrate the effectiveness of our SSNeRF in generating novel views with less sparse view degradation. We will release code upon acceptance.
Submitted 17 August, 2024;
originally announced August 2024.
-
TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning
Authors:
Miaoge Li,
Jingcai Guo,
Richard Yi Da Xu,
Dongsheng Wang,
Xiaofeng Cao,
Song Guo
Abstract:
Compositional Zero-Shot Learning (CZSL) aims to recognize novel \textit{state-object} compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, effectively calibrating the bias between semantically similar multimodal representations, as well as generalizing pre-trained knowledge to novel compositional contexts, remains an enduring challenge. In this paper, our interest is to revisit the conditional transport (CT) theory and its homology to the visual-semantics interaction in CZSL and further, propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that well-addresses these issues. Concretely, we utilize three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs to minimize their semantic discrepancies. To further ensure the consistency transfer within these sets, we implement a cycle-consistency constraint that refines the learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to effectively filter out unfeasible pairs, thereby speeding up the inference as well as increasing the accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method.
Submitted 22 August, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
HAIR: Hypernetworks-based All-in-One Image Restoration
Authors:
Jin Cao,
Yi Cao,
Li Pang,
Deyu Meng,
Xiangyong Cao
Abstract:
Image restoration aims to recover a high-quality clean image from its degraded version. Recent progress in image restoration has demonstrated the effectiveness of All-in-One image restoration models in addressing various unknown degradations simultaneously. However, these existing methods typically utilize the same parameters to tackle images with different types of degradation, forcing the model to balance the performance between different tasks and limiting its performance on each task. To alleviate this issue, we propose HAIR, a Hypernetworks-based All-in-One Image Restoration plug-and-play method that generates parameters based on the input image and thus enables the model to adapt to specific degradations dynamically. Specifically, HAIR consists of two main components, i.e., Classifier and Hyper Selecting Net (HSN). The Classifier is a simple image classification network used to generate a Global Information Vector (GIV) that contains the degradation information of the input image, and the HSN is a simple fully-connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments demonstrate that HAIR can significantly improve the performance of existing image restoration models in a plug-and-play manner, both in single-task and All-in-One settings. Notably, our proposed model Res-HAIR, which integrates HAIR into the well-known Restormer, can obtain superior or comparable performance compared with current state-of-the-art methods. Moreover, we theoretically demonstrate that to achieve a given small enough error, our proposed HAIR requires fewer parameters in contrast to mainstream embedding-based All-in-One methods. The code is available at https://github.com/toummHus/HAIR.
Submitted 15 October, 2024; v1 submitted 15 August, 2024;
originally announced August 2024.
-
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Authors:
Enneng Yang,
Li Shen,
Guibing Guo,
Xingwei Wang,
Xiaochun Cao,
Jie Zhang,
Dacheng Tao
Abstract:
Model merging is an efficient empowerment technique in the machine learning community that does not require the collection of raw training data and does not require expensive computation. As model merging becomes increasingly prevalent across various fields, it is crucial to understand the available model merging techniques comprehensively. However, there is a significant gap in the literature regarding a systematic and thorough review of these techniques. This survey provides a comprehensive overview of model merging methods and theories, their applications in various domains and settings, and future research directions. Specifically, we first propose a new taxonomic approach that exhaustively discusses existing model merging methods. Secondly, we discuss the application of model merging techniques in large language models, multimodal large language models, and 10+ machine learning subfields, including continual learning, multi-task learning, few-shot learning, etc. Finally, we highlight the remaining challenges of model merging and discuss future research directions. A comprehensive list of papers about model merging is available at \url{https://github.com/EnnengYang/Awesome-Model-Merging-Methods-Theories-Applications}.
Submitted 5 September, 2024; v1 submitted 14 August, 2024;
originally announced August 2024.
-
DiffSG: A Generative Solver for Network Optimization with Diffusion Model
Authors:
Ruihuai Liang,
Bo Yang,
Zhiwen Yu,
Bin Guo,
Xuelin Cao,
Mérouane Debbah,
H. Vincent Poor,
Chau Yuen
Abstract:
Diffusion generative models, famous for their performance in image generation, are popular in various cross-domain applications. However, their use in the communication community has been mostly limited to auxiliary tasks like data modeling and feature extraction. These models hold greater promise for fundamental problems in network optimization compared to traditional machine learning methods. Discriminative deep learning often falls short due to its single-step input-output mapping and lack of global awareness of the solution space, especially given the complexity of network optimization's objective functions. In contrast, diffusion generative models can consider a broader range of solutions and exhibit stronger generalization by learning parameters that describe the distribution of the underlying solution space, with higher probabilities assigned to better solutions. We propose a new framework, Diffusion Model-based Solution Generation (DiffSG), which leverages the intrinsic distribution learning capabilities of diffusion generative models to learn high-quality solution distributions based on given inputs. The optimal solution within this distribution is highly probable, allowing it to be effectively reached through repeated sampling. We validate the performance of DiffSG on several typical network optimization problems, including mixed-integer non-linear programming, convex optimization, and hierarchical non-convex optimization. Our results show that DiffSG outperforms existing baselines. In summary, we demonstrate the potential of diffusion generative models in tackling complex network optimization problems and outline a promising path for their broader application in the communication community.
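The "repeated sampling" step mentioned in the abstract can be sketched as a best-of-N selection loop. This is only an illustration under stated assumptions: a Gaussian sampler stands in for a trained diffusion model, and the names `best_of_n`, `toy_sampler`, `toy_objective`, and `toy_feasible`, along with the toy problem itself, are hypothetical and not taken from DiffSG.

```python
import random


def best_of_n(sampler, objective, is_feasible, n_samples, seed=0):
    """Draw n_samples candidates and keep the feasible one with the lowest objective."""
    rng = random.Random(seed)
    best, best_value = None, float("inf")
    for _ in range(n_samples):
        candidate = sampler(rng)
        if not is_feasible(candidate):
            continue  # discard samples that violate the constraints
        value = objective(candidate)
        if value < best_value:
            best, best_value = candidate, value
    return best, best_value


# Toy problem (assumption): minimize (x - 2)^2 subject to 0 <= x <= 3.
def toy_objective(x):
    return (x - 2.0) ** 2


def toy_feasible(x):
    return 0.0 <= x <= 3.0


# Stand-in for a learned solution distribution that concentrates near good solutions.
def toy_sampler(rng):
    return rng.gauss(2.0, 0.5)


solution, value = best_of_n(toy_sampler, toy_objective, toy_feasible, n_samples=200)
```

Because the sampler places most of its mass near good solutions, drawing more samples can only keep or improve the best feasible candidate found so far for a fixed seed.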
Submitted 13 August, 2024;
originally announced August 2024.