-
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
Authors:
Wei An,
Xiao Bi,
Guanting Chen,
Shanhuang Chen,
Chengqi Deng,
Honghui Ding,
Kai Dong,
Qiushi Du,
Wenjun Gao,
Kang Guan,
Jianzhong Guo,
Yongqiang Guo,
Zhe Fu,
Ying He,
Panpan Huang,
Jiashi Li,
Wenfeng Liang,
Xiaodong Liu,
Xin Liu,
Yiyuan Liu,
Yuxuan Liu,
Shanghao Lu,
Xuan Lu,
Xiaotao Nie,
Tian Pei,
et al. (27 additional authors not shown)
Abstract:
The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands for computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
Submitted 31 August, 2024; v1 submitted 26 August, 2024;
originally announced August 2024.
-
BihoT: A Large-Scale Dataset and Benchmark for Hyperspectral Camouflaged Object Tracking
Authors:
Hanzheng Wang,
Wei Li,
Xiang-Gen Xia,
Qian Du
Abstract:
Hyperspectral object tracking (HOT) has exhibited potential in various applications, particularly in scenes where objects are camouflaged. Existing trackers can effectively retrieve objects via band regrouping because of the bias in existing HOT datasets, where most objects tend to have distinguishing visual appearances rather than spectral characteristics. This bias allows the tracker to directly use the visual features obtained from the false-color images generated by hyperspectral images without the need to extract spectral features. To tackle this bias, we find that the tracker should focus on the spectral information when object appearance is unreliable. Thus, we provide a new task called hyperspectral camouflaged object tracking (HCOT) and meticulously construct a large-scale HCOT dataset, termed BihoT, which consists of 41,912 hyperspectral images covering 49 video sequences. The dataset covers various artificial camouflage scenes where objects have similar appearances, diverse spectrums, and frequent occlusion, making it a very challenging dataset for HCOT. Besides, a simple but effective baseline model, named spectral prompt-based distractor-aware network (SPDAN), is proposed, comprising a spectral embedding network (SEN), a spectral prompt-based backbone network (SPBN), and a distractor-aware module (DAM). Specifically, the SEN extracts spectral-spatial features via 3-D and 2-D convolutions. Then, the SPBN fine-tunes powerful RGB trackers with spectral prompts and alleviates the insufficiency of training samples. Moreover, the DAM utilizes a novel statistic to capture the distractor caused by occlusion from objects and background. Extensive experiments demonstrate that our proposed SPDAN achieves state-of-the-art performance on the proposed BihoT and other HOT datasets.
Submitted 22 August, 2024;
originally announced August 2024.
-
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Authors:
Chenglong Wang,
Yang Gan,
Yifu Huo,
Yongyu Mu,
Murun Yang,
Qiaozhi He,
Tong Xiao,
Chunliang Zhang,
Tongran Liu,
Quan Du,
Di Yang,
Jingbo Zhu
Abstract:
Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues like generating misleading content without proper visual context (also known as hallucination). A promising solution to this problem is using human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques face the difficulty arising from the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue the line of research. We present a Robust Visual Reward Model (RoVRM) which improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through a three-phase progressive training and optimal transport-based preference data selection to effectively mitigate the scarcity of visual preference data. We experiment with RoVRM on the commonly used vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.
Submitted 21 August, 2024;
originally announced August 2024.
-
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
Authors:
Huajian Xin,
Z. Z. Ren,
Junxiao Song,
Zhihong Shao,
Wanjia Zhao,
Haocheng Wang,
Bo Liu,
Liyue Zhang,
Xuan Lu,
Qiushi Du,
Wenjun Gao,
Qihao Zhu,
Dejian Yang,
Zhibin Gou,
Z. F. Wu,
Fuli Luo,
Chong Ruan
Abstract:
We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark ($63.5\%$) and the undergraduate level ProofNet benchmark ($25.3\%$).
Submitted 15 August, 2024;
originally announced August 2024.
-
Ginzburg--Landau Functionals in the Large-Graph Limit
Authors:
Edith Zhang,
James Scott,
Qiang Du,
Mason A. Porter
Abstract:
Ginzburg--Landau (GL) functionals on graphs, which are relaxations of graph-cut functionals on graphs, have yielded a variety of insights in image segmentation and graph clustering. In this paper, we study large-graph limits of GL functionals by taking a functional-analytic view of graphs as nonlocal kernels. For a graph $W_n$ with $n$ nodes, the corresponding graph GL functional $\GL^{W_n}_\ep$ is an energy for functions on $W_n$. We minimize GL functionals on sequences of growing graphs that converge to functions called graphons. For such sequences of graphs, we show that the graph GL functional $\Gamma$-converges to a continuous and nonlocal functional that we call the \emph{graphon GL functional}. We also investigate the sharp-interface limits of the graph GL and graphon GL functionals, and we relate these limits to a nonlocal total variation. We express the limiting GL functional in terms of Young measures and thereby obtain a probabilistic interpretation of the variational problem in the large-graph limit. Finally, to develop intuition about the graphon GL functional, we compute the GL minimizer for several example families of graphons.
Submitted 1 August, 2024;
originally announced August 2024.
-
Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot Autonomy
Authors:
Chen Wang,
Kaiyi Ji,
Junyi Geng,
Zhongqiang Ren,
Taimeng Fu,
Fan Yang,
Yifan Guo,
Haonan He,
Xiangyu Chen,
Zitong Zhan,
Qiwei Du,
Shaoshu Su,
Bowen Li,
Yuheng Qiu,
Yi Du,
Qihang Li,
Yifan Yang,
Xiao Lin,
Zhipeng Zhao
Abstract:
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neural-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.
Submitted 6 August, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Authors:
DeepSeek-AI,
Qihao Zhu,
Daya Guo,
Zhihong Shao,
Dejian Yang,
Peiyi Wang,
Runxin Xu,
Y. Wu,
Yukun Li,
Huazuo Gao,
Shirong Ma,
Wangding Zeng,
Xiao Bi,
Zihui Gu,
Hanwei Xu,
Damai Dai,
Kai Dong,
Liyue Zhang,
Yishi Piao,
Zhibin Gou,
Zhenda Xie,
Zhewen Hao,
Bingxuan Wang,
Junxiao Song,
Deli Chen,
et al. (15 additional authors not shown)
Abstract:
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
Submitted 17 June, 2024;
originally announced June 2024.
-
Object-Attribute-Relation Representation based Video Semantic Communication
Authors:
Qiyuan Du,
Yiping Duan,
Qianqian Yang,
Xiaoming Tao,
Mérouane Debbah
Abstract:
With the rapid growth of multimedia data volume, there is an increasing need for efficient video transmission in applications such as virtual reality and future video streaming services. Semantic communication is emerging as a vital technique for ensuring efficient and reliable transmission in low-bandwidth, high-noise settings. However, most current approaches focus on joint source-channel coding (JSCC) that depends on end-to-end training. These methods often lack an interpretable semantic representation and struggle with adaptability to various downstream tasks. In this paper, we introduce the use of object-attribute-relation (OAR) as a semantic framework for videos to facilitate low bit-rate coding and enhance the JSCC process for more effective video transmission. We utilize OAR sequences for both low bit-rate representation and generative video reconstruction. Additionally, we incorporate OAR into the image JSCC model to prioritize communication resources for areas more critical to downstream tasks. Our experiments on traffic surveillance video datasets assess the effectiveness of our approach in terms of video transmission performance. The empirical findings demonstrate that our OAR-based video coding method not only outperforms H.265 coding at lower bit-rates but also synergizes with JSCC to deliver robust and efficient video transmission.
Submitted 14 June, 2024;
originally announced June 2024.
-
EAIA: An Efficient and Anonymous Identity Authentication Scheme in 5G-V2V
Authors:
Qianmin Du,
Jianhong Zhou,
Maode Ma
Abstract:
Vehicle Ad-hoc Networks (VANETs) have experienced significant development in recent years, playing a crucial role in enhancing the driving experience by enabling safer and more efficient inter-vehicle interactions through information exchange. Vehicle-to-vehicle (V2V) communication is particularly vital as it not only helps to prevent collisions and improve traffic efficiency but also provides essential situational awareness to drivers or autonomous driving systems. Communication is typically supported by Roadside Units (RSUs); however, in practical applications, vehicles may exceed the communication range of RSUs, thus exposing them to various malicious attacks. Additionally, considering the limited computational resources of onboard units (OBUs) in vehicles, there is a high demand for designing lightweight security protocols that support V2V communication. To address this issue, this paper proposes an efficient anonymous V2V identity authentication protocol tailored for scenarios that lack RSU support. The proposed protocol has been formally assessed using the Scyther tool, demonstrating its capability to withstand major typical malicious attacks. Performance evaluations indicate that the proposed protocol is efficient in terms of communication and computational overhead, making it a viable solution for V2V vehicle communication.
Submitted 7 June, 2024;
originally announced June 2024.
-
MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling
Authors:
Diwei Huang,
Kunyang Lin,
Peihao Chen,
Qing Du,
Mingkui Tan
Abstract:
Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response (RIR) in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework by constructing acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound, and maps provide explicit structural regularities of sound propagation, which are valuable for modeling environment acoustics. We thus extract pixel-wise semantic features derived from observations and project them into a top-down map, namely the **observation semantic map**. This map contains the relative positional information among points and the semantic feature information associated with each point. Yet, the limited information extracted from few-shot observations on the map is not sufficient for understanding and modeling the whole scene. We address the challenge by generating a **scene semantic map** via diffusing features and anticipating the observation semantic map. The scene semantic map then interacts with echo encoding by a transformer-based encoder-decoder to predict RIR for arbitrary speaker-listener query pairs. Extensive experiments on the Matterport3D and Replica datasets verify the efficacy of our framework.
Submitted 22 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Deng,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding,
et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Social Force Embedded Mixed Graph Convolutional Network for Multi-class Trajectory Prediction
Authors:
Quancheng Du,
Xiao Wang,
Shouguo Yin,
Lingxi Li,
Huansheng Ning
Abstract:
Accurate prediction of agent motion trajectories is crucial for autonomous driving, contributing to the reduction of collision risks in human-vehicle interactions and ensuring ample response time for other traffic participants. Current research predominantly focuses on traditional deep learning methods, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These methods leverage relative distances to forecast the motion trajectories of a single class of agents. However, in complex traffic scenarios, the motion patterns of various types of traffic participants exhibit inherent randomness and uncertainty. Relying solely on relative distances may not adequately capture the nuanced interaction patterns between different classes of road users. In this paper, we propose a novel multi-class trajectory prediction method named the social force embedded mixed graph convolutional network (SFEM-GCN). SFEM-GCN comprises three graph topologies: the semantic graph (SG), position graph (PG), and velocity graph (VG). These graphs encode various social force relationships among different classes of agents in complex scenes. Specifically, SG utilizes one-hot encoding of agent-class information to guide the construction of graph adjacency matrices based on semantic information. PG and VG create adjacency matrices to capture motion interaction relationships between different classes of agents. These graph structures are then integrated into a mixed graph, where learning is conducted using a spatiotemporal graph convolutional neural network (ST-GCNN). To further enhance prediction performance, we adopt temporal convolutional networks (TCNs) to generate the predicted trajectory with fewer parameters. Experimental results on publicly available datasets demonstrate that SFEM-GCN surpasses state-of-the-art methods in terms of accuracy and robustness.
Submitted 20 April, 2024;
originally announced April 2024.
-
S4TP: Social-Suitable and Safety-Sensitive Trajectory Planning for Autonomous Vehicles
Authors:
Xiao Wang,
Ke Tang,
Xingyuan Dai,
Jintao Xu,
Quancheng Du,
Rui Ai,
Yuxiao Wang,
Weihao Gu
Abstract:
On public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), which exhibit uncertain driving behavior due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in social interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and safety-sensitive trajectory planning (S4TP) framework. Specifically, S4TP integrates the Social-Aware Trajectory Prediction (SATP) and Social-Aware Driving Risk Field (SADRF) modules. SATP utilizes Transformers to effectively encode the driving scene and incorporates an AV's planned trajectory during the prediction decoding process. SADRF assesses the expected surrounding risk degrees during AV-HDV interactions, each with different social characteristics, visualized as two-dimensional heat maps centered on the AV. SADRF models the driving intentions of the surrounding HDVs and predicts trajectories based on the representation of vehicular interactions. S4TP employs an optimization-based approach for motion planning, utilizing the predicted HDVs' trajectories as input. With the integration of SADRF, S4TP executes real-time online optimization of the planned trajectory of the AV within low-risk regions, thus improving the safety and the interpretability of the planned trajectory. We have conducted comprehensive tests of the proposed method using the SMARTS simulator. Experimental results in complex social scenarios, such as unprotected left turn intersections, merging, cruising, and overtaking, validate the superiority of our proposed S4TP in terms of safety and rationality. S4TP achieves a pass rate of 100% across all scenarios, surpassing the current state-of-the-art methods Fanta (98.25%) and Predictive-Decision (94.75%).
Submitted 18 April, 2024;
originally announced April 2024.
-
Single-temporal Supervised Remote Change Detection for Domain Generalization
Authors:
Qiangang Du,
Jinlong Peng,
Xu Chen,
Qingdong He,
Liren He,
Qiang Nie,
Wenbing Zhu,
Mingmin Chi,
Yabiao Wang,
Chengjie Wang
Abstract:
Change detection is widely applied in remote sensing image analysis. Existing methods require training models separately for each dataset, which leads to poor domain generalization. Moreover, these methods rely heavily on large amounts of high-quality pair-labelled data for training, which is expensive and impractical. In this paper, we propose a multimodal contrastive learning method (ChangeCLIP) based on visual-language pre-training for change detection domain generalization. Additionally, we propose a dynamic context optimization for prompt learning. Meanwhile, to address the data dependency issue of existing methods, we introduce a single-temporal and controllable AI-generated training strategy (SAIN). This allows us to train the model using a large number of single-temporal images without image pairs in the real world, achieving excellent generalization. Extensive experiments on a series of real change detection datasets validate the superiority and strong generalization of ChangeCLIP, which outperforms state-of-the-art change detection methods. Code will be available.
Submitted 23 April, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Leveraging Fine-Grained Information and Noise Decoupling for Remote Sensing Change Detection
Authors:
Qiangang Du,
Jinlong Peng,
Changan Wang,
Xu Chen,
Qingdong He,
Wenbing Zhu,
Mingmin Chi,
Yabiao Wang,
Chengjie Wang
Abstract:
Change detection aims to identify remote sensing object changes by analyzing data between bitemporal image pairs. Due to the large temporal and spatial span of data collection in change detection image pairs, there is often a significant amount of task-specific and task-agnostic noise. Previous efforts have focused excessively on denoising, which comes with a great loss of fine-grained information. In this paper, we revisit the importance of fine-grained features in change detection and propose a series of operations for fine-grained information compensation and noise decoupling (FINO). First, the context is utilized to compensate for the fine-grained information in the feature space. Next, a shape-aware and a brightness-aware module are designed to improve the capacity for representation learning. The shape-aware module guides the backbone network in extracting object shape features for more precise shape estimation. The brightness-aware module learns an overall brightness estimation to improve the model's robustness to task-agnostic noise. Finally, a task-specific noise decoupling structure is designed as a way to improve the model's ability to separate noise interference from feature similarity. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results in multiple change detection benchmarks. The code will be made available.
Submitted 21 June, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
AlphaFin: Benchmarking Financial Analysis with Retrieval-Augmented Stock-Chain Framework
Authors:
Xiang Li,
Zhenyu Li,
Chen Shi,
Yong Xu,
Qing Du,
Mingkui Tan,
Jun Huang,
Wei Lin
Abstract:
The task of financial analysis primarily encompasses two key areas: stock trend prediction and the corresponding financial question answering. Currently, machine learning and deep learning algorithms (ML&DL) have been widely applied for stock trend predictions, leading to significant progress. However, these methods fail to provide reasons for predictions, lacking interpretability and reasoning processes. Also, they cannot integrate textual information such as financial news or reports. Meanwhile, large language models (LLMs) have remarkable textual understanding and generation ability. However, due to the scarcity of financial training datasets and limited integration with real-time knowledge, LLMs still suffer from hallucinations and are unable to keep up with the latest information. To tackle these challenges, we first release AlphaFin datasets, combining traditional research datasets, real-time financial data, and handwritten chain-of-thought (CoT) data. It has a positive impact on training LLMs for completing financial analysis. We then use AlphaFin datasets to benchmark a state-of-the-art method, called Stock-Chain, for effectively tackling the financial analysis task, which integrates retrieval-augmented generation (RAG) techniques. Extensive experiments are conducted to demonstrate the effectiveness of our framework on financial analysis.
Submitted 19 March, 2024;
originally announced March 2024.
-
Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection
Authors:
Liren He,
Zhengkai Jiang,
Jinlong Peng,
Liang Liu,
Qiangang Du,
Xiaobin Hu,
Wenbing Zhu,
Mingmin Chi,
Yabiao Wang,
Chengjie Wang
Abstract:
In the field of multi-class anomaly detection, reconstruction-based methods derived from single-class anomaly detection face the well-known challenge of "learning shortcuts", wherein the model fails to learn the patterns of normal samples as it should, opting instead for shortcuts such as identity mapping or artificial noise elimination. Consequently, the model becomes unable to reconstruct genuine anomalies as normal instances, resulting in a failure of anomaly detection. To counter this issue, we present a novel unified feature reconstruction-based anomaly detection framework termed RLR (Reconstruct features from a Learnable Reference representation). Unlike previous methods, RLR utilizes learnable reference representations to compel the model to learn normal feature patterns explicitly, thereby preventing the model from succumbing to the "learning shortcuts" issue. Additionally, RLR incorporates locality constraints into the learnable reference to facilitate more effective normal pattern capture and utilizes a masked learnable key attention mechanism to enhance robustness. Evaluation of RLR on the 15-category MVTec-AD dataset and the 12-category VisA dataset shows superior performance compared to state-of-the-art methods under the unified setting. The code of RLR will be publicly available.
Submitted 16 July, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising
Authors:
Shuai Hu,
Feng Gao,
Xiaowei Zhou,
Junyu Dong,
Qian Du
Abstract:
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data. However, simultaneously modeling global and local features is rarely explored to enhance HSI denoising. In this letter, we propose a hybrid convolution and attention network (HCANet), which leverages both the strengths of convolutional neural networks (CNNs) and Transformers. To enhance the modeling of both global and local features, we have devised a convolution and attention fusion module aimed at capturing long-range dependencies and neighborhood spectral correlations. Furthermore, to improve multi-scale information aggregation, we design a multi-scale feed-forward network to enhance denoising performance by extracting features at different scales. Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet. The proposed model is effective in removing various types of complex noise. Our codes are available at https://github.com/summitgao/HCANet.
Submitted 15 March, 2024;
originally announced March 2024.
-
SSF-Net: Spatial-Spectral Fusion Network with Spectral Angle Awareness for Hyperspectral Object Tracking
Authors:
Hanzheng Wang,
Wei Li,
Xiang-Gen Xia,
Qian Du,
Jing Tian
Abstract:
Hyperspectral video (HSV) offers valuable spatial, spectral, and temporal information simultaneously, making it highly suitable for handling challenges such as background clutter and visual similarity in object tracking. However, existing methods primarily focus on band regrouping and rely on RGB trackers for feature extraction, resulting in limited exploration of spectral information and difficulties in achieving complementary representations of object features. In this paper, a spatial-spectral fusion network with spectral angle awareness (SSF-Net) is proposed for hyperspectral (HS) object tracking. Firstly, to address the issue of insufficient spectral feature extraction in existing networks, a spatial-spectral feature backbone ($S^2$FB) is designed. With the spatial and spectral extraction branch, a joint representation of texture and spectrum is obtained. Secondly, a spectral attention fusion module (SAFM) is presented to capture the intra- and inter-modality correlation to obtain the fused features from the HS and RGB modalities. It can incorporate the visual information into the HS spectral context to form a robust representation. Thirdly, to ensure a more accurate response of the tracker to the object position, a spectral angle awareness module (SAAM) investigates the region-level spectral similarity between the template and search images during the prediction stage. Furthermore, we develop a novel spectral angle awareness loss (SAAL) to offer guidance for the SAAM based on similar regions. Finally, to obtain robust tracking results, a weighted prediction method is considered to combine the HS and RGB predicted motions of objects to leverage the strengths of each modality. Extensive experiments on the HOTC dataset demonstrate the effectiveness of the proposed SSF-Net, compared with state-of-the-art trackers.
Submitted 9 March, 2024;
originally announced March 2024.
-
LightSword: A Customized Virtual Reality Exergame for Long-Term Cognitive Inhibition Training in Older Adults
Authors:
Qiuxin Du,
Zhen Song,
Haiyan Jiang,
Xiaoying Wei,
Dongdong Weng,
Mingming Fan
Abstract:
The decline of cognitive inhibition significantly impacts older adults' quality of life and well-being, making it a vital public health problem in today's aging society. Previous research has demonstrated that Virtual reality (VR) exergames have great potential to enhance cognitive inhibition among older adults. However, existing commercial VR exergames were unsuitable for older adults' long-term cognitive training due to the inappropriate cognitive activation paradigm, unnecessary complexity, and unbefitting difficulty levels. To bridge these gaps, we developed a customized VR cognitive training exergame (LightSword) based on Dual-task and Stroop paradigms for long-term cognitive inhibition training among healthy older adults. Subsequently, we conducted an eight-month longitudinal user study with 12 older adults aged 60 years and above to demonstrate the effectiveness of LightSword in improving cognitive inhibition. After the training, the cognitive inhibition abilities of older adults were significantly enhanced, with benefits persisting for 6 months. This result indicated that LightSword has both short-term and long-term effects in enhancing cognitive inhibition. Furthermore, qualitative feedback revealed that older adults exhibited a positive attitude toward long-term training with LightSword, which enhanced their motivation and compliance.
Submitted 7 March, 2024;
originally announced March 2024.
-
A Survey on Data Selection for LLM Instruction Tuning
Authors:
Jiahao Wang,
Bolin Zhang,
Qianlong Du,
Jiajun Zhang,
Dianhui Chu
Abstract:
Instruction tuning is a vital step of training large language models (LLM), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than the quantity during instruction tuning of LLM. Therefore, recently a lot of studies focus on exploring the methods of selecting high-quality subsets from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. Firstly, we introduce the widely used instruction datasets. Then, we propose a new taxonomy of the data selection methods and provide a detailed introduction of recent advances, and the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.
Submitted 4 February, 2024;
originally announced February 2024.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Authors:
DeepSeek-AI,
Xiao Bi,
Deli Chen,
Guanting Chen,
Shanhuang Chen,
Damai Dai,
Chengqi Deng,
Honghui Ding,
Kai Dong,
Qiushi Du,
Zhe Fu,
Huazuo Gao,
Kaige Gao,
Wenjun Gao,
Ruiqi Ge,
Kang Guan,
Daya Guo,
Jianzhong Guo,
Guangbo Hao,
Zhewen Hao,
Ying He,
Wenjie Hu,
Panpan Huang,
Erhang Li
, et al. (63 additional authors not shown)
Abstract:
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
Submitted 5 January, 2024;
originally announced January 2024.
-
Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent
Authors:
Haoran Liao,
Qinyi Du,
Shaohua Hu,
Hao He,
Yanyan Xu,
Jidong Tian,
Yaohui Jin
Abstract:
Large language models (LLMs) face challenges in solving complex mathematical problems that require comprehensive capacities to parse the statements, associate domain knowledge, perform compound logical reasoning, and integrate the intermediate rationales. Tackling all these problems at once could be arduous for LLMs, thus leading to confusion in generation. In this work, we explore the potential of enhancing LLMs with agents by meticulous decomposition and modeling of the mathematical reasoning process. Specifically, we propose a formal description of the mathematical solving and extend LLMs with an agent-based zero-shot framework named Planner-Reasoner-Executor-Reflector (PRER). We further provide and implement two MathAgents that define the logical forms and inherent relations via a pool of actions in different grains and orientations: MathAgent-M adapts its actions to LLMs, while MathAgent-H aligns with humankind. Experiments on miniF2F and MATH have demonstrated the effectiveness of PRER and proposed MathAgents, achieving an increase of 12.3% (53.9% → 66.2%) on miniF2F, 9.2% (49.8% → 59.0%) on MATH, and 13.2% (23.2% → 35.4%) for level-5 problems of MATH against GPT-4. Further analytical results provide more insightful perspectives on exploiting the behaviors of LLMs as agents.
Submitted 16 December, 2023; v1 submitted 14 December, 2023;
originally announced December 2023.
-
MoDS: Model-oriented Data Selection for Instruction Tuning
Authors:
Qianlong Du,
Chengqing Zong,
Jiajun Zhang
Abstract:
Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability of following user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the foundation LLMs. Recently, some studies show that a small number of high-quality instruction data is enough. However, how to select appropriate instruction data for a given LLM is still an open problem. To address this problem, in this paper we present a model-oriented data selection (MoDS) approach, which selects instruction data based on a new criterion considering three aspects: quality, coverage and necessity. First, our approach utilizes a quality evaluation model to filter out the high-quality subset from the original instruction dataset, and then designs an algorithm to further select from the high-quality subset a seed instruction dataset with good coverage. The seed dataset is applied to fine-tune the foundation LLM to obtain an initial instruction-following LLM. Finally, we develop a necessity evaluation model to find the instruction data on which the initial instruction-following LLM performs badly and consider them necessary instructions to further improve the LLMs. In this way, we can get a small high-quality, broad-coverage and high-necessity subset from the original instruction datasets. Experimental results show that the model fine-tuned with 4,000 instruction pairs selected by our approach could perform better than the model fine-tuned with the full original dataset which includes 214k instruction data.
Submitted 27 November, 2023;
originally announced November 2023.
-
SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification
Authors:
Junyan Lin,
Feng Gao,
Xiaocheng Shi,
Junyu Dong,
Qian Du
Abstract:
Masked image modeling (MIM) is a highly popular and effective self-supervised learning method for image understanding. Existing MIM-based methods mostly focus on spatial feature modeling, neglecting spectral feature modeling. Meanwhile, because existing MIM-based methods use Transformers for feature extraction, some local or high-frequency information may get lost. To this end, we propose a spatial-spectral masked auto-encoder (SS-MAE) for HSI and LiDAR/SAR data joint classification. Specifically, SS-MAE consists of a spatial-wise branch and a spectral-wise branch. The spatial-wise branch masks random patches and reconstructs missing pixels, while the spectral-wise branch masks random spectral channels and reconstructs missing channels. Our SS-MAE fully exploits the spatial and spectral representations of the input data. Furthermore, to complement local features in the training stage, we add two lightweight CNNs for feature extraction. Both global and local features are taken into account for feature modeling. To demonstrate the effectiveness of the proposed SS-MAE, we conduct extensive experiments on three publicly available datasets. Extensive experiments on three multi-source datasets verify the superiority of our SS-MAE compared with several state-of-the-art baselines. The source codes are available at https://github.com/summitgao/SS-MAE.
Submitted 7 November, 2023;
originally announced November 2023.
-
ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model
Authors:
Jianghao Chen,
Pu Jian,
Tengxiao Xi,
Dongyi Yi,
Qianlong Du,
Chenglin Ding,
Guibo Zhu,
Chengqing Zong,
Jinqiao Wang,
Jiajun Zhang
Abstract:
During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate the research of LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information of the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicit noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, we can easily utilize an appropriate threshold to select the high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text ChineseWebText, which consists of 1.42 TB and each text is associated with a quality score, facilitating the LLM researchers to choose the data according to the desired quality thresholds. We also release a much cleaner subset of 600 GB Chinese data with the quality exceeding 90%.
Submitted 10 November, 2023; v1 submitted 2 November, 2023;
originally announced November 2023.
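The final selection step described in the ChineseWebText abstract (keep texts whose quality score passes a threshold) can be sketched as follows. This is only an illustration: the `ScoredText` record, the `select_by_quality` helper, the assumption that scores lie in [0, 1], and the sample scores are all hypothetical and are not taken from the EvalWeb tool-chain.

```python
from dataclasses import dataclass


@dataclass
class ScoredText:
    """Hypothetical record: a text plus the score an evaluation model gave it."""
    text: str
    quality: float  # assumed to lie in [0, 1]


def select_by_quality(corpus, threshold):
    """Keep only the texts whose quality score exceeds the threshold."""
    return [item for item in corpus if item.quality > threshold]


# Made-up sample data, for illustration only.
corpus = [
    ScoredText("高质量的中文网页文本。", 0.97),
    ScoredText("点击这里 广告", 0.12),
    ScoredText("一般质量的文本。", 0.64),
]

# A 0.9 threshold mirrors the "quality exceeding 90%" subset in spirit.
clean = select_by_quality(corpus, threshold=0.9)
print([item.text for item in clean])
```

Lowering the threshold trades cleanliness for volume, which is the choice the per-text scores are meant to leave to the user.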
-
Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection
Authors:
Haopeng Zhang,
Zijing Lin,
Feng Gao,
Junyu Dong,
Qian Du,
Heng-Chao Li
Abstract:
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community. However, existing SAR change detection methods are mainly based on convolutional neural networks (CNNs), with limited consideration of global attention mechanism. In this letter, we explore Transformer-like architecture for SAR change detection to incorporate global attention. To this end, we propose a convolution and attention mixer (CAMixer). First, to compensate for the inductive bias of the Transformer, we combine self-attention with shift convolution in a parallel way. The parallel design effectively captures the global semantic information via the self-attention and performs local feature extraction through shift convolution simultaneously. Second, we adopt a gating mechanism in the feed-forward network to enhance the non-linear feature transformation. The gating mechanism is formulated as the element-wise multiplication of two parallel linear layers. Important features can be highlighted, leading to high-quality representations against speckle noise. Extensive experiments conducted on three SAR datasets verify the superior performance of the proposed CAMixer. The source codes will be publicly available at https://github.com/summitgao/CAMixer.
Submitted 21 September, 2023;
originally announced September 2023.
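The gating mechanism in the CAMixer abstract is described as the element-wise multiplication of two parallel linear layers. A minimal sketch of that idea in plain Python is given below. The function names, the bias-free layers, and the final output projection `w_out` are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def linear(x, weight):
    """Multiply a (tokens x d_in) matrix by a (d_in x d_out) weight matrix."""
    return [
        [sum(a * w for a, w in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def gated_feed_forward(x, w_gate, w_value, w_out):
    """Gated feed-forward block: two parallel linear layers, multiplied element-wise.

    The output projection w_out is an assumption added so that the block
    returns to a chosen output width.
    """
    gate = linear(x, w_gate)    # first parallel branch
    value = linear(x, w_value)  # second parallel branch
    hidden = [
        [g * v for g, v in zip(g_row, v_row)]
        for g_row, v_row in zip(gate, value)
    ]
    return linear(hidden, w_out)


# Tiny hand-checkable example: one token with two features.
x = [[1.0, 2.0]]
w_gate = [[1.0, 0.0], [0.0, 1.0]]   # identity, so gate = [1, 2]
w_value = [[2.0, 0.0], [0.0, 3.0]]  # value = [2, 6]
w_out = [[1.0], [1.0]]              # sums the gated features
y = gated_feed_forward(x, w_gate, w_value, w_out)
print(y)  # [[14.0]] because 1*2 + 2*6 = 14
```

Because one branch scales the other feature by feature, the product is non-linear in the input even without an explicit activation function.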
-
A Two-Dimensional Deep Network for RF-based Drone Detection and Identification Towards Secure Coverage Extension
Authors:
Zixiao Zhao,
Qinghe Du,
Xiang Yao,
Lei Lu,
Shijiao Zhang
Abstract:
As drones become increasingly prevalent in human life, they also raise security concerns such as unauthorized access and control, as well as collisions and interference with manned aircraft. Therefore, ensuring the ability to accurately detect and distinguish between different drones holds significant implications for coverage extension. Assisted by machine learning, radio frequency (RF) detection can recognize the type and flight mode of drones based on the sampled drone signals. In this paper, we first utilize Short-Time Fourier Transform (STFT) to extract two-dimensional features from the raw signals, which contain both time-domain and frequency-domain information. Then, we employ a Convolutional Neural Network (CNN) built with ResNet structure to achieve multi-class classifications. Our experimental results show that the proposed ResNet-STFT can achieve higher accuracy and faster convergence on the extended dataset. Additionally, it exhibits balanced performance compared to other baselines on the raw dataset.
Submitted 26 August, 2023;
originally announced August 2023.
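The STFT step in the drone-detection abstract turns a one-dimensional signal into a two-dimensional time-frequency array. A rough standard-library sketch is shown below. The frame length, hop size, Hann window, and synthetic test tone are illustrative assumptions, not settings from the paper, and the direct DFT is written for clarity rather than speed.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..frame_len/2)."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic tone with exactly 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = stft_magnitude(tone)

# Rows index time (frames), columns index frequency (bins).
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

The resulting frames-by-bins array is the kind of image-like input a two-dimensional CNN can consume.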
-
Learning Evaluation Models from Large Language Models for Sequence Generation
Authors:
Chenglong Wang,
Hang Zhou,
Kaiyan Chang,
Tongran Liu,
Chunliang Zhang,
Quan Du,
Tong Xiao,
Jingbo Zhu
Abstract:
Large language models achieve state-of-the-art performance on sequence generation evaluation, but typically have a large number of parameters. This presents a computational challenge when applying their evaluation capability at scale. To overcome the challenge, in this paper, we propose ECT, an evaluation capability transfer method, to transfer the evaluation capability from LLMs to relatively lightweight language models. Based on the proposed ECT, we learn various evaluation models from ChatGPT, and employ them as reward models to improve sequence generation models via reinforcement learning and reranking approaches. Experimental results on machine translation, text style transfer, and summarization tasks demonstrate the effectiveness of our ECT. Notably, applying the learned evaluation models to sequence generation models results in better generated sequences as evaluated by commonly used metrics and ChatGPT.
Submitted 8 August, 2023;
originally announced August 2023.
-
Imbalance-Agnostic Source-Free Domain Adaptation via Avatar Prototype Alignment
Authors:
Hongbin Lin,
Mingkui Tan,
Yifan Zhang,
Zhen Qiu,
Shuaicheng Niu,
Dong Liu,
Qing Du,
Yanxia Liu
Abstract:
Source-free Unsupervised Domain Adaptation (SF-UDA) aims to adapt a well-trained source model to an unlabeled target domain without access to the source data. One key challenge is the lack of source data during domain adaptation. To handle this, we propose to mine the hidden knowledge of the source model and exploit it to generate source avatar prototypes. To this end, we propose a Contrastive Prototype Generation and Adaptation (CPGA) method. CPGA consists of two stages: Prototype generation and Prototype adaptation. Extensive experiments on three UDA benchmark datasets demonstrate the superiority of CPGA. However, existing SF-UDA studies implicitly assume balanced class distributions for both the source and target domains, which hinders their real applications. To address this issue, we study a more practical SF-UDA task, termed imbalance-agnostic SF-UDA, where the class distributions of both the unseen source domain and unlabeled target domain are unknown and could be arbitrarily skewed. This task is much more challenging than vanilla SF-UDA due to the co-occurrence of covariate shifts and unidentified class distribution shifts between the source and target domains. To address this task, we extend CPGA and propose a new Target-aware Contrastive Prototype Generation and Adaptation (T-CPGA) method. Specifically, for better prototype adaptation in the imbalance-agnostic scenario, T-CPGA applies a new pseudo label generation strategy to identify unknown target class distribution and generate accurate pseudo labels, by utilizing the collective intelligence of the source model and an additional contrastive language-image pre-trained model. Meanwhile, we further devise a target label-distribution-aware classifier to adapt the model to the unknown target class distribution. We empirically show that T-CPGA significantly outperforms CPGA and other SF-UDA methods in imbalance-agnostic SF-UDA.
Submitted 21 May, 2023;
originally announced May 2023.
-
Physical Knowledge Enhanced Deep Neural Network for Sea Surface Temperature Prediction
Authors:
Yuxin Meng,
Feng Gao,
Eric Rigall,
Ran Dong,
Junyu Dong,
Qian Du
Abstract:
Traditionally, numerical models have been deployed in oceanography studies to simulate ocean dynamics by representing physical equations. However, many factors pertaining to ocean dynamics seem to be ill-defined. We argue that transferring physical knowledge from observed data could further improve the accuracy of numerical models when predicting Sea Surface Temperature (SST). Recently, the advances in earth observation technologies have yielded a monumental growth of data. Consequently, it is imperative to explore ways in which to improve and supplement numerical models utilizing the ever-increasing amounts of historical observational data. To this end, we introduce a method for SST prediction that transfers physical knowledge from historical observations to numerical models. Specifically, we use a combination of an encoder and a generative adversarial network (GAN) to capture physical knowledge from the observed data. The numerical model data is then fed into the pre-trained model to generate physics-enhanced data, which can then be used for SST prediction. Experimental results demonstrate that the proposed method considerably enhances SST prediction performance when compared to several state-of-the-art baselines.
Submitted 18 April, 2023;
originally announced April 2023.
-
Multi-scale Adaptive Fusion Network for Hyperspectral Image Denoising
Authors:
Haodong Pan,
Feng Gao,
Junyu Dong,
Qian Du
Abstract:
Removing the noise and improving the visual quality of hyperspectral images (HSIs) is challenging in academia and industry. Great efforts have been made to leverage local, global or spectral context information for HSI denoising. However, existing methods still have limitations in feature interaction exploitation among multiple scales and rich spectral structure preservation. In view of this, we propose a novel solution to investigate the HSI denoising using a Multi-scale Adaptive Fusion Network (MAFNet), which can learn the complex nonlinear mapping between clean and noisy HSI. Two key components contribute to improving the hyperspectral image denoising: A progressively multiscale information aggregation network and a co-attention fusion module. Specifically, we first generate a set of multiscale images and feed them into a coarse-fusion network to exploit the contextual texture correlation. Thereafter, a fine fusion network is followed to exchange the information across the parallel multiscale subnetworks. Furthermore, we design a co-attention fusion module to adaptively emphasize informative features from different scales, and thereby enhance the discriminative learning capability for denoising. Extensive experiments on synthetic and real HSI datasets demonstrate that the proposed MAFNet has achieved better denoising performance than other state-of-the-art techniques. Our codes are available at https://github.com/summitgao/MAFNet.
Submitted 18 April, 2023;
originally announced April 2023.
-
Disentangling Writer and Character Styles for Handwriting Generation
Authors:
Gang Dai,
Yifan Zhang,
Qingfeng Wang,
Qing Du,
Zhuliang Yu,
Zhuoman Liu,
Shuangping Huang
Abstract:
Training machines to synthesize diverse handwritings is an intriguing task. Recently, RNN-based methods have been proposed to generate stylized online Chinese characters. However, these methods mainly focus on capturing a person's overall writing style, neglecting subtle style inconsistencies between characters written by the same person. For example, while a person's handwriting typically exhibits general uniformity (e.g., glyph slant and aspect ratios), there are still small style variations in finer details (e.g., stroke length and curvature) of characters. In light of this, we propose to disentangle the style representations at both writer and character levels from individual handwritings to synthesize realistic stylized online handwritten characters. Specifically, we present the style-disentangled Transformer (SDT), which employs two complementary contrastive objectives to extract the style commonalities of reference samples and capture the detailed style patterns of each sample, respectively. Extensive experiments on various language scripts demonstrate the effectiveness of SDT. Notably, our empirical findings reveal that the two learned style representations provide information at different frequency magnitudes, underscoring the importance of separate style extraction. Our source code is public at: https://github.com/dailenson/SDT.
Submitted 31 March, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Statistical Age-of-Information Optimization for Status Update over Multi-State Fading Channels
Authors:
Yuquan Xiao,
Qinghe Du
Abstract:
Age of information (AoI) is a powerful metric to evaluate the freshness of information, where minimization of average statistics, such as the average AoI and average peak AoI, currently prevails in guiding freshness optimization for related applications. Although minimizing these statistics does improve the received information's freshness for status update systems in the average sense, the time-varying fading characteristics of wireless channels often cause uncertain yet frequent age violations. The recently proposed statistical AoI metric, which evaluates the achievable minimum peak AoI under a given constraint on the age violation probability, can better characterize the features of AoI dynamics. In this paper, we study the statistical AoI minimization problem for status update systems over multi-state fading channels, which can effectively upper-bound the AoI violation probability but introduces prohibitively high computational complexity. To resolve this issue, we tackle the problem with a two-fold approach. For a small AoI exponent, the problem is approximated via a fractional programming problem. For a large AoI exponent, the problem is converted to a convex problem. Solving the two problems respectively, we derive the near-optimal sampling interval for diverse status update systems. Insightful observations are obtained on how the sampling interval should be tuned as a decreasing function of channel state information (CSI). Surprisingly, for extremely stringent AoI requirements, the sampling interval converges to a constant regardless of the CSI's variation. Numerical results verify the effectiveness as well as the superiority of our proposed scheme.
Submitted 27 November, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Robust Secrecy via Aerial Reflection and Jamming: Joint Optimization of Deployment and Transmission
Authors:
Xiao Tang,
Hongliang He,
Limeng Dong,
Lixin Li,
Qinghe Du,
Zhu Han
Abstract:
Reconfigurable intelligent surfaces (RISs) are recognized with great potential to strengthen wireless security, yet the performance gain largely depends on the deployment location of RISs in the network topology. In this paper, we consider the anti-eavesdropping communication established through a RIS at a fixed location, as well as an aerial platform mounting another RIS and a friendly jammer to further improve the secrecy. The aerial RIS helps enhance the legitimate signal and the aerial cooperative jamming is strengthened through the fixed RIS. The security gain with aerial reflection and jamming is further improved with the optimized deployment of the aerial platform. We particularly consider the imperfect channel state information issue and address the worst-case secrecy for robust performance. The formulated robust secrecy rate maximization problem is decomposed into two layers, where the inner layer solves for reflection and jamming with robust optimization, and the outer layer tackles the aerial deployment through deep reinforcement learning. Simulation results show the deployment under different network topologies and demonstrate the performance superiority of our proposal in terms of the worst-case security provisioning as compared with the baselines.
Submitted 28 February, 2023;
originally announced February 2023.
-
LWS: A Framework for Log-based Workload Simulation in Session-based SUT
Authors:
Yongqi Han,
Qingfeng Du,
Jincheng Xu,
Shengjie Zhao,
Zhekang Chen,
Li Cao,
Kanglin Yin,
Dan Pei
Abstract:
Artificial intelligence for IT Operations (AIOps) plays a critical role in operating and managing cloud-native systems and microservice-based applications but is limited by the lack of high-quality datasets with diverse scenarios. Realistic workloads are the premise and basis of generating such AIOps datasets, with the session-based workload being one of the most typical examples. Due to privacy concerns, complexity, variety, and requirements for reasonable intervention, it is difficult to copy or generate such workloads directly, showing the importance of effective and intervenable workload simulation. In this paper, we formulate the task of workload simulation and propose a framework for Log-based Workload Simulation (LWS) in session-based systems. LWS extracts the workload specification including the user behavior abstraction based on agglomerative clustering as well as relational models and the intervenable workload intensity from session logs. Then LWS combines the user behavior abstraction with the workload intensity to generate simulated workloads. The experimental evaluation is performed on an open-source cloud-native application with both well-designed and public real-world workloads, showing that the simulated workload generated by LWS is effective and intervenable, which provides the foundation of generating high-quality AIOps datasets.
Submitted 27 April, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
KEWS: A KPIs-Based Evaluation Framework of Workload Simulation On Microservice System
Authors:
Pengsheng Li,
Qingfeng Du,
Shengjie Zhao
Abstract:
Simulating the workload is an essential procedure in microservice systems as it helps augment realistic workloads whilst safeguarding user privacy. The efficacy of such simulation depends on its dynamic assessment. The straightforward and most efficient approach to this is comparing the original workload with the simulated one using Key Performance Indicators (KPIs), which capture the state of the system. Nonetheless, due to the extensive volume and complexity of KPIs, fully evaluating them is not feasible, and measuring their similarity poses a significant challenge. This paper introduces a similarity metric algorithm for KPIs, the Extended Shape-Based Distance (ESBD), which gauges similarity in both shape and intensity. Additionally, we propose a KPI-based Evaluation Framework for Workload Simulations (KEWS), comprising three modules: preprocessing, compression, and evaluation. These methodologies effectively counteract the adverse effects of KPIs' characteristics and offer a holistic evaluation. Experimental results substantiate the effectiveness of both ESBD and KEWS.
Submitted 27 November, 2023; v1 submitted 16 January, 2023;
originally announced January 2023.
-
Nearest Neighbor-Based Contrastive Learning for Hyperspectral and LiDAR Data Classification
Authors:
Meng Wang,
Feng Gao,
Junyu Dong,
Heng-Chao Li,
Qian Du
Abstract:
The joint hyperspectral image (HSI) and LiDAR data classification aims to interpret ground objects at a more detailed and precise level. Although deep learning methods have shown remarkable success in the multisource data classification task, self-supervised learning has rarely been explored. It is commonly nontrivial to build a robust self-supervised learning model for multisource data classification, because the semantic similarities of neighborhood regions are not exploited in existing contrastive learning frameworks. Furthermore, the heterogeneous gap induced by the inconsistent distribution of multisource data impedes the classification performance. To overcome these disadvantages, we propose a Nearest Neighbor-based Contrastive Learning Network (NNCNet), which takes full advantage of large amounts of unlabeled data to learn discriminative feature representations. Specifically, we propose a nearest neighbor-based data augmentation scheme to use enhanced semantic relationships among nearby regions, so that the intermodal semantic alignments can be captured more accurately. In addition, we design a bilinear attention module to exploit the second-order and even high-order feature interactions between the HSI and LiDAR data. Extensive experiments on four public datasets demonstrate the superiority of our NNCNet over state-of-the-art methods. The source code is available at https://github.com/summitgao/NNCNet.
Submitted 9 January, 2023;
originally announced January 2023.
-
SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery
Authors:
Jiaqing Zhang,
Jie Lei,
Weiying Xie,
Zhenman Fang,
Yunsong Li,
Qian Du
Abstract:
Accurate and timely detection of multiscale small objects that contain tens of pixels in remote sensing images (RSI) remains challenging. Most existing solutions design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this article, we propose an accurate yet fast object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high-resolution (HR) object detection on multiscale objects by utilizing assisted super resolution (SR) learning while considering both detection accuracy and computation cost. First, we utilize a symmetric compact multimodal fusion (MF) to extract supplementary information from various data for improving small object detection in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage, and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 75.09% (in terms of mAP50), which is more than 10% higher than the SOTA large models, such as YOLOv5l, YOLOv5x, and the RS-designed YOLOrs. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18 times and 3.8 times less than those of YOLOv5x. Our proposed model shows a favorable accuracy and speed tradeoff compared to the state-of-the-art models. The code will be open-sourced at https://github.com/icey-zhang/SuperYOLO.
Submitted 8 April, 2023; v1 submitted 27 September, 2022;
originally announced September 2022.
-
Single-source Domain Expansion Network for Cross-Scene Hyperspectral Image Classification
Authors:
Yuxiang Zhang,
Wei Li,
Weidong Sun,
Ran Tao,
Qian Du
Abstract:
Currently, cross-scene hyperspectral image (HSI) classification has drawn increasing attention. It is necessary to train a model only on the source domain (SD) and directly transfer it to the target domain (TD) when the TD needs to be processed in real time and cannot be reused for training. Based on the idea of domain generalization, a Single-source Domain Expansion Network (SDEnet) is developed to ensure the reliability and effectiveness of domain extension. The method uses generative adversarial learning to train in SD and test in TD. A generator including a semantic encoder and a morph encoder is designed to generate the extended domain (ED) based on an encoder-randomization-decoder architecture, where spatial and spectral randomization are specifically used to generate variable spatial and spectral information, and the morphological knowledge is implicitly applied as domain invariant information during domain expansion. Furthermore, supervised contrastive learning is employed in the discriminator to learn class-wise domain invariant representation, which drives intra-class samples of SD and ED. Meanwhile, adversarial training is designed to optimize the generator to drive intra-class samples of SD and ED to be separated. Extensive experiments on two public HSI datasets and one additional multispectral image (MSI) dataset demonstrate the superiority of the proposed method when compared with state-of-the-art techniques.
Submitted 4 September, 2022;
originally announced September 2022.
-
Improved Grant-Free Access for URLLC via Multi-Tier-Driven Computing: Network-Load Learning, Prediction, and Resource Allocation
Authors:
Zixiao Zhao,
Qinghe Du,
George K. Karagiannidis
Abstract:
Grant-Free (GF) access has been recognized as a promising candidate for Ultra-Reliable and Low-Latency Communications (URLLC). However, even with GF access, URLLC still may not effectively achieve high reliability and millisecond-level latency simultaneously. This is because the network load is typically time-varying and not known to the base station (BS), and thus the resource allocated for GF access cannot adapt well to variations of the network load, resulting in low resource utilization efficiency under light network load and severe collisions under heavy network load. To tackle this problem, we propose a multi-tier-driven computing framework and the associated algorithms for URLLC to support users with different QoS requirements. Specifically, we concentrate on K-repetition GF access in light of its simplicity and well-balanced performance for practical systems. In particular, our framework consists of three tiers of computation, namely network-load learning, network-load prediction, and adaptive resource allocation. In the first tier, the BS can learn the network-load information from the states (success, collision, and idle) of random-access resources in terms of resource blocks (RB) and time slots. In the second tier, the network-load variation is effectively predicted based on estimation results from the first tier. Finally, in the third tier, by deriving and weighing the failure probabilities of different groups of users, their QoS requirements, and the predicted network loads, the BS is able to dynamically allocate sufficient resources to accommodate the varying network loads. Simulation results show that our proposed approach can estimate the network load more accurately than the baseline schemes. Moreover, our adaptive resource allocation offers an effective way to enhance the QoS for different URLLC services simultaneously.
Submitted 22 August, 2022;
originally announced August 2022.
-
Synthetic Aperture Radar Image Change Detection via Layer Attention-Based Noise-Tolerant Network
Authors:
Desen Meng,
Feng Gao,
Junyu Dong,
Qian Du,
Heng-Chao Li
Abstract:
Recently, change detection methods for synthetic aperture radar (SAR) images based on convolutional neural networks (CNN) have gained increasing research attention. However, existing CNN-based methods neglect the interactions among multilayer convolutions, and errors involved in the preclassification restrict the network optimization. To this end, we propose a layer attention-based noise-tolerant network, termed LANTNet. In particular, we design a layer attention module that adaptively weights the features of different convolution layers. In addition, we design a noise-tolerant loss function that effectively suppresses the impact of noisy labels. Therefore, the model is insensitive to noisy labels in the preclassification results. The experimental results on three SAR datasets show that the proposed LANTNet performs better than several state-of-the-art methods. The source code is available at https://github.com/summitgao/LANTNet.
Submitted 8 August, 2022;
originally announced August 2022.
-
Towards Semantic Communications: Deep Learning-Based Image Semantic Coding
Authors:
Danlan Huang,
Feifei Gao,
Xiaoming Tao,
Qiyuan Du,
Jianhua Lu
Abstract:
Semantic communications has received growing interest since it can remarkably reduce the amount of data to be transmitted without missing critical information. Most existing works explore semantic encoding and transmission for text and apply techniques from Natural Language Processing (NLP) to interpret the meaning of the text. In this paper, we conceive semantic communications for image data, which is much richer in semantics and more bandwidth sensitive. We propose a reinforcement learning-based adaptive semantic coding (RL-ASC) approach that encodes images beyond the pixel level. Firstly, we define the semantic concept of image data, which includes the category, spatial arrangement, and visual feature, as the representation unit, and propose a convolutional semantic encoder to extract semantic concepts. Secondly, we propose an image reconstruction criterion that evolves from the traditional pixel similarity to semantic similarity and perceptual performance. Thirdly, we design a novel RL-based semantic bit allocation model, whose reward is the increase in rate-semantic-perceptual performance after encoding a certain semantic concept with an adaptive quantization level. Thus, the task-related information is preserved and reconstructed properly while less important data is discarded. Finally, we propose a Generative Adversarial Nets (GANs) based semantic decoder that fuses both local and global features via an attention module. Experimental results demonstrate that the proposed RL-ASC is noise robust, can reconstruct visually pleasing and semantically consistent images, and reduces the bit cost by multiples compared to standard codecs and other deep learning-based image codecs.
Submitted 8 August, 2022;
originally announced August 2022.
-
An Unsupervised Deep-Learning Method for Bone Age Assessment
Authors:
Hao Zhu,
Wan-Jing Nie,
Yue-Jie Hou,
Qi-Meng Du,
Si-Jing Li,
Chi-Chun Zhou
Abstract:
The bone age, reflecting the degree of development of the bones, can be used to predict the adult height and detect endocrine diseases of children. Both examinations of radiologists and variability of operators have a significant impact on bone age assessment. To decrease human intervention, machine learning algorithms are used to assess the bone age automatically. However, conventional supervised deep-learning methods need pre-labeled data. In this paper, we build on the convolutional auto-encoder with constraints (CCAE), an unsupervised deep-learning model originally proposed for fingerprint classification, adapt it to the classification of bone age, and name it BA-CCAE. In the proposed BA-CCAE model, the key regions of the raw X-ray images of the bone age are encoded, yielding the latent vectors. The K-means clustering algorithm is used to obtain the final classifications by grouping the latent vectors of the bone images. A set of experiments on the Radiological Society of North America pediatric bone age dataset (RSNA) shows that the accuracy of classifications at 48-month intervals is 76.15%. Although this accuracy is lower than that of most existing supervised models, the proposed BA-CCAE model can establish the classification of bone age without any pre-labeled data, and to the best of our knowledge, the proposed BA-CCAE is one of the few trials using an unsupervised deep-learning method for bone age assessment.
Submitted 11 June, 2022;
originally announced June 2022.
-
Hyperspectral Unmixing Based on Nonnegative Matrix Factorization: A Comprehensive Review
Authors:
Xin-Ru Feng,
Heng-Chao Li,
Rui Wang,
Qian Du,
Xiuping Jia,
Antonio Plaza
Abstract:
Hyperspectral unmixing has been an important technique that estimates a set of endmembers and their corresponding abundances from a hyperspectral image (HSI). Nonnegative matrix factorization (NMF) plays an increasingly significant role in solving this problem. In this article, we present a comprehensive survey of the NMF-based methods proposed for hyperspectral unmixing. Taking the NMF model as a baseline, we show how to improve NMF by utilizing the main properties of HSIs (e.g., spectral, spatial, and structural information). We categorize three important development directions including constrained NMF, structured NMF, and generalized NMF. Furthermore, several experiments are conducted to illustrate the effectiveness of associated algorithms. Finally, we conclude the article with possible future directions with the purposes of providing guidelines and inspiration to promote the development of hyperspectral unmixing.
Submitted 19 May, 2022;
originally announced May 2022.
-
Adaptive Cross-Attention-Driven Spatial-Spectral Graph Convolutional Network for Hyperspectral Image Classification
Authors:
Jin-Yu Yang,
Heng-Chao Li,
Wen-Shuai Hu,
Lei Pan,
Qian Du
Abstract:
Recently, graph convolutional networks (GCNs) have been developed to explore spatial relationship between pixels, achieving better classification performance of hyperspectral images (HSIs). However, these methods fail to sufficiently leverage the relationship between spectral bands in HSI data. As such, we propose an adaptive cross-attention-driven spatial-spectral graph convolutional network (ACSS-GCN), which is composed of a spatial GCN (Sa-GCN) subnetwork, a spectral GCN (Se-GCN) subnetwork, and a graph cross-attention fusion module (GCAFM). Specifically, Sa-GCN and Se-GCN are proposed to extract the spatial and spectral features by modeling correlations between spatial pixels and between spectral bands, respectively. Then, by integrating attention mechanism into information aggregation of graph, the GCAFM, including three parts, i.e., spatial graph attention block, spectral graph attention block, and fusion block, is designed to fuse the spatial and spectral features and suppress noise interference in Sa-GCN and Se-GCN. Moreover, the idea of the adaptive graph is introduced to explore an optimal graph through back propagation during the training process. Experiments on two HSI data sets show that the proposed method achieves better performance than other classification methods.
Submitted 12 April, 2022;
originally announced April 2022.
-
A3CLNN: Spatial, Spectral and Multiscale Attention ConvLSTM Neural Network for Multisource Remote Sensing Data Classification
Authors:
Heng-Chao Li,
Wen-Shuai Hu,
Wei Li,
Jun Li,
Qian Du,
Antonio Plaza
Abstract:
The problem of effectively exploiting the information from multiple data sources has become a relevant but challenging research topic in remote sensing. In this paper, we propose a new approach to exploit the complementarity of two data sources: hyperspectral images (HSIs) and light detection and ranging (LiDAR) data. Specifically, we develop a new dual-channel spatial, spectral and multiscale attention convolutional long short-term memory neural network (called dual-channel A3CLNN) for feature extraction and classification of multisource remote sensing data. Spatial, spectral and multiscale attention mechanisms are first designed for HSI and LiDAR data in order to learn spectral- and spatial-enhanced feature representations, and to represent multiscale information for different classes. In the designed fusion network, a novel composite attention learning mechanism (combined with a three-level fusion strategy) is used to fully integrate the features in these two data sources. Finally, inspired by the idea of transfer learning, a novel stepwise training strategy is designed to yield a final classification result. Our experimental results, conducted on several multisource remote sensing data sets, demonstrate that the newly proposed dual-channel A3CLNN exhibits better feature representation ability (leading to more competitive classification performance) than other state-of-the-art methods.
Submitted 9 April, 2022;
originally announced April 2022.
-
MS-HLMO: Multi-scale Histogram of Local Main Orientation for Remote Sensing Image Registration
Authors:
Chenzhong Gao,
Wei Li,
Ran Tao,
Qian Du
Abstract:
Multi-source image registration is challenging due to intensity, rotation, and scale differences among the images. Considering the characteristics and differences of multi-source remote sensing images, a feature-based registration algorithm named Multi-scale Histogram of Local Main Orientation (MS-HLMO) is proposed. Harris corner detection is first adopted to generate feature points. The HLMO feature of each Harris feature point is extracted on a Partial Main Orientation Map (PMOM) with a Generalized Gradient Location and Orientation Histogram-like (GGLOH) feature descriptor, which provides high intensity, rotation, and scale invariance. The feature points are matched through a multi-scale matching strategy. Comprehensive experiments on 17 multi-source remote sensing scenes demonstrate that the proposed MS-HLMO and its simplified version MS-HLMO$^+$ outperform other competitive registration algorithms in terms of effectiveness and generalization.
Submitted 1 April, 2022;
originally announced April 2022.
-
Beyond Fixation: Dynamic Window Visual Transformer
Authors:
Pengzhen Ren,
Changlin Li,
Guangrun Wang,
Yun Xiao,
Qing Du,
Xiaodan Liang,
Xiaojun Chang
Abstract:
Recently, there has been a surge of interest in visual transformers that reduce the computational cost by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window for modeling by default, ignoring the impact of window size on model performance. However, this may limit the modeling potential of these window-based models for multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond the model that employs a fixed single window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. Then, the information is dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets, ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformers \cite{liu2021swin}, DW-ViT has achieved consistent and substantial improvements on all three datasets with similar parameters and computational costs. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer.
Submitted 8 April, 2022; v1 submitted 24 March, 2022;
originally announced March 2022.
-
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation
Authors:
Bei Li,
Quan Du,
Tao Zhou,
Yi Jing,
Shuhan Zhou,
Xin Zeng,
Tong Xiao,
JingBo Zhu,
Xuebo Liu,
Min Zhang
Abstract:
Residual networks are an Euler discretization of solutions to Ordinary Differential Equations (ODE). This paper explores a deeper relationship between Transformer and numerical ODE methods. We first show that a residual block of layers in Transformer can be described as a higher-order solution to ODE. Inspired by this, we design a new architecture, ODE Transformer, which is analogous to the Runge-Kutta method that is well motivated in ODE. As a natural extension to Transformer, ODE Transformer is easy to implement and efficient to use. Experimental results on the large-scale machine translation, abstractive summarization, and grammar error correction tasks demonstrate the high genericity of ODE Transformer. It can gain large improvements in model performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the WMT'14 English-German and English-French benchmarks) at a slight cost in inference efficiency.
Submitted 17 March, 2022;
originally announced March 2022.
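The analogy drawn in the ODE Transformer abstract above, between a residual connection and an Euler step and between a higher-order block and a Runge-Kutta step, can be illustrated with a minimal sketch. It is not the authors' implementation. The scalar function `f` is a hypothetical stand-in for a Transformer sublayer, the names `euler_block` and `rk2_block` are invented for illustration, and the second-order update shown is the standard Heun (improved Euler) form with unit step size, which is an assumption about which Runge-Kutta variant to show.

```python
from typing import Callable

# Hypothetical stand-in for a sublayer F(.); a real model would use a
# Transformer layer acting on tensors rather than a scalar function.
Layer = Callable[[float], float]


def euler_block(y: float, f: Layer) -> float:
    """Plain residual update y + F(y): one explicit Euler step with step size 1."""
    return y + f(y)


def rk2_block(y: float, f: Layer) -> float:
    """Second-order update (Heun's method, step size 1).

    The same function f is evaluated twice and the two estimates are averaged,
    so the block reuses its parameters instead of adding new ones.
    """
    f1 = f(y)        # slope estimate at the current state
    f2 = f(y + f1)   # slope estimate at the Euler-predicted state
    return y + 0.5 * (f1 + f2)


if __name__ == "__main__":
    def identity_layer(y: float) -> float:
        return y  # corresponds to dy/dt = y

    print(euler_block(1.0, identity_layer))  # 2.0
    print(rk2_block(1.0, identity_layer))    # 2.5
```

For dy/dt = y with y(0) = 1, the exact value at t = 1 is e ≈ 2.718, so the second-order update (2.5) lands closer than the Euler update (2.0). This is the sense in which a higher-order block gives a more accurate discretization.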