Search | arXiv e-print repository

Technical Report for CVPR 2024 WeatherProof Dataset Challenge: Semantic Segmentation on Paired Real Data

Authors: Guojin Cao, Jiaxu Li, Jia He, Ying Min, Yunhao Zhang

Abstract: This technical report presents the implementation details of 2nd winning for CVPR'24 UG2 WeatherProof Dataset Challenge. This challenge aims at semantic segmentation of images degraded by various degrees of weather from all around the world. We addressed this problem by introducing a pre-trained large-scale vision foundation model: InternImage, and trained it using images with different levels of… ▽ More This technical report presents the implementation details of 2nd winning for CVPR'24 UG2 WeatherProof Dataset Challenge. This challenge aims at semantic segmentation of images degraded by various degrees of weather from all around the world. We addressed this problem by introducing a pre-trained large-scale vision foundation model: InternImage, and trained it using images with different levels of noise. Besides, we did not use additional datasets in the training procedure and utilized dense-CRF as post-processing in the final testing procedure. As a result, we achieved 2nd place in the challenge with 45.1 mIOU and fewer submissions than the other winners. △ Less

Submitted 9 June, 2024; originally announced July 2024.

arXiv:2406.17628 [pdf, other]

Video Inpainting Localization with Contrastive Learning

Authors: Zijie Lou, Gang Cao, Man Lin

Abstract: Deep video inpainting is typically used as malicious manipulation to remove important objects for creating fake videos. It is significant to identify the inpainted regions blindly. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual for learning… ▽ More Deep video inpainting is typically used as malicious manipulation to remove important objects for creating fake videos. It is significant to identify the inpainted regions blindly. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual for learning effective spatiotemporal forensic features. To enhance the discriminative power, supervised contrastive learning is adopted to capture the local inconsistency of inpainted videos through attracting/repelling the positive/negative pristine and forged pixel pairs. A pixel-wise inpainting localization map is yielded by a lightweight convolution decoder with a specialized two-stage training strategy. To prepare enough training samples, we build a video object segmentation dataset of 2500 videos with pixel-level annotations per frame. Extensive experimental results validate the superiority of ViLocal over state-of-the-arts. Code and dataset will be available at https://github.com/multimediaFor/ViLocal. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2406.13576

arXiv:2406.16907 [pdf, other]

RayProNet: A Neural Point Field Framework for Radio Propagation Modeling in 3D Environments

Authors: Ge Cao, Zhen Peng

Abstract: The radio wave propagation channel is central to the performance of wireless communication systems. In this paper, we introduce a novel machine learning-empowered methodology for wireless channel modeling. The key ingredients include a point-cloud-based neural network and a Spherical Harmonics encoder with light probes. Our approach offers several significant advantages, including the flexibility… ▽ More The radio wave propagation channel is central to the performance of wireless communication systems. In this paper, we introduce a novel machine learning-empowered methodology for wireless channel modeling. The key ingredients include a point-cloud-based neural network and a Spherical Harmonics encoder with light probes. Our approach offers several significant advantages, including the flexibility to adjust antenna radiation patterns and transmitter/receiver locations, the capability to predict radio power maps, and the scalability of large-scale wireless scenes. As a result, it lays the groundwork for an end-to-end pipeline for network planning and deployment optimization. The proposed work is validated in various outdoor and indoor radio environments. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.13576 [pdf, other]

Trusted Video Inpainting Localization via Deep Attentive Noise Learning

Authors: Zijie Lou, Gang Cao, Man Lin

Abstract: Digital video inpainting techniques have been substantially improved with deep learning in recent years. Although inpainting is originally designed to repair damaged areas, it can also be used as malicious manipulation to remove important objects for creating false scenes and facts. As such it is significant to identify inpainted regions blindly. In this paper, we present a Trusted Video Inpaintin… ▽ More Digital video inpainting techniques have been substantially improved with deep learning in recent years. Although inpainting is originally designed to repair damaged areas, it can also be used as malicious manipulation to remove important objects for creating false scenes and facts. As such it is significant to identify inpainted regions blindly. In this paper, we present a Trusted Video Inpainting Localization network (TruVIL) with excellent robustness and generalization ability. Observing that high-frequency noise can effectively unveil the inpainted regions, we design deep attentive noise learning in multiple stages to capture the inpainting traces. Firstly, a multi-scale noise extraction module based on 3D High Pass (HP3D) layers is used to create the noise modality from input RGB frames. Then the correlation between such two complementary modalities are explored by a cross-modality attentive fusion module to facilitate mutual feature learning. Lastly, spatial details are selectively enhanced by an attentive noise decoding module to boost the localization performance of the network. To prepare enough training samples, we also build a frame-level video object segmentation dataset of 2500 videos with pixel-level annotation for all frames. Extensive experimental results validate the superiority of TruVIL compared with the state-of-the-arts. In particular, both quantitative and qualitative evaluations on various inpainted videos verify the remarkable robustness and generalization ability of our proposed TruVIL. Code and dataset will be available at https://github.com/multimediaFor/TruVIL. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13565 [pdf, other]

Exploring Multi-view Pixel Contrast for General and Robust Image Forgery Localization

Authors: Zijie Lou, Gang Cao, Kun Guo, Haochen Zhu, Lifang Yu

Abstract: Image forgery localization, which aims to segment tampered regions in an image, is a fundamental yet challenging digital forensic task. While some deep learning-based forensic methods have achieved impressive results, they directly learn pixel-to-label mappings without fully exploiting the relationship between pixels in the feature space. To address such deficiency, we propose a Multi-view Pixel-w… ▽ More Image forgery localization, which aims to segment tampered regions in an image, is a fundamental yet challenging digital forensic task. While some deep learning-based forensic methods have achieved impressive results, they directly learn pixel-to-label mappings without fully exploiting the relationship between pixels in the feature space. To address such deficiency, we propose a Multi-view Pixel-wise Contrastive algorithm (MPC) for image forgery localization. Specifically, we first pre-train the backbone network with the supervised contrastive loss to model pixel relationships from the perspectives of within-image, cross-scale and cross-modality. That is aimed at increasing intra-class compactness and inter-class separability. Then the localization head is fine-tuned using the cross-entropy loss, resulting in a better pixel localizer. The MPC is trained on three different scale training datasets to make a comprehensive and fair comparison with existing image forgery localization algorithms. Extensive experiments on the small, medium and large scale training datasets show that the proposed MPC achieves higher generalization performance and robustness against post-processing than the state-of-the-arts. Code will be available at https://github.com/multimediaFor/MPC. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12651 [pdf, other]

Transforming Surgical Interventions with Embodied Intelligence for Ultrasound Robotics

Authors: Huan Xu, Jinlin Wu, Guanglin Cao, Zhen Chen, Zhen Lei, Hongbin Liu

Abstract: Ultrasonography has revolutionized non-invasive diagnostic methodologies, significantly enhancing patient outcomes across various medical domains. Despite its advancements, integrating ultrasound technology with robotic systems for automated scans presents challenges, including limited command understanding and dynamic execution capabilities. To address these challenges, this paper introduces a no… ▽ More Ultrasonography has revolutionized non-invasive diagnostic methodologies, significantly enhancing patient outcomes across various medical domains. Despite its advancements, integrating ultrasound technology with robotic systems for automated scans presents challenges, including limited command understanding and dynamic execution capabilities. To address these challenges, this paper introduces a novel Ultrasound Embodied Intelligence system that synergistically combines ultrasound robots with large language models (LLMs) and domain-specific knowledge augmentation, enhancing ultrasound robots' intelligence and operational efficiency. Our approach employs a dual strategy: firstly, integrating LLMs with ultrasound robots to interpret doctors' verbal instructions into precise motion planning through a comprehensive understanding of ultrasound domain knowledge, including APIs and operational manuals; secondly, incorporating a dynamic execution mechanism, allowing for real-time adjustments to scanning plans based on patient movements or procedural errors. We demonstrate the effectiveness of our system through extensive experiments, including ablation studies and comparisons across various models, showcasing significant improvements in executing medical procedures from verbal commands. Our findings suggest that the proposed system improves the efficiency and quality of ultrasound scans and paves the way for further advancements in autonomous medical scanning technologies, with the potential to transform non-invasive diagnostics and streamline medical workflows. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: This work has been accepted by MICCAI 2024

arXiv:2405.00461 [pdf, other]

Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning

Authors: Huan Xu, Jinlin Wu, Guanglin Cao, Zhen Lei, Zhen Chen, Hongbin Liu

Abstract: Ultrasound robots are increasingly used in medical diagnostics and early disease screening. However, current ultrasound robots lack the intelligence to understand human intentions and instructions, hindering autonomous ultrasound scanning. To solve this problem, we propose a novel Ultrasound Embodied Intelligence system that equips ultrasound robots with the large language model (LLM) and domain k… ▽ More Ultrasound robots are increasingly used in medical diagnostics and early disease screening. However, current ultrasound robots lack the intelligence to understand human intentions and instructions, hindering autonomous ultrasound scanning. To solve this problem, we propose a novel Ultrasound Embodied Intelligence system that equips ultrasound robots with the large language model (LLM) and domain knowledge, thereby improving the efficiency of ultrasound robots. Specifically, we first design an ultrasound operation knowledge database to add expertise in ultrasound scanning to the LLM, enabling the LLM to perform precise motion planning. Furthermore, we devise a dynamic ultrasound scanning strategy based on a \textit{think-observe-execute} prompt engineering, allowing LLMs to dynamically adjust motion planning strategies during the scanning procedures. Extensive experiments demonstrate that our system significantly improves ultrasound scan efficiency and quality from verbal commands. This advancement in autonomous medical scanning technology contributes to non-invasive diagnostics and streamlined medical workflows. △ Less

Submitted 1 May, 2024; originally announced May 2024.

Comments: 3 pages, 1 figure, 2 tables

arXiv:2403.20254 [pdf, other]

Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions

Authors: Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo

Abstract: Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incu… ▽ More Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark. △ Less

Submitted 29 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2403.16638 [pdf, other]

AI-Generated Video Detection via Spatio-Temporal Anomaly Learning

Authors: Jianfa Bai, Man Lin, Gang Cao

Abstract: The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Speci… ▽ More The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately for identifying the anomalies in spatical and optical flow domains, respectively. Results of such sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization and robustness of our AIGVDet scheme. Code and dataset will be available at https://github.com/multimediaFor/AIGVDet. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.09143 [pdf, other]

A New Split Algorithm for 3D Gaussian Splatting

Authors: Qiyuan Feng, Gengchen Cao, Haoxiang Chen, Tai-Jiang Mu, Ralph R. Martin, Shi-Min Hu

Abstract: 3D Gaussian splatting models, as a novel explicit 3D representation, have been applied in many domains recently, such as explicit geometric editing and geometry generation. Progress has been rapid. However, due to their mixed scales and cluttered shapes, 3D Gaussian splatting models can produce a blurred or needle-like effect near the surface. At the same time, 3D Gaussian splatting models tend to… ▽ More 3D Gaussian splatting models, as a novel explicit 3D representation, have been applied in many domains recently, such as explicit geometric editing and geometry generation. Progress has been rapid. However, due to their mixed scales and cluttered shapes, 3D Gaussian splatting models can produce a blurred or needle-like effect near the surface. At the same time, 3D Gaussian splatting models tend to flatten large untextured regions, yielding a very sparse point cloud. These problems are caused by the non-uniform nature of 3D Gaussian splatting models, so in this paper, we propose a new 3D Gaussian splitting algorithm, which can produce a more uniform and surface-bounded 3D Gaussian splatting model. Our algorithm splits an $N$-dimensional Gaussian into two N-dimensional Gaussians. It ensures consistency of mathematical characteristics and similarity of appearance, allowing resulting 3D Gaussian splatting models to be more uniform and a better fit to the underlying surface, and thus more suitable for explicit editing, point cloud extraction and other tasks. Meanwhile, our 3D Gaussian splitting approach has a very simple closed-form solution, making it readily applicable to any 3D Gaussian model. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: 11 pages, 10 figures

arXiv:2402.14049 [pdf, other]

Generative Adversarial Models for Extreme Downscaling of Climate Datasets

Authors: Guiye Li, Guofeng Cao

Abstract: Addressing the challenges of climate change requires accurate and high-resolution mapping of climate and weather variables. However, many existing climate datasets, such as the gridded outputs of the state-of-the-art numerical climate models (e.g., general circulation models), are only available at very coarse spatial resolutions due to the model complexity and extremely high computational demand.… ▽ More Addressing the challenges of climate change requires accurate and high-resolution mapping of climate and weather variables. However, many existing climate datasets, such as the gridded outputs of the state-of-the-art numerical climate models (e.g., general circulation models), are only available at very coarse spatial resolutions due to the model complexity and extremely high computational demand. Deep-learning-based methods, particularly generative adversarial networks (GANs) and their variants, have proved effective for refining natural images, and have shown great promise in improving scientific datasets. In this paper, we describe a conditional GAN-based geospatial downscaling method for extreme downscaling of gridded climate datasets. Compared to most existing methods, the method can generate high-resolution accurate climate datasets from very low-resolution inputs. More importantly, the method explicitly considers the uncertainty inherent to the downscaling process that tends to be ignored in existing methods. Given an input, the method can produce a multitude of plausible high-resolution samples instead of one single deterministic result. These samples allow for an empirical exploration and inferences of model uncertainty and robustness. With a case study of gridded climate datasets (wind velocity and solar irradiance), we demonstrate the performances of the framework in downscaling tasks with very high scaling factors (up to $64\times$) and highlight the advantages of the framework with a comprehensive comparison with commonly used downscaling methods, including area-to-point (ATP) kriging, deep image prior (DIP), enhanced deep super-resolution network (EDSR), enhanced super-resolution generative adversarial networks (ESRGAN), and physics-informed resolution-enhancing GAN (PhIRE GAN). △ Less

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2401.07059 [pdf]

Classifying Proposals of Decentralized Autonomous Organizations Using Large Language Models

Authors: Christian Ziegler, Marcos Miranda, Guangye Cao, Gustav Arentoft, Doo Wan Nam

Abstract: Our study demonstrates the effective use of Large Language Models (LLMs) for automating the classification of complex datasets. We specifically target proposals of Decentralized Autonomous Organizations (DAOs), as the clas-sification of this data requires the understanding of context and, therefore, depends on human expertise, leading to high costs associated with the task. The study applies an it… ▽ More Our study demonstrates the effective use of Large Language Models (LLMs) for automating the classification of complex datasets. We specifically target proposals of Decentralized Autonomous Organizations (DAOs), as the clas-sification of this data requires the understanding of context and, therefore, depends on human expertise, leading to high costs associated with the task. The study applies an iterative approach to specify categories and further re-fine them and the prompt in each iteration, which led to an accuracy rate of 95% in classifying a set of 100 proposals. With this, we demonstrate the po-tential of LLMs to automate data labeling tasks that depend on textual con-text effectively. △ Less

Submitted 3 July, 2024; v1 submitted 13 January, 2024; originally announced January 2024.

Report number: Dawo/2024/01 ACM Class: H.0

arXiv:2401.06994 [pdf, other]

UniVision: A Unified Framework for Vision-Centric 3D Perception

Authors: Yu Hong, Qian Liu, Huayuan Cheng, Danjiao Ma, Hang Dai, Yu Wang, Guangzhi Cao, Yong Ding

Abstract: The past few years have witnessed the rapid development of vision-centric 3D perception in autonomous driving. Although the 3D perception models share many structural and conceptual similarities, there still exist gaps in their feature representations, data formats, and objectives, posing challenges for unified and efficient 3D perception framework design. In this paper, we present UniVision, a si… ▽ More The past few years have witnessed the rapid development of vision-centric 3D perception in autonomous driving. Although the 3D perception models share many structural and conceptual similarities, there still exist gaps in their feature representations, data formats, and objectives, posing challenges for unified and efficient 3D perception framework design. In this paper, we present UniVision, a simple and efficient framework that unifies two major tasks in vision-centric 3D perception, \ie, occupancy prediction and object detection. Specifically, we propose an explicit-implicit view transform module for complementary 2D-3D feature transformation. We propose a local-global feature extraction and fusion module for efficient and adaptive voxel and BEV feature extraction, enhancement, and interaction. Further, we propose a joint occupancy-detection data augmentation strategy and a progressive loss weight adjustment strategy which enables the efficiency and stability of the multi-task framework training. We conduct extensive experiments for different perception tasks on four public benchmarks, including nuScenes LiDAR segmentation, nuScenes detection, OpenOccupancy, and Occ3D. UniVision achieves state-of-the-art results with +1.5 mIoU, +1.8 NDS, +1.5 mIoU, and +1.8 mIoU gains on each benchmark, respectively. We believe that the UniVision framework can serve as a high-performance baseline for the unified vision-centric 3D perception task. The code will be available at \url{https://github.com/Cc-Hy/UniVision}. △ Less

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.06827 [pdf, other]

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Authors: Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu

Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the sensitivity to text input and the tuning process across multi-modal prompts. With the advanced utilization of the V-L model like CLIP, recent approaches deploy… ▽ More Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the sensitivity to text input and the tuning process across multi-modal prompts. With the advanced utilization of the V-L model like CLIP, recent approaches deploy learnable prompts instead of hand-craft prompts to boost the generalization performance and address the aforementioned challenges. Inspired by layer-wise training, which is wildly used in image fusion, we note that using a sequential training process to adapt different modalities branches of CLIP efficiently facilitates the improvement of generalization. In the context of addressing the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities prompts, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, which indicates a competitive generalization performance in line with the state-of-the-art. Preeminently, APLe shows robustness and favourable performance in prompt-length experiments with an absolute advantage in adopting the V-L models. △ Less

Submitted 23 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: 7 pages,3 figures

arXiv:2312.13537 [pdf, other]

HyperEditor: Achieving Both Authenticity and Cross-Domain Capability in Image Editing via Hypernetworks

Authors: Hai Zhang, Chunwei Wu, Guitao Cao, Hailing Wang, Wenming Cao

Abstract: Editing real images authentically while also achieving cross-domain editing remains a challenge. Recent studies have focused on converting real images into latent codes and accomplishing image editing by manipulating these codes. However, merely manipulating the latent codes would constrain the edited images to the generator's image domain, hindering the attainment of diverse editing goals. In res… ▽ More Editing real images authentically while also achieving cross-domain editing remains a challenge. Recent studies have focused on converting real images into latent codes and accomplishing image editing by manipulating these codes. However, merely manipulating the latent codes would constrain the edited images to the generator's image domain, hindering the attainment of diverse editing goals. In response, we propose an innovative image editing method called HyperEditor, which utilizes weight factors generated by hypernetworks to reassign the weights of the pre-trained StyleGAN2's generator. Guided by CLIP's cross-modal image-text semantic alignment, this innovative approach enables us to simultaneously accomplish authentic attribute editing and cross-domain style transfer, a capability not realized in previous methods. Additionally, we ascertain that modifying only the weights of specific layers in the generator can yield an equivalent editing result. Therefore, we introduce an adaptive layer selector, enabling our hypernetworks to autonomously identify the layers requiring output weight factors, which can further improve our hypernetworks' efficiency. Extensive experiments on abundant challenging datasets demonstrate the effectiveness of our method. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI2024

arXiv:2311.14939 [pdf, other]

OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss

Authors: Zezhou Wang, Guitao Cao, Xidong Xi, Jiangtao Wang

Abstract: Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties. These uncertainties include significant differences in object sizes and encountering the class unseen. It may result in poor performance when traditional object detection models are directly applied to automated driving detection. Because they usually presume fixed categorie… ▽ More Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties. These uncertainties include significant differences in object sizes and encountering the class unseen. It may result in poor performance when traditional object detection models are directly applied to automated driving detection. Because they usually presume fixed categories of common traffic participants, such as pedestrians and cars. Worsely, the huge class imbalance between common and novel classes further exacerbates performance degradation. To address the issues stated, we propose OpenNet to moderate the class imbalance with the Balanced Loss, which is based on Cross Entropy Loss. Besides, we adopt an inductive layer based on gradient reshaping to fast learn new classes with limited samples during incremental learning. To against catastrophic forgetting, we employ normalized feature distillation. By the way, we improve multi-scale detection robustness and unknown class recognition through FPN and energy-based detection, respectively. The Experimental results upon the CODA dataset show that the proposed method can obtain better performance than that of the existing methods. △ Less

Submitted 25 November, 2023; originally announced November 2023.

arXiv:2311.13128 [pdf, other]

P2RBox: Point Prompt Oriented Object Detection with SAM

Authors: Guangming Cao, Xuehui Yu, Wenwen Yu, Xumeng Han, Xue Yang, Guorong Li, Jianbin Jiao, Zhenjun Han

Abstract: Single-point annotation in oriented object detection of remote sensing scenarios is gaining increasing attention due to its cost-effectiveness. However, due to the granularity ambiguity of points, there is a significant performance gap between previous methods and those with fully supervision. In this study, we introduce P2RBox, which employs point prompt to generate rotated box (RBox) annotation… ▽ More Single-point annotation in oriented object detection of remote sensing scenarios is gaining increasing attention due to its cost-effectiveness. However, due to the granularity ambiguity of points, there is a significant performance gap between previous methods and those with fully supervision. In this study, we introduce P2RBox, which employs point prompt to generate rotated box (RBox) annotation for oriented object detection. P2RBox employs the SAM model to generate high-quality mask proposals. These proposals are then refined using the semantic and spatial information from annotation points. The best masks are converted into oriented boxes based on the feature directions suggested by the model. P2RBox incorporates two advanced guidance cues: Boundary Sensitive Mask guidance, which leverages semantic information, and Centrality guidance, which utilizes spatial information to reduce granularity ambiguity. This combination enhances detection capabilities significantly. To demonstrate the effectiveness of this method, enhancements based on the baseline were observed by integrating three different detectors. Furthermore, compared to the state-of-the-art point-annotated generative method PointOBB, P2RBox outperforms by about 29% mAP (62.43% vs 33.31%) on DOTA-v1.0 dataset, which provides possibilities for the practical application of point annotations. △ Less

Submitted 23 May, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.08910 [pdf, other]

Progressive Feedback-Enhanced Transformer for Image Forgery Localization

Authors: Haochen Zhu, Gang Cao, Xianglin Huang

Abstract: Blind detection of the forged regions in digital images is an effective authentication means to counter the malicious use of local image editing techniques. Existing encoder-decoder forensic networks overlook the fact that detecting complex and subtle tampered regions typically requires more feedback information. In this paper, we propose a Progressive FeedbACk-enhanced Transformer (ProFact) netwo… ▽ More Blind detection of the forged regions in digital images is an effective authentication means to counter the malicious use of local image editing techniques. Existing encoder-decoder forensic networks overlook the fact that detecting complex and subtle tampered regions typically requires more feedback information. In this paper, we propose a Progressive FeedbACk-enhanced Transformer (ProFact) network to achieve coarse-to-fine image forgery localization. Specifically, the coarse localization map generated by an initial branch network is adaptively fed back to the early transformer encoder layers for enhancing the representation of positive features while suppressing interference factors. The cascaded transformer network, combined with a contextual spatial pyramid module, is designed to refine discriminative forensic features for improving the forgery localization accuracy and reliability. Furthermore, we present an effective strategy to automatically generate large-scale forged image samples close to real-world forensic scenarios, especially in realistic and coherent processing. Leveraging on such samples, a progressive and cost-effective two-stage training protocol is applied to the ProFact network. The extensive experimental results on nine public forensic datasets show that our proposed localizer greatly outperforms the state-of-the-art on the generalization ability and robustness of image forgery localization. Code will be publicly available at https://github.com/multimediaFor/ProFact. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2310.19620 [pdf, other]

Large Trajectory Models are Scalable Motion Predictors and Planners

Authors: Qiao Sun, Shiduo Zhang, Danjiao Ma, Jingzhe Shi, Derun Li, Simian Luo, Yu Wang, Ningyi Xu, Guangzhi Cao, Hang Zhao

Abstract: Motion prediction and planning are vital tasks in autonomous driving, and recent efforts have shifted to machine learning-based approaches. The challenges include understanding diverse road topologies, reasoning traffic dynamics over a long time horizon, interpreting heterogeneous behaviors, and generating policies in a large continuous state space. Inspired by the success of large language models… ▽ More Motion prediction and planning are vital tasks in autonomous driving, and recent efforts have shifted to machine learning-based approaches. The challenges include understanding diverse road topologies, reasoning traffic dynamics over a long time horizon, interpreting heterogeneous behaviors, and generating policies in a large continuous state space. Inspired by the success of large language models in addressing similar complexities through model scaling, we introduce a scalable trajectory model called State Transformer (STR). STR reformulates the motion prediction and motion planning problems by arranging observations, states, and actions into one unified sequence modeling task. Our approach unites trajectory generation problems with other sequence modeling problems, powering rapid iterations with breakthroughs in neighbor domains such as language modeling. Remarkably, experimental results reveal that large trajectory models (LTMs), such as STR, adhere to the scaling laws by presenting outstanding adaptability and learning efficiency. Qualitative results further demonstrate that LTMs are capable of making plausible predictions in scenarios that diverge significantly from the training data distribution. LTMs also learn to make complex reasonings for long-term planning, without explicit loss designs or costly high-level annotations. △ Less

Submitted 28 February, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2309.10243 [pdf]

Transferable Adversarial Attack on Image Tampering Localization

Authors: Yuqi Wang, Gang Cao, Zijie Lou, Haochen Zhu

Abstract: It is significant to evaluate the security of existing digital image tampering localization algorithms in real-world applications. In this paper, we propose an adversarial attack scheme to reveal the reliability of such tampering localizers, which would be fooled and fail to predict altered regions correctly. Specifically, the adversarial examples based on optimization and gradient are implemented… ▽ More It is significant to evaluate the security of existing digital image tampering localization algorithms in real-world applications. In this paper, we propose an adversarial attack scheme to reveal the reliability of such tampering localizers, which would be fooled and fail to predict altered regions correctly. Specifically, the adversarial examples based on optimization and gradient are implemented for white/black-box attacks. Correspondingly, the adversarial example is optimized via reverse gradient propagation, and the perturbation is added adaptively in the direction of gradient rising. The black-box attack is achieved by relying on the transferability of such adversarial examples to different localizers. Extensive evaluations verify that the proposed attack sharply reduces the localization accuracy while preserving high visual quality of the attacked images. △ Less

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.09482 [pdf, other]

Spatio-temporal Co-attention Fusion Network for Video Splicing Localization

Authors: Man Lin, Gang Cao, Zijie Lou

Abstract: Digital video splicing has become easy and ubiquitous. Malicious users copy some regions of a video and paste them to another video for creating realistic forgeries. It is significant to blindly detect such forgery regions in videos. In this paper, a spatio-temporal co-attention fusion network (SCFNet) is proposed for video splicing localization. Specifically, a three-stream network is used as an… ▽ More Digital video splicing has become easy and ubiquitous. Malicious users copy some regions of a video and paste them to another video for creating realistic forgeries. It is significant to blindly detect such forgery regions in videos. In this paper, a spatio-temporal co-attention fusion network (SCFNet) is proposed for video splicing localization. Specifically, a three-stream network is used as an encoder to capture manipulation traces across multiple frames. The deep interaction and fusion of spatio-temporal forensic features are achieved by the novel parallel and cross co-attention fusion modules. A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map. A new large-scale video splicing dataset is created for training the SCFNet. Extensive tests on benchmark datasets show that the localization and generalization performances of our SCFNet outperform the state-of-the-art. Code and datasets will be available at https://github.com/multimediaFor/SCFNet. △ Less

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.09306 [pdf, other]

Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion

Authors: Kun Guo, Haochen Zhu, Gang Cao

Abstract: Powerful manipulation techniques have made digital image forgeries be easily created and widespread without leaving visual anomalies. The blind localization of tampered regions becomes quite significant for image forensics. In this paper, we propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder with attention-based feature fusion. Sp… ▽ More Powerful manipulation techniques have made digital image forgeries be easily created and widespread without leaving visual anomalies. The blind localization of tampered regions becomes quite significant for image forensics. In this paper, we propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder with attention-based feature fusion. Specifically, a feature enhancement module is designed to enhance the feature representation ability of the transformer encoder. The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module at multiple scales. Extensive experimental results verify that the proposed scheme achieves the state-of-the-art generalization ability and robustness in various benchmark datasets. Code will be public at https://github.com/multimediaFor/EITLNet. △ Less

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2308.08446 [pdf, other]

CSPM: A Contrastive Spatiotemporal Preference Model for CTR Prediction in On-Demand Food Delivery Services

Authors: Guyu Jiang, Xiaoyun Li, Rongrong Jing, Ruoqi Zhao, Xingliang Ni, Guodong Cao, Ning Hu

Abstract: Click-through rate (CTR) prediction is a crucial task in the context of an online on-demand food delivery (OFD) platform for precisely estimating the probability of a user clicking on food items. Unlike universal e-commerce platforms such as Taobao and Amazon, user behaviors and interests on the OFD platform are more location and time-sensitive due to limited delivery ranges and regional commodity… ▽ More Click-through rate (CTR) prediction is a crucial task in the context of an online on-demand food delivery (OFD) platform for precisely estimating the probability of a user clicking on food items. Unlike universal e-commerce platforms such as Taobao and Amazon, user behaviors and interests on the OFD platform are more location and time-sensitive due to limited delivery ranges and regional commodity supplies. However, existing CTR prediction algorithms in OFD scenarios concentrate on capturing interest from historical behavior sequences, which fails to effectively model the complex spatiotemporal information within features, leading to poor performance. To address this challenge, this paper introduces the Contrastive Sres under different search states using three modules: contrastive spatiotemporal representation learning (CSRL), spatiotemporal preference extractor (StPE), and spatiotemporal information filter (StIF). CSRL utilizes a contrastive learning framework to generate a spatiotemporal activation representation (SAR) for the search action. StPE employs SAR to activate users' diverse preferences related to location and time from the historical behavior sequence field, using a multi-head attention mechanism. StIF incorporates SAR into a gating network to automatically capture important features with latent spatiotemporal effects. Extensive experiments conducted on two large-scale industrial datasets demonstrate the state-of-the-art performance of CSPM. Notably, CSPM has been successfully deployed in Alibaba's online OFD platform Ele.me, resulting in a significant 0.88% lift in CTR, which has substantial business implications. △ Less

Submitted 10 August, 2023; originally announced August 2023.

arXiv:2307.11458 [pdf, other]

Strip-MLP: Efficient Token Interaction for Vision MLP

Authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang

Abstract: Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the feature are down-sampled to a small s… ▽ More Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the feature are down-sampled to a small spatial size. To address this issue, we present a novel method called \textbf{Strip-MLP} to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that allows the token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in the manners of within-patch and cross-patch, which is independent to the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100. The source codes will be available at~\href{https://github.com/Med-Process/Strip_MLP{https://github.com/Med-Process/Strip\_MLP}. △ Less

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.07358 [pdf, other]

Learn from Incomplete Tactile Data: Tactile Representation Learning with Masked Autoencoders

Authors: Guanqun Cao, Jiaqi Jiang, Danushka Bollegala, Shan Luo

Abstract: The missing signal caused by the objects being occluded or an unstable sensor is a common challenge during data collection. Such missing signals will adversely affect the results obtained from the data, and this issue is observed more frequently in robotic tactile perception. In tactile perception, due to the limited working space and the dynamic environment, the contact between the tactile sensor… ▽ More The missing signal caused by the objects being occluded or an unstable sensor is a common challenge during data collection. Such missing signals will adversely affect the results obtained from the data, and this issue is observed more frequently in robotic tactile perception. In tactile perception, due to the limited working space and the dynamic environment, the contact between the tactile sensor and the object is frequently insufficient and unstable, which causes the partial loss of signals, thus leading to incomplete tactile data. The tactile data will therefore contain fewer tactile cues with low information density. In this paper, we propose a tactile representation learning method, named TacMAE, based on Masked Autoencoder to address the problem of incomplete tactile data in tactile perception. In our framework, a portion of the tactile image is masked out to simulate the missing contact region. By reconstructing the missing signals in the tactile image, the trained model can achieve a high-level understanding of surface geometry and tactile properties from limited tactile cues. The experimental results of tactile texture recognition show that our proposed TacMAE can achieve a high recognition accuracy of 71.4% in the zero-shot transfer and 85.8% after fine-tuning, which are 15.2% and 8.2% higher than the results without using masked modeling. The extensive experiments on YCB objects demonstrate the knowledge transferability of our proposed method and the potential to improve efficiency in tactile exploration. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: This paper is accepted at IROS 2023

arXiv:2306.12705 [pdf, other]

Multimodal Zero-Shot Learning for Tactile Texture Recognition

Authors: Guanqun Cao, Jiaqi Jiang, Danushka Bollegala, Min Li, Shan Luo

Abstract: Tactile sensing plays an irreplaceable role in robotic material recognition. It enables robots to distinguish material properties such as their local geometry and textures, especially for materials like textiles. However, most tactile recognition methods can only classify known materials that have been touched and trained with tactile data, yet cannot classify unknown materials that are not traine… ▽ More Tactile sensing plays an irreplaceable role in robotic material recognition. It enables robots to distinguish material properties such as their local geometry and textures, especially for materials like textiles. However, most tactile recognition methods can only classify known materials that have been touched and trained with tactile data, yet cannot classify unknown materials that are not trained with tactile data. To solve this problem, we propose a tactile zero-shot learning framework to recognise unknown materials when they are touched for the first time without requiring training tactile samples. The visual modality, providing tactile cues from sight, and semantic attributes, giving high-level characteristics, are combined together to bridge the gap between touched classes and untouched classes. A generative model is learnt to synthesise tactile features according to corresponding visual images and semantic embeddings, and then a classifier can be trained using the synthesised tactile features of untouched materials for zero-shot recognition. Extensive experiments demonstrate that our proposed multimodal generative model can achieve a high recognition accuracy of 83.06% in classifying materials that were not touched before. The robotic experiment demo and the dataset are available at https://sites.google.com/view/multimodalzsl. △ Less

Submitted 22 June, 2023; originally announced June 2023.

Comments: Under review at Robotics and Autonomous Systems

arXiv:2305.13349 [pdf, other]

Multiclass classification for multidimensional functional data through deep neural networks

Authors: Shuoyang Wang, Guanqun Cao

Abstract: The intrinsically infinite-dimensional features of the functional observations over multidimensional domains render the standard classification methods effectively inapplicable. To address this problem, we introduce a novel multiclass functional deep neural network (mfDNN) classifier as an innovative data mining and classification tool. Specifically, we consider sparse deep neural network architec… ▽ More The intrinsically infinite-dimensional features of the functional observations over multidimensional domains render the standard classification methods effectively inapplicable. To address this problem, we introduce a novel multiclass functional deep neural network (mfDNN) classifier as an innovative data mining and classification tool. Specifically, we consider sparse deep neural network architecture with rectifier linear unit (ReLU) activation function and minimize the cross-entropy loss in the multiclass classification setup. This neural network architecture allows us to employ modern computational tools in the implementation. The convergence rates of the misclassification risk functions are also derived for both fully observed and discretely observed multidimensional functional data. We demonstrate the performance of mfDNN on simulated data and several benchmark datasets from different application domains. △ Less

Submitted 23 May, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

MSC Class: 62G05; 62G08

arXiv:2305.06430 [pdf, other]

HoneyIoT: Adaptive High-Interaction Honeypot for IoT Devices Through Reinforcement Learning

Authors: Chongqi Guan, Heting Liu, Guohong Cao, Sencun Zhu, Thomas La Porta

Abstract: As IoT devices are becoming widely deployed, there exist many threats to IoT-based systems due to their inherent vulnerabilities. One effective approach to improving IoT security is to deploy IoT honeypot systems, which can collect attack information and reveal the methods and strategies used by attackers. However, building high-interaction IoT honeypots is challenging due to the heterogeneity of… ▽ More As IoT devices are becoming widely deployed, there exist many threats to IoT-based systems due to their inherent vulnerabilities. One effective approach to improving IoT security is to deploy IoT honeypot systems, which can collect attack information and reveal the methods and strategies used by attackers. However, building high-interaction IoT honeypots is challenging due to the heterogeneity of IoT devices. Vulnerabilities in IoT devices typically depend on specific device types or firmware versions, which encourages attackers to perform pre-attack checks to gather device information before launching attacks. Moreover, conventional honeypots are easily detected because their replying logic differs from that of the IoT devices they try to mimic. To address these problems, we develop an adaptive high-interaction honeypot for IoT devices, called HoneyIoT. We first build a real device based attack trace collection system to learn how attackers interact with IoT devices. We then model the attack behavior through markov decision process and leverage reinforcement learning techniques to learn the best responses to engage attackers based on the attack trace. We also use differential analysis techniques to mutate response values in some fields to generate high-fidelity responses. HoneyIoT has been deployed on the public Internet. Experimental results show that HoneyIoT can effectively bypass the pre-attack checks and mislead the attackers into uploading malware. Furthermore, HoneyIoT is covert against widely used reconnaissance and honeypot detection tools. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Comments: 11 pages, Wisec 2023

arXiv:2304.03946 [pdf, other]

doi 10.1145/3588964

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement

Authors: Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, Bin Cui

Abstract: With the increasing data volume, there is a trend of using large-scale pre-trained models to store the knowledge into an enormous number of model parameters. The training of these models is composed of lots of dense algebras, requiring a huge amount of hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scala… ▽ More With the increasing data volume, there is a trend of using large-scale pre-trained models to store the knowledge into an enormous number of model parameters. The training of these models is composed of lots of dense algebras, requiring a huge amount of hardware resources. Recently, sparsely-gated Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scalability in various downstream tasks. However, such a sparse conditional computation may not be effective as expected in practical systems due to the routing imbalance and fluctuation problems. Generally, MoEs are becoming a new data analytics paradigm in the data life cycle and suffering from unique challenges at scales, complexities, and granularities never before possible. In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently address the inefficiency caused by dynamic dataflow. We first present an empirical analysis on the problems and opportunities of training MoE models, which motivates us to overcome the routing imbalance and fluctuation problems by a dynamic expert management and device placement mechanism. Then we introduce a novel scheduling module over the existing DNN runtime to monitor the data flow, make the scheduling plans, and dynamically adjust the model-to-hardware mapping guided by the real-time data traffic. A simple but efficient heuristic algorithm is exploited to dynamically optimize the device placement during training. We have conducted experiments on both NLP models (e.g., BERT and GPT) and vision models (e.g., Swin). And results show FlexMoE can achieve superior performance compared with existing systems on real-world workloads -- FlexMoE outperforms DeepSpeed by 1.70x on average and up to 2.10x, and outperforms FasterMoE by 1.30x on average and up to 1.45x. △ Less

Submitted 8 April, 2023; originally announced April 2023.

Comments: Accepted by SIGMOD 2023

Journal ref: Proc. ACM Manag. Data, Vol. 1, No. 1, Article 110. Publication date: May 2023

arXiv:2304.00157 [pdf, other]

Robotic Perception of Transparent Objects: A Review

Authors: Jiaqi Jiang, Guanqun Cao, Jiankang Deng, Thanh-Toan Do, Shan Luo

Abstract: Transparent object perception is a rapidly developing research problem in artificial intelligence. The ability to perceive transparent objects enables robots to achieve higher levels of autonomy, unlocking new applications in various industries such as healthcare, services and manufacturing. Despite numerous datasets and perception methods being proposed in recent years, there is still a lack of i… ▽ More Transparent object perception is a rapidly developing research problem in artificial intelligence. The ability to perceive transparent objects enables robots to achieve higher levels of autonomy, unlocking new applications in various industries such as healthcare, services and manufacturing. Despite numerous datasets and perception methods being proposed in recent years, there is still a lack of in-depth understanding of these methods and the challenges in this field. To address this gap, this article provides a comprehensive survey of the platforms and recent advances for robotic perception of transparent objects. We highlight the main challenges and propose future directions of various transparent object perception tasks, i.e., segmentation, reconstruction, and pose estimation. We also discuss the limitations of existing datasets in diversity and complexity, and the benefits of employing multi-modal sensors, such as RGB-D cameras, thermal cameras, and polarised imaging, for transparent object perception. Furthermore, we identify perception challenges in complex and dynamic environments, as well as for objects with changeable geometries. Finally, we provide an interactive online platform to navigate each reference: \url{https://sites.google.com/view/transperception}. △ Less

Submitted 17 October, 2023; v1 submitted 31 March, 2023; originally announced April 2023.

Comments: 21 pages, 11 figures, Accepted by IEEE Transactions on Artificial Intelligence

arXiv:2303.06311 [pdf, other]

doi 10.1088/1748-0221/18/06/P06005

Generative Adversarial Networks for Scintillation Signal Simulation in EXO-200

Authors: S. Li, I. Ostrovskiy, Z. Li, L. Yang, S. Al Kharusi, G. Anton, I. Badhrees, P. S. Barbeau, D. Beck, V. Belov, T. Bhatta, M. Breidenbach, T. Brunner, G. F. Cao, W. R. Cen, C. Chambers, B. Cleveland, M. Coon, A. Craycraft, T. Daniels, L. Darroch, S. J. Daugherty, J. Davis, S. Delaquis, A. Der Mesrobian-Kabakian , et al. (65 additional authors not shown)

Abstract: Generative Adversarial Networks trained on samples of simulated or actual events have been proposed as a way of generating large simulated datasets at a reduced computational cost. In this work, a novel approach to perform the simulation of photodetector signals from the time projection chamber of the EXO-200 experiment is demonstrated. The method is based on a Wasserstein Generative Adversarial N… ▽ More Generative Adversarial Networks trained on samples of simulated or actual events have been proposed as a way of generating large simulated datasets at a reduced computational cost. In this work, a novel approach to perform the simulation of photodetector signals from the time projection chamber of the EXO-200 experiment is demonstrated. The method is based on a Wasserstein Generative Adversarial Network - a deep learning technique allowing for implicit non-parametric estimation of the population distribution for a given set of objects. Our network is trained on real calibration data using raw scintillation waveforms as input. We find that it is able to produce high-quality simulated waveforms an order of magnitude faster than the traditional simulation approach and, importantly, generalize from the training sample and discern salient high-level features of the data. In particular, the network correctly deduces position dependency of scintillation light response in the detector and correctly recognizes dead photodetector channels. The network output is then integrated into the EXO-200 analysis framework to show that the standard EXO-200 reconstruction routine processes the simulated waveforms to produce energy distributions comparable to that of real waveforms. Finally, the remaining discrepancies and potential ways to improve the approach further are highlighted. △ Less

Submitted 8 May, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

Comments: As accepted by JINST

Journal ref: JINST 18 P06005 2023

arXiv:2301.08413 [pdf, other]

Chaos to Order: A Label Propagation Perspective on Source-Free Domain Adaptation

Authors: Chunwei Wu, Guitao Cao, Yan Li, Xidong Xi, Wenming Cao, Hong Wang

Abstract: Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance… ▽ More Source-free domain adaptation (SFDA), where only a pre-trained source model is used to adapt to the target distribution, is a more general approach to achieving domain adaptation in the real world. However, it can be challenging to capture the inherent structure of the target features accurately due to the lack of supervised information on the target domain. By analyzing the clustering performance of the target features, we show that they still contain core features related to discriminative attributes but lack the collation of semantic information. Inspired by this insight, we present Chaos to Order (CtO), a novel approach for SFDA that strives to constrain semantic credibility and propagate label information among target subpopulations. CtO divides the target data into inner and outlier samples based on the adaptive threshold of the learning state, customizing the learning strategy to fit the data properties best. Specifically, inner samples are utilized for learning intra-class structure thanks to their relatively well-clustered properties. The low-density outlier samples are regularized by input consistency to achieve high accuracy with respect to the ground truth labels. In CtO, by employing different learning strategies to propagate the labels from the inner local to outlier instances, it clusters the global samples from chaos to order. We further adaptively regulate the neighborhood affinity of the inner samples to constrain the local semantic credibility. In theoretical and empirical analyses, we demonstrate that our algorithm not only propagates from inner to outlier but also prevents local clustering from forming spurious clusters. Empirical evidence demonstrates that CtO outperforms the state of the arts on three public benchmarks: Office-31, Office-Home, and VisDA. △ Less

Submitted 14 August, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

Comments: Accepted by ACM MM2023

arXiv:2301.06826 [pdf, other]

Vis2Hap: Vision-based Haptic Rendering by Cross-modal Generation

Authors: Guanqun Cao, Jiaqi Jiang, Ningtao Mao, Danushka Bollegala, Min Li, Shan Luo

Abstract: To assist robots in teleoperation tasks, haptic rendering which allows human operators access a virtual touch feeling has been developed in recent years. Most previous haptic rendering methods strongly rely on data collected by tactile sensors. However, tactile data is not widely available for robots due to their limited reachable space and the restrictions of tactile sensors. To eliminate the nee… ▽ More To assist robots in teleoperation tasks, haptic rendering which allows human operators access a virtual touch feeling has been developed in recent years. Most previous haptic rendering methods strongly rely on data collected by tactile sensors. However, tactile data is not widely available for robots due to their limited reachable space and the restrictions of tactile sensors. To eliminate the need for tactile data, in this paper we propose a novel method named as Vis2Hap to generate haptic rendering from visual inputs that can be obtained from a distance without physical interaction. We take the surface texture of objects as key cues to be conveyed to the human operator. To this end, a generative model is designed to simulate the roughness and slipperiness of the object's surface. To embed haptic cues in Vis2Hap, we use height maps from tactile sensors and spectrograms from friction coefficients as the intermediate outputs of the generative model. Once Vis2Hap is trained, it can be used to generate height maps and spectrograms of new surface textures, from which a friction image can be obtained and displayed on a haptic display. The user study demonstrates that our proposed Vis2Hap method enables users to access a realistic haptic feeling similar to that of physical objects. The proposed vision-based haptic rendering has the potential to enhance human operators' perception of the remote environment and facilitate robotic manipulation. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: This paper is accepted at ICRA 2023

arXiv:2212.08272 [pdf, other]

Communication-Efficient Federated Learning for Heterogeneous Edge Devices Based on Adaptive Gradient Quantization

Authors: Heting Liu, Fang He, Guohong Cao

Abstract: Federated learning (FL) enables geographically dispersed edge devices (i.e., clients) to learn a global model without sharing the local datasets, where each client performs gradient descent with its local data and uploads the gradients to a central server to update the global model. However, FL faces massive communication overhead resulted from uploading the gradients in each training round. To ad… ▽ More Federated learning (FL) enables geographically dispersed edge devices (i.e., clients) to learn a global model without sharing the local datasets, where each client performs gradient descent with its local data and uploads the gradients to a central server to update the global model. However, FL faces massive communication overhead resulted from uploading the gradients in each training round. To address this problem, most existing research compresses the gradients with fixed and unified quantization for all the clients, which neither seeks adaptive quantization due to the varying gradient norms at different rounds, nor exploits the heterogeneity of the clients to accelerate FL. In this paper, we propose a novel adaptive and heterogeneous gradient quantization algorithm (AdaGQ) for FL to minimize the wall-clock training time from two aspects: i) adaptive quantization which exploits the change of gradient norm to adjust the quantization resolution in each training round; and ii) heterogeneous quantization which assigns lower quantization resolution to slow clients to align their training time with other clients to mitigate the communication bottleneck, and higher quantization resolution to fast clients to achieve a better communication efficiency and accuracy tradeoff. Evaluations based on various models and datasets validate the benefits of AdaGQ, reducing the total training time by up to 52.1% compared to baseline algorithms (e.g., FedAvg, QSGD). △ Less

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2212.02992 [pdf, other]

Sparse Message Passing Network with Feature Integration for Online Multiple Object Tracking

Authors: Bisheng Wang, Horst Possegger, Horst Bischof, Guo Cao

Abstract: Existing Multiple Object Tracking (MOT) methods design complex architectures for better tracking performance. However, without a proper organization of input information, they still fail to perform tracking robustly and suffer from frequent identity switches. In this paper, we propose two novel methods together with a simple online Message Passing Network (MPN) to address these limitations. First,… ▽ More Existing Multiple Object Tracking (MOT) methods design complex architectures for better tracking performance. However, without a proper organization of input information, they still fail to perform tracking robustly and suffer from frequent identity switches. In this paper, we propose two novel methods together with a simple online Message Passing Network (MPN) to address these limitations. First, we explore different integration methods for the graph node and edge embeddings and put forward a new IoU (Intersection over Union) guided function, which improves long term tracking and handles identity switches. Second, we introduce a hierarchical sampling strategy to construct sparser graphs which allows to focus the training on more difficult samples. Experimental results demonstrate that a simple online MPN with these two contributions can perform better than many state-of-the-art methods. In addition, our association method generalizes well and can also improve the results of private detection based methods. △ Less

Submitted 6 December, 2022; originally announced December 2022.

Comments: 8 pages, 2 figures

arXiv:2212.02220 [pdf]

Semi-Supervised Representative Region Texture Extraction of Façade

Authors: Zhen Ni, Guitao Cao, Ye Duan

Abstract: Researches of analysis and parsing around façades to enrich the 3D feature of façade models by semantic information raised some attention in the community, whose main idea is to generate higher resolution components with similar shapes and textures to increase the overall resolution at the expense of reconstruction accuracy. While this approach works well for components like windows and doors, the… ▽ More Researches of analysis and parsing around façades to enrich the 3D feature of façade models by semantic information raised some attention in the community, whose main idea is to generate higher resolution components with similar shapes and textures to increase the overall resolution at the expense of reconstruction accuracy. While this approach works well for components like windows and doors, there is no solution for façade background at present. In this paper, we introduce the concept of representative region texture, which can be used in the above modeling approach by tiling the representative texture around the façade region, and propose a semi-supervised way to do representative region texture extraction from a façade image. Our method does not require any additional labelled data to train as long as the semantic information is given, while a traditional end-to-end model requires plenty of data to increase its performance. Our method can extract texture from any repetitive images, not just façade, which is not capable in an end-to-end model as it relies on the distribution of training set. Clustering with weighted distance is introduced to further increase the robustness to noise or an imprecise segmentation, and make the extracted texture have a higher resolution and more suitable for tiling. We verify our method on various façade images, and the result shows our method has a significant performance improvement compared to only a random crop on façade. We also demonstrate some application scenarios and proposed a façade modeling workflow with the representative region texture, which has a better visual resolution for a regular façade. △ Less

Submitted 5 December, 2022; originally announced December 2022.

Comments: 13 pages, 10 figures

arXiv:2211.03509 [pdf]

Black-Box Attack against GAN-Generated Image Detector with Contrastive Perturbation

Authors: Zijie Lou, Gang Cao, Man Lin

Abstract: Visually realistic GAN-generated facial images raise obvious concerns on potential misuse. Many effective forensic algorithms have been developed to detect such synthetic images in recent years. It is significant to assess the vulnerability of such forensic detectors against adversarial attacks. In this paper, we propose a new black-box attack method against GAN-generated image detectors. A novel… ▽ More Visually realistic GAN-generated facial images raise obvious concerns on potential misuse. Many effective forensic algorithms have been developed to detect such synthetic images in recent years. It is significant to assess the vulnerability of such forensic detectors against adversarial attacks. In this paper, we propose a new black-box attack method against GAN-generated image detectors. A novel contrastive learning strategy is adopted to train the encoder-decoder network based anti-forensic model under a contrastive loss function. GAN images and their simulated real counterparts are constructed as positive and negative samples, respectively. Leveraging on the trained attack model, imperceptible contrastive perturbation could be applied to input synthetic images for removing GAN fingerprint to some extent. As such, existing GAN-generated image detectors are expected to be deceived. Extensive experimental results verify that the proposed attack effectively reduces the accuracy of three state-of-the-art detectors on six popular GANs. High visual quality of the attacked images is also achieved. The source code will be available at https://github.com/ZXMMD/BAttGAND. △ Less

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2209.09427 [pdf, other]

Spatiotemporal-Enhanced Network for Click-Through Rate Prediction in Location-based Services

Authors: Shaochuan Lin, Yicong Yu, Xiyu Ji, Taotao Zhou, Hengxu He, Zisen Sang, Jia Jia, Guodong Cao, Ning Hu

Abstract: In Location-Based Services(LBS), user behavior naturally has a strong dependence on the spatiotemporal information, i.e., in different geographical locations and at different times, user click behavior will change significantly. Appropriate spatiotemporal enhancement modeling of user click behavior and large-scale sparse attributes is key to building an LBS model. Although most of existing methods… ▽ More In Location-Based Services(LBS), user behavior naturally has a strong dependence on the spatiotemporal information, i.e., in different geographical locations and at different times, user click behavior will change significantly. Appropriate spatiotemporal enhancement modeling of user click behavior and large-scale sparse attributes is key to building an LBS model. Although most of existing methods have been proved to be effective, they are difficult to apply to takeaway scenarios due to insufficient modeling of spatiotemporal information. In this paper, we address this challenge by seeking to explicitly model the timing and locations of interactions and proposing a Spatiotemporal-Enhanced Network, namely StEN. In particular, StEN applies a Spatiotemporal Profile Activation module to capture common spatiotemporal preference through attribute features. A Spatiotemporal Preference Activation is further applied to model the personalized spatiotemporal preference embodied by behaviors in detail. Moreover, a Spatiotemporal-aware Target Attention mechanism is adopted to generate different parameters for target attention at different locations and times, thereby improving the personalized spatiotemporal awareness of the model.Comprehensive experiments are conducted on three large-scale industrial datasets, and the results demonstrate the state-of-the-art performance of our methods. In addition, we have also released an industrial dataset for takeaway industry to make up for the lack of public datasets in this community. △ Less

Submitted 19 September, 2022; originally announced September 2022.

Comments: accepted by CIKM workshop 2022

arXiv:2208.13739 [pdf]

Effective Image Tampering Localization with Multi-Scale ConvNeXt Feature Fusion

Authors: Haochen Zhu, Gang Cao, Mo Zhao

Abstract: With the widespread use of powerful image editing tools, image tampering becomes easy and realistic. Existing image forensic methods still face challenges of low generalization performance and robustness. In this letter, we propose an effective image tampering localization scheme based on ConvNeXt network and multi-scale feature fusion. Stacked ConvNeXt blocks are used as an encoder to capture hie… ▽ More With the widespread use of powerful image editing tools, image tampering becomes easy and realistic. Existing image forensic methods still face challenges of low generalization performance and robustness. In this letter, we propose an effective image tampering localization scheme based on ConvNeXt network and multi-scale feature fusion. Stacked ConvNeXt blocks are used as an encoder to capture hierarchical multi-scale features, which are then fused in decoder for locating tampered pixels accurately. Combined loss and effective data augmentation are adopted to further improve the model performance. Extensive experimental results show that localization performance of our proposed scheme outperforms other state-of-the-art ones. The source code will be available at https://github.com/ZhuHC98/ITL-SSN. △ Less

Submitted 16 January, 2023; v1 submitted 29 August, 2022; originally announced August 2022.

arXiv:2208.09743 [pdf, other]

Where Shall I Touch? Vision-Guided Tactile Poking for Transparent Object Grasping

Authors: Jiaqi Jiang, Guanqun Cao, Aaron Butterworth, Thanh-Toan Do, Shan Luo

Abstract: Picking up transparent objects is still a challenging task for robots. The visual properties of transparent objects such as reflection and refraction make the current grasping methods that rely on camera sensing fail to detect and localise them. However, humans can handle the transparent object well by first observing its coarse profile and then poking an area of interest to get a fine profile for… ▽ More Picking up transparent objects is still a challenging task for robots. The visual properties of transparent objects such as reflection and refraction make the current grasping methods that rely on camera sensing fail to detect and localise them. However, humans can handle the transparent object well by first observing its coarse profile and then poking an area of interest to get a fine profile for grasping. Inspired by this, we propose a novel framework of vision-guided tactile poking for transparent objects grasping. In the proposed framework, a segmentation network is first used to predict the horizontal upper regions named as poking regions, where the robot can poke the object to obtain a good tactile reading while leading to minimal disturbance to the object's state. A poke is then performed with a high-resolution GelSight tactile sensor. Given the local profiles improved with the tactile reading, a heuristic grasp is planned for grasping the transparent object. To mitigate the limitations of real-world data collection and labelling for transparent objects, a large-scale realistic synthetic dataset was constructed. Extensive experiments demonstrate that our proposed segmentation network can predict the potential poking region with a high mean Average Precision (mAP) of 0.360, and the vision-guided tactile poking can enhance the grasping success rate significantly from 38.9% to 85.2%. Thanks to its simplicity, our proposed approach could also be adopted by other force or tactile sensors and could be used for grasping of other challenging objects. All the materials used in this paper are available at https://sites.google.com/view/tactilepoking. △ Less

Submitted 20 August, 2022; originally announced August 2022.

Comments: 11 pages, 11 figures, accepted by T-Mech

arXiv:2207.06754 [pdf, other]

E2-AEN: End-to-End Incremental Learning with Adaptively Expandable Network

Authors: Guimei Cao, Zhanzhan Cheng, Yunlu Xu, Duo Li, Shiliang Pu, Yi Niu, Fei Wu

Abstract: Expandable networks have demonstrated their advantages in dealing with catastrophic forgetting problem in incremental learning. Considering that different tasks may need different structures, recent methods design dynamic structures adapted to different tasks via sophisticated skills. Their routine is to search expandable structures first and then train on the new tasks, which, however, breaks tas… ▽ More Expandable networks have demonstrated their advantages in dealing with catastrophic forgetting problem in incremental learning. Considering that different tasks may need different structures, recent methods design dynamic structures adapted to different tasks via sophisticated skills. Their routine is to search expandable structures first and then train on the new tasks, which, however, breaks tasks into multiple training stages, leading to suboptimal or overmuch computational cost. In this paper, we propose an end-to-end trainable adaptively expandable network named E2-AEN, which dynamically generates lightweight structures for new tasks without any accuracy drop in previous tasks. Specifically, the network contains a serial of powerful feature adapters for augmenting the previously learned representations to new tasks, and avoiding task interference. These adapters are controlled via an adaptive gate-based pruning strategy which decides whether the expanded structures can be pruned, making the network structure dynamically changeable according to the complexity of the new tasks. Moreover, we introduce a novel sparsity-activation regularization to encourage the model to learn discriminative features with limited parameters. E2-AEN reduces cost and can be built upon any feed-forward architectures in an end-to-end manner. Extensive experiments on both classification (i.e., CIFAR and VDD) and detection (i.e., COCO, VOC and ICCV2021 SSLAD challenge) benchmarks demonstrate the effectiveness of the proposed method, which achieves the new remarkable results. △ Less

Submitted 14 July, 2022; originally announced July 2022.

arXiv:2207.04907 [pdf, other]

A4T: Hierarchical Affordance Detection for Transparent Objects Depth Reconstruction and Manipulation

Authors: Jiaqi Jiang, Guanqun Cao, Thanh-Toan Do, Shan Luo

Abstract: Transparent objects are widely used in our daily lives and therefore robots need to be able to handle them. However, transparent objects suffer from light reflection and refraction, which makes it challenging to obtain the accurate depth maps required to perform handling tasks. In this paper, we propose a novel affordance-based framework for depth reconstruction and manipulation of transparent obj… ▽ More Transparent objects are widely used in our daily lives and therefore robots need to be able to handle them. However, transparent objects suffer from light reflection and refraction, which makes it challenging to obtain the accurate depth maps required to perform handling tasks. In this paper, we propose a novel affordance-based framework for depth reconstruction and manipulation of transparent objects, named A4T. A hierarchical AffordanceNet is first used to detect the transparent objects and their associated affordances that encode the relative positions of an object's different parts. Then, given the predicted affordance map, a multi-step depth reconstruction method is used to progressively reconstruct the depth maps of transparent objects. Finally, the reconstructed depth maps are employed for the affordance-based manipulation of transparent objects. To evaluate our proposed method, we construct a real-world dataset TRANS-AFF with affordances and depth maps of transparent objects, which is the first of its kind. Extensive experiments show that our proposed methods can predict accurate affordance maps, and significantly improve the depth reconstruction of transparent objects compared to the state-of-the-art method, with the Root Mean Squared Error in meters significantly decreased from 0.097 to 0.042. Furthermore, we demonstrate the effectiveness of our proposed method with a series of robotic manipulation experiments on transparent objects. See supplementary video and results at https://sites.google.com/view/affordance4trans. △ Less

Submitted 11 July, 2022; originally announced July 2022.

Comments: 8 pages, 9 figures, Accepted by RAL-CASE2022

arXiv:2206.02158 [pdf, other]

Vanilla Feature Distillation for Improving the Accuracy-Robustness Trade-Off in Adversarial Training

Authors: Guodong Cao, Zhibo Wang, Xiaowei Dong, Zhifei Zhang, Hengchang Guo, Zhan Qin, Kui Ren

Abstract: Adversarial training has been widely explored for mitigating attacks against deep models. However, most existing works are still trapped in the dilemma between higher accuracy and stronger robustness since they tend to fit a model towards robust features (not easily tampered with by adversaries) while ignoring those non-robust but highly predictive features. To achieve a better robustness-accuracy… ▽ More Adversarial training has been widely explored for mitigating attacks against deep models. However, most existing works are still trapped in the dilemma between higher accuracy and stronger robustness since they tend to fit a model towards robust features (not easily tampered with by adversaries) while ignoring those non-robust but highly predictive features. To achieve a better robustness-accuracy trade-off, we propose the Vanilla Feature Distillation Adversarial Training (VFD-Adv), which conducts knowledge distillation from a pre-trained model (optimized towards high accuracy) to guide adversarial training towards higher accuracy, i.e., preserving those non-robust but predictive features. More specifically, both adversarial examples and their clean counterparts are forced to be aligned in the feature space by distilling predictive representations from the pre-trained/clean model, while previous works barely utilize predictive features from clean models. Therefore, the adversarial training model is updated towards maximally preserving the accuracy as gaining robustness. A key advantage of our method is that it can be universally adapted to and boost existing works. Exhaustive experiments on various datasets, classification models, and adversarial training algorithms demonstrate the effectiveness of our proposed method. △ Less

Submitted 5 June, 2022; originally announced June 2022.

Comments: 12 pages

arXiv:2205.08592 [pdf, other]

Deep Neural Network Classifier for Multi-dimensional Functional Data

Authors: Shuoyang Wang, Guanqun Cao, Zuofeng Shang

Abstract: We propose a new approach, called as functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained based on the principle components of the training data which shall be used to predict the class label of a future data function. Unlike the popular functional discriminant analysis approaches which rely on Gaussian assumption,… ▽ More We propose a new approach, called as functional deep neural network (FDNN), for classifying multi-dimensional functional data. Specifically, a deep neural network is trained based on the principle components of the training data which shall be used to predict the class label of a future data function. Unlike the popular functional discriminant analysis approaches which rely on Gaussian assumption, the proposed FDNN approach applies to general non-Gaussian multi-dimensional functional data. Moreover, when the log density ratio possesses a locally connected functional modular structure, we show that FDNN achieves minimax optimality. The superiority of our approach is demonstrated through both simulated and real-world datasets. △ Less

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2201.11853 [pdf, other]

Prediction of GPU Failures Under Deep Learning Workloads

Authors: Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo

Abstract: Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause severe consequences in DL tasks: they disrupt distributed trainings, crash inference services, and result in service level agreement violations. To mitigate the problem caused by GPU failures, we propose to predict failures by using ML models. Thi… ▽ More Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause severe consequences in DL tasks: they disrupt distributed trainings, crash inference services, and result in service level agreement violations. To mitigate the problem caused by GPU failures, we propose to predict failures by using ML models. This paper is the first to study prediction models of GPU failures under large-scale production deep learning workloads. As a starting point, we evaluate classic prediction models and observe that predictions of these models are both inaccurate and unstable. To improve the precision and stability of predictions, we propose several techniques, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performances of our various techniques on a four-month production dataset including 350 million entries. The results show that our proposed techniques improve the prediction precision from 46.3\% to 84.0\%. △ Less

Submitted 27 January, 2022; originally announced January 2022.

arXiv:2201.04924 [pdf, other]

Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners

Authors: Duo Li, Guimei Cao, Yunlu Xu, Zhanzhan Cheng, Yi Niu

Abstract: In the SSLAD-Track 3B challenge on continual learning, we propose the method of COntinual Learning with Transformer (COLT). We find that transformers suffer less from catastrophic forgetting compared to convolutional neural network. The major principle of our method is to equip the transformer based feature extractor with old knowledge distillation and head expanding strategies to compete catastro… ▽ More In the SSLAD-Track 3B challenge on continual learning, we propose the method of COntinual Learning with Transformer (COLT). We find that transformers suffer less from catastrophic forgetting compared to convolutional neural network. The major principle of our method is to equip the transformer based feature extractor with old knowledge distillation and head expanding strategies to compete catastrophic forgetting. In this report, we first introduce the overall framework of continual learning for object detection. Then, we analyse the key elements' effect on withstanding catastrophic forgetting in our solution. Our method achieves 70.78 mAP on the SSLAD-Track 3B challenge test set. △ Less

Submitted 13 January, 2022; originally announced January 2022.

Comments: Rank 1st on ICCV2021 SSLAD-Track 3B

arXiv:2112.14298 [pdf, ps, other]

Multimodal perception for dexterous manipulation

Authors: Guanqun Cao, Shan Luo

Abstract: Humans usually perceive the world in a multimodal way that vision, touch, sound are utilised to understand surroundings from various dimensions. These senses are combined together to achieve a synergistic effect where the learning is more effectively than using each sense separately. For robotics, vision and touch are two key senses for the dexterous manipulation. Vision usually gives us apparent… ▽ More Humans usually perceive the world in a multimodal way that vision, touch, sound are utilised to understand surroundings from various dimensions. These senses are combined together to achieve a synergistic effect where the learning is more effectively than using each sense separately. For robotics, vision and touch are two key senses for the dexterous manipulation. Vision usually gives us apparent features like shape, color, and the touch provides local information such as friction, texture, etc. Due to the complementary properties between visual and tactile senses, it is desirable for us to combine vision and touch for a synergistic perception and manipulation. Many researches have been investigated about multimodal perception such as cross-modal learning, 3D reconstruction, multimodal translation with vision and touch. Specifically, we propose a cross-modal sensory data generation framework for the translation between vision and touch, which is able to generate realistic pseudo data. By using this cross-modal translation method, it is desirable for us to make up inaccessible data, helping us to learn the object's properties from different views. Recently, the attention mechanism becomes a popular method either in visual perception or in tactile perception. We propose a spatio-temporal attention model for tactile texture recognition, which takes both spatial features and time dimension into consideration. Our proposed method not only pays attention to the salient features in each spatial feature, but also models the temporal correlation in the through the time. The obvious improvement proves the efficiency of our selective attention mechanism. The spatio-temporal attention method has potential in many applications such as grasping, recognition, and multimodal perception. △ Less

Submitted 28 December, 2021; originally announced December 2021.

Comments: 19 pages, 10 figures

arXiv:2112.09216 [pdf, other]

A Deep-Learning Framework for Improving COVID-19 CT Image Quality and Diagnostic Accuracy

Authors: Garvit Goel, Jingyuan Qi, Wu-chun Feng, Guohua Cao

Abstract: We present a deep-learning based computing framework for fast-and-accurate CT (DL-FACT) testing of COVID-19. Our CT-based DL framework was developed to improve the testing speed and accuracy of COVID-19 (plus its variants) via a DL-based approach for CT image enhancement and classification. The image enhancement network is adapted from DDnet, short for DenseNet and Deconvolution based network. To… ▽ More We present a deep-learning based computing framework for fast-and-accurate CT (DL-FACT) testing of COVID-19. Our CT-based DL framework was developed to improve the testing speed and accuracy of COVID-19 (plus its variants) via a DL-based approach for CT image enhancement and classification. The image enhancement network is adapted from DDnet, short for DenseNet and Deconvolution based network. To demonstrate its speed and accuracy, we evaluated DL-FACT across several sources of COVID-19 CT images. Our results show that DL-FACT can significantly shorten the turnaround time from days to minutes and improve the COVID-19 testing accuracy up to 91%. DL-FACT could be used as a software tool for medical professionals in diagnosing and monitoring COVID-19. △ Less

Submitted 16 December, 2021; originally announced December 2021.

Comments: 10 pages

arXiv:2112.02439 [pdf, other]

Deep Learning on Mobile Devices Through Neural Processing Units and Edge Computing

Authors: Tianxiang Tan, Guohong Cao

Abstract: Deep Neural Network (DNN) is becoming adopted for video analytics on mobile devices. To reduce the delay of running DNNs, many mobile devices are equipped with Neural Processing Units (NPU). However, due to the resource limitations of NPU, these DNNs have to be compressed to increase the processing speed at the cost of accuracy. To address the low accuracy problem, we propose a Confidence Based Of… ▽ More Deep Neural Network (DNN) is becoming adopted for video analytics on mobile devices. To reduce the delay of running DNNs, many mobile devices are equipped with Neural Processing Units (NPU). However, due to the resource limitations of NPU, these DNNs have to be compressed to increase the processing speed at the cost of accuracy. To address the low accuracy problem, we propose a Confidence Based Offloading (CBO) framework for deep learning video analytics. The major challenge is to determine when to return the NPU classification result based on the confidence level of running the DNN, and when to offload the video frames to the server for further processing to increase the accuracy. We first identify the problem of using existing confidence scores to make offloading decisions, and propose confidence score calibration techniques to improve the performance. Then, we formulate the CBO problem where the goal is to maximize accuracy under some time constraint, and propose an adaptive solution that determines which frames to offload at what resolution based on the confidence score and the network condition. Through real implementations and extensive evaluations, we demonstrate that the proposed solution can significantly outperform other approaches. △ Less

Submitted 4 December, 2021; originally announced December 2021.

arXiv:2109.00590 [pdf, other]

WebQA: Multihop and Multimodal QA

Authors: Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, Yonatan Bisk

Abstract: Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches, requires fundamental advances in visual representation learning, knowledge aggregation, and language generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models which lack language groundable visual representations for novel ob… ▽ More Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches, requires fundamental advances in visual representation learning, knowledge aggregation, and language generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models which lack language groundable visual representations for novel objects and the ability to reason, yet trivial for humans. WebQA mirrors the way humans use the web: 1) Ask a question, 2) Choose sources to aggregate, and 3) Produce a fluent language response. This is the behavior we should be expecting from IoT devices and digital assistants. Existing work prefers to assume that a model can either reason about knowledge in images or in text. WebQA includes a secondary text-only QA task to ensure improved visual performance does not come at the cost of language understanding. Our challenge for the community is to create unified multimodal reasoning models that answer questions regardless of the source modality, moving us closer to digital assistants that not only query language knowledge, but also the richer visual online world. △ Less

Submitted 27 March, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

Comments: CVPR Camera ready

Showing 1–50 of 88 results for author: Cao, G