Search | arXiv e-print repository

Optimal camera-robot pose estimation in linear time from points and lines

Authors: Guangyang Zeng, Biqiang Mu, Qingcheng Zeng, Yuchen Song, Chulin Dai, Guodong Shi, Junfeng Wu

Abstract: Camera pose estimation is a fundamental problem in robotics. This paper focuses on two issues of interest: First, point and line features have complementary advantages, and it is of great value to design a uniform algorithm that can fuse them effectively; Second, with the development of modern front-end techniques, a large number of features can exist in a single image, which presents a potential… ▽ More Camera pose estimation is a fundamental problem in robotics. This paper focuses on two issues of interest: First, point and line features have complementary advantages, and it is of great value to design a uniform algorithm that can fuse them effectively; Second, with the development of modern front-end techniques, a large number of features can exist in a single image, which presents a potential for highly accurate robot pose estimation. With these observations, we propose AOPnP(L), an optimal linear-time camera-robot pose estimation algorithm from points and lines. Specifically, we represent a line with two distinct points on it and unify the noise model for point and line measurements where noises are added to 2D points in the image. By utilizing Plucker coordinates for line parameterization, we formulate a maximum likelihood (ML) problem for combined point and line measurements. To optimally solve the ML problem, AOPnP(L) adopts a two-step estimation scheme. In the first step, a consistent estimate that can converge to the true pose is devised by virtue of bias elimination. In the second step, a single Gauss-Newton iteration is executed to refine the initial estimate. AOPnP(L) features theoretical optimality in the sense that its mean squared error converges to the Cramer-Rao lower bound. Moreover, it owns a linear time complexity. These properties make it well-suited for precision-demanding and real-time robot pose estimation. Extensive experiments are conducted to validate our theoretical developments and demonstrate the superiority of AOPnP(L) in both static localization and dynamic odometry systems. △ Less

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.08931 [pdf, other]

Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection

Authors: Xingyu Peng, Yan Bai, Chen Gao, Lirong Yang, Fei Xia, Beipeng Mu, Xiaofei Wang, Si Liu

Abstract: Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done to deal with the OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information, both object level and scene level, to generate trustful detection results. However, previous l… ▽ More Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done to deal with the OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information, both object level and scene level, to generate trustful detection results. However, previous lidar-based OVD methods only focus on the usage of object-level features, ignoring the essence of scene-level information. In this paper, we propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task, which contains a local branch to generate object-level detection result and a global branch to obtain scene-level global feature. With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference, and the detection result can be refined accordingly. We further propose Reflected Pseudo Labels Generation (RPLG) to generate high-quality pseudo labels for supervision and Background-Aware Object Localization (BAOL) to select precise object proposals. Extensive experiments on ScanNetV2 and SUN RGB-D demonstrate the superiority of our methods. Code is released at https://github.com/GradiusTwinbee/GLIS. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: accepted by ECCV 2024

arXiv:2405.03152 [pdf, other]

MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

Authors: Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

Abstract: Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. H… ▽ More Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed the multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, leveraging multi-modal correction, and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements the global linguistic information by incorporating regular 1-best hypotheses atop fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction for the multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline. △ Less

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02132 [pdf, other]

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu… ▽ More Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research. △ Less

Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2403.07372 [pdf, other]

Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection

Authors: Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, Si Liu

Abstract: Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflic… ▽ More Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fusion BEV features free from cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fusion BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance in the highly competitive nuScenes 3D object detection dataset. The code is released at https://github.com/fjhzhixi/ECFusion. △ Less

Submitted 12 March, 2024; originally announced March 2024.

Comments: Accepted by ICRA 2024

arXiv:2403.01174 [pdf, other]

Consistent and Asymptotically Statistically-Efficient Solution to Camera Motion Estimation

Authors: Guangyang Zeng, Qingcheng Zeng, Xinghan Li, Biqiang Mu, Jiming Chen, Ling Shi, Junfeng Wu

Abstract: Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. The existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and no… ▽ More Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental issue in the computer vision community. The existing works generally set out from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we dive into the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose a two-step algorithm to solve it: In the first step, we estimate the variance of measurement noises and devise a consistent estimator based on bias elimination; In the second step, we execute a one-step Gauss-Newton iteration on manifold to refine the consistent estimate. We prove that the proposed estimate owns the same asymptotic statistical properties as the ML estimate: The first is consistency, i.e., the estimate converges to the ground truth as the point number increases; The second is asymptotic efficiency, i.e., the mean squared error of the estimate converges to the theoretical lower bound -- Cramer-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics endow our estimator with a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the point number reaches the order of hundreds, our estimator outperforms the state-of-the-art ones in terms of estimation accuracy and CPU time. △ Less

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2401.00475 [pdf, other]

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Authors: Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

Abstract: This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emo… ▽ More This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed from speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. In various evaluation metrics, E-chat consistently outperforms baseline model, demonstrating its potential in emotional comprehension and human-machine interaction. △ Less

Submitted 27 July, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

Comments: 5 pages, 3 figures

arXiv:2312.09746 [pdf, other]

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

Authors: Bingshen Mu, Pengcheng Guo, Dake Guo, Pan Zhou, Wei Chen, Lei Xie

Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition perf… ▽ More Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. First of all, we propose two attention-based automatic channel selection modules to select the most advantageous subset of multi-channel signals from multiple recording devices for each utterance. Furthermore, we introduce inter-channel spatial features to augment the effectiveness of multi-frame cross-channel attention, aiding it in improving the capability of spatial information awareness. Finally, we propose a multi-layer convolution fusion module drawing inspiration from the U-Net architecture to integrate the multi-channel output into a single-channel output. Experimental results on the CHiME-7 corpus with oracle segmentation demonstrate that the improvements introduced in our proposed ASR system lead to a relative reduction of 40.1% in the Macro Diarization Attributed Word Error Rates (DA-WER) when compared to the baseline ASR system on the Eval sets. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024

arXiv:2312.03408 [pdf, other]

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

Authors: Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao

Abstract: With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively sim… ▽ More With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and is limited to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI. △ Less

Submitted 22 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: This article is a simplified English translation of corresponding Chinese article. Please refer to Chinese version for the complete content

arXiv:2308.09346 [pdf, other]

Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching

Authors: Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, Yong Liu

Abstract: Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every ta… ▽ More Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM. △ Less

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV2023

arXiv:2305.12493 [pdf, other]

Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

Authors: Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, Lei Xie

Abstract: Contextual information plays a crucial role in speech recognition technologies and incorporating it into the end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts context… ▽ More Contextual information plays a crucial role in speech recognition technologies and incorporating it into the end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts context phrases in utterances using contextual embeddings and calculates bias loss to assist in the training of the contextualized model. Our method achieved a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and the WER of the context phrases decreases relatively by 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation when using a larger biasing list. △ Less

Submitted 12 July, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted by interspeech2023

arXiv:2301.07944 [pdf, other]

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Authors: Jiazheng Xing, Mengmeng Wang, Yong Liu, Boyu Mu

Abstract: Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter featu… ▽ More Spatial and temporal modeling is one of the most core aspects of few-shot action recognition. Most previous works mainly focus on long-term temporal relation modeling based on high-level spatial representations, without considering the crucial low-level spatial features and short-term temporal relations. Actually, the former feature could bring rich local semantic information, and the latter feature could represent motion characteristics of adjacent frames, respectively. In this paper, we propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner. First, to exploit the low-level spatial features, we design a feature fusion architecture search module to automatically search for the best combination of the low-level and high-level spatial features. Next, inspired by the recent transformer, we introduce a long-term temporal modeling module to model the global temporal relations based on the extracted spatial appearance features. Meanwhile, we design another short-term temporal modeling module to encode the motion characteristics between adjacent frame representations. After that, the final predictions can be obtained by feeding the embedded rich spatial-temporal features to a common frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods in all datasets. △ Less

Submitted 7 April, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

arXiv:2209.06779 [pdf, ps, other]

Efficient Planar Pose Estimation via UWB Measurements

Authors: Haodong Jiang, Wentao Wang, Yuan Shen, Xinghan Li, Xiaoqiang Ren, Biqiang Mu, Junfeng Wu

Abstract: State estimation is an essential part of autonomous systems. Integrating the Ultra-Wideband(UWB) technique has been shown to correct the long-term estimation drift and bypass the complexity of loop closure detection. However, few works on robotics adopt UWB as a stand-alone state estimation solution. The primary purpose of this work is to investigate planar pose estimation using only UWB range mea… ▽ More State estimation is an essential part of autonomous systems. Integrating the Ultra-Wideband(UWB) technique has been shown to correct the long-term estimation drift and bypass the complexity of loop closure detection. However, few works on robotics adopt UWB as a stand-alone state estimation solution. The primary purpose of this work is to investigate planar pose estimation using only UWB range measurements and study the estimator's statistical efficiency. We prove the excellent property of a two-step scheme, which says that we can refine a consistent estimator to be asymptotically efficient by one step of Gauss-Newton iteration. Grounded on this result, we design the GN-ULS estimator and evaluate it through simulations and collected datasets. GN-ULS attains millimeter and sub-degree level accuracy on our static datasets and attains centimeter and degree level accuracy on our dynamic datasets, presenting the possibility of using only UWB for real-time state estimation. △ Less

Submitted 27 February, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

Comments: Update the content and improve consistency with the ICRA version

arXiv:2209.05824 [pdf, other]

CPnP: Consistent Pose Estimator for Perspective-n-Point Problem with Bias Elimination

Authors: Guangyang Zeng, Shiyu Chen, Biqiang Mu, Guodong Shi, Junfeng Wu

Abstract: The Perspective-n-Point (PnP) problem has been widely studied in both computer vision and photogrammetry societies. With the development of feature extraction techniques, a large number of feature points might be available in a single shot. It is promising to devise a consistent estimator, i.e., the estimate can converge to the true camera pose as the number of points increases. To this end, we pr… ▽ More The Perspective-n-Point (PnP) problem has been widely studied in both computer vision and photogrammetry societies. With the development of feature extraction techniques, a large number of feature points might be available in a single shot. It is promising to devise a consistent estimator, i.e., the estimate can converge to the true camera pose as the number of points increases. To this end, we propose a consistent PnP solver, named \emph{CPnP}, with bias elimination. Specifically, linear equations are constructed from the original projection model via measurement model modification and variable elimination, based on which a closed-form least-squares solution is obtained. We then analyze and subtract the asymptotic bias of this solution, resulting in a consistent estimate. Additionally, Gauss-Newton (GN) iterations are executed to refine the consistent solution. Our proposed estimator is efficient in terms of computations -- it has $O(n)$ computational complexity. Experimental tests on both synthetic data and real images show that our proposed estimator is superior to some well-known ones for images with dense visual features, in terms of estimation precision and computing time. △ Less

Submitted 13 September, 2022; originally announced September 2022.

arXiv:2202.06312 [pdf, other]

Progressive Backdoor Erasing via connecting Backdoor and Adversarial Attacks

Authors: Bingxu Mu, Zhenxing Niu, Le Wang, Xue Wang, Rong Jin, Gang Hua

Abstract: Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks as well as adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdo… ▽ More Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks as well as adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdoors, we observe that its adversarial examples have similar behaviors as its triggered images, i.e., both activate the same subset of DNN neurons. It indicates that planting a backdoor into a model will significantly affect the model's adversarial examples. Based on these observations, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks. Different from previous backdoor defense methods, one significant advantage of our approach is that it can erase backdoor even when the clean extra dataset is unavailable. We empirically show that, against 5 state-of-the-art backdoor attacks, our PBE can effectively erase the backdoor without obvious performance degradation on clean samples and significantly outperforms existing defense methods. △ Less

Submitted 26 December, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

arXiv:2106.03947 [pdf, other]

TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

Authors: Saeed Soori, Bugra Can, Baourun Mu, Mert Gürbüzbalaban, Maryam Mehri Dehnavi

Abstract: This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with… ▽ More This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with approximation. However, the approximations do not reduce the overall time significantly and lead to less accurate parameter updates and loss of curvature information. TENGraD improves the time efficiency of NGD by computing Fisher block inverses with a computationally efficient covariance factorization and reuse method. It computes the inverse of each block exactly using the Woodbury matrix identity to preserve curvature information while admitting (linear) fast convergence rates. Our experiments on image classification tasks for state-of-the-art deep neural architecture on CIFAR-10, CIFAR-100, and Fashion-MNIST show that TENGraD significantly outperforms state-of-the-art NGD methods and often stochastic gradient descent in wall-clock time. △ Less

Submitted 3 March, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:1911.03615 [pdf, ps, other]

Universal Flying Objects: Modular Multirotor System for Flight of Rigid Objects

Authors: Bingguo Mu, Pakpong Chirarattananon

Abstract: We introduce UFO, a modular aerial robotic platform for transforming a rigid object into a multirotor robot. To achieve this, we develop flight modules, in the form of a control module and propelling modules, that can be affixed to an object. The object, or payload, serves as the airframe of the vehicle. The modular design produces a highly versatile platform as it is reconfigurable by the additio… ▽ More We introduce UFO, a modular aerial robotic platform for transforming a rigid object into a multirotor robot. To achieve this, we develop flight modules, in the form of a control module and propelling modules, that can be affixed to an object. The object, or payload, serves as the airframe of the vehicle. The modular design produces a highly versatile platform as it is reconfigurable by the addition or removal of flight modules, adjustment of the modules' arrangement, or change of payloads. To facilitate the flight control, we propose an IMU-based estimation strategy for rapid computation of the robot's configuration. When combined with the adaptive geometric controller for further refinement of uncertain parameters, stable flights are accomplished with minimal manual intervention or tuning required by a user. To this end, we demonstrate hovering and trajectory tracking flights through various robot's configurations with different dummy payloads, weighing ~200-800 grams, using four to eight propelling modules. The results reveal that stable flights are attainable thanks to the proposed IMU-based estimation method. The flight performance is markedly improved over time through the adaptive scheme, with the position errors of a few centimeters after the parameter convergence. △ Less

Submitted 13 January, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: submitted to Transactions on Robotics

arXiv:1905.06496 [pdf, ps, other]

Trajectory Generation for Underactuated Multirotor Vehicles with Tilted Propellers via a Flatness-based Method

Authors: Bingguo Mu, Pakpong Chirarattananon

Abstract: This paper considers a class of rotary-wing aerial robots with unaligned propellers. By studying the dynamics of these vehicles, we show that the position and heading angle remain flat outputs of the system (similar to conventional quadrotors). The implication is that they can be commanded to follow desired trajectory setpoints in 3D space. We propose a numerical strategy based on the collocation… ▽ More This paper considers a class of rotary-wing aerial robots with unaligned propellers. By studying the dynamics of these vehicles, we show that the position and heading angle remain flat outputs of the system (similar to conventional quadrotors). The implication is that they can be commanded to follow desired trajectory setpoints in 3D space. We propose a numerical strategy based on the collocation method to facilitate the trajectory generation. This enables convenient computation of the nominal robot's attitude and control inputs. The proposed methods are numerically verified for three multirotor robots with different dynamics. These include a tricopter with a tilting propeller, and a quadrotor and a hexacopter with unparalleled propellers. △ Less

Submitted 15 May, 2019; originally announced May 2019.

Comments: 6 pages, 6 figures, accepted at IEEE AIM 2019

arXiv:1704.05959 [pdf, other]

SLAM with Objects using a Nonparametric Pose Graph

Authors: Beipeng Mu, Shih-Yuan Liu, Liam Paull, John Leonard, Jonathan How

Abstract: Mapping and self-localization in unknown environments are fundamental capabilities in many robotic applications. These tasks typically involve the identification of objects as unique features or landmarks, which requires the objects both to be detected and then assigned a unique identifier that can be maintained when viewed from different perspectives and in different images. The \textit{data asso… ▽ More Mapping and self-localization in unknown environments are fundamental capabilities in many robotic applications. These tasks typically involve the identification of objects as unique features or landmarks, which requires the objects both to be detected and then assigned a unique identifier that can be maintained when viewed from different perspectives and in different images. The \textit{data association} and \textit{simultaneous localization and mapping} (SLAM) problems are, individually, well-studied in the literature. But these two problems are inherently tightly coupled, and that has not been well-addressed. Without accurate SLAM, possible data associations are combinatorial and become intractable easily. Without accurate data association, the error of SLAM algorithms diverge easily. This paper proposes a novel nonparametric pose graph that models data association and SLAM in a single framework. An algorithm is further introduced to alternate between inferring data association and performing SLAM. Experimental results show that our approach has the new capability of associating object detections and localizing objects at the same time, leading to significantly better performance on both the data association and SLAM problems than achieved by considering only one and ignoring imperfections in the other. △ Less

Submitted 19 April, 2017; originally announced April 2017.

Comments: published at IROS 2016

arXiv:1509.08155 [pdf, ps, other]

Information-based Active SLAM via Topological Feature Graphs

Authors: Beipeng Mu, Matthew Giamou, Liam Paull, Ali-akbar Agha-mohammadi, John Leonard, Jonathan How

Abstract: Active SLAM is the task of actively planning robot paths while simultaneously building a map and localizing within. Existing work has focused on planning paths with occupancy grid maps, which do not scale well and suffer from long term drift. This work proposes a Topological Feature Graph (TFG) representation that scales well and develops an active SLAM algorithm with it. The TFG uses graphical mo… ▽ More Active SLAM is the task of actively planning robot paths while simultaneously building a map and localizing within. Existing work has focused on planning paths with occupancy grid maps, which do not scale well and suffer from long term drift. This work proposes a Topological Feature Graph (TFG) representation that scales well and develops an active SLAM algorithm with it. The TFG uses graphical models, which utilize independences between variables, and enables a unified quantification of exploration and exploitation gains with a single entropy metric. Hence, it facilitates a natural and principled balance between map exploration and refinement. A probabilistic roadmap path-planner is used to generate robot paths in real time. Experimental results demonstrate that the proposed approach achieves better accuracy than a standard grid-map based approach while requiring orders of magnitude less computation and memory resources. △ Less

Submitted 29 August, 2016; v1 submitted 27 September, 2015; originally announced September 2015.

Comments: published in CDC 2016

arXiv:1409.7808 [pdf, ps, other]

Resource-Constrained Adaptive Search for Sparse Multi-Class Targets with Varying Importance

Authors: Gregory E. Newstadt, Beipeng Mu, Dennis Wei, Jonathan P. How, Alfred O. Hero III

Abstract: In sparse target inference problems it has been shown that significant gains can be achieved by adaptive sensing using convex criteria. We generalize previous work on adaptive sensing to (a) include multiple classes of targets with different levels of importance and (b) accommodate multiple sensor models. New optimization policies are developed to allocate a limited resource budget to simultaneous… ▽ More In sparse target inference problems it has been shown that significant gains can be achieved by adaptive sensing using convex criteria. We generalize previous work on adaptive sensing to (a) include multiple classes of targets with different levels of importance and (b) accommodate multiple sensor models. New optimization policies are developed to allocate a limited resource budget to simultaneously locate, classify and estimate a sparse number of targets embedded in a large space. Upper and lower bounds on the performance of the proposed policies are derived by analyzing a baseline policy, which allocates resources uniformly across the scene, and an oracle policy which has a priori knowledge of the target locations/classes. These bounds quantify analytically the potential benefit of adaptive sensing as a function of target frequency and importance. Numerical results indicate that the proposed policies perform close to the oracle bound (<3dB) when signal quality is sufficiently high (e.g.~performance within 3 dB for SNR above 15 dB). Moreover, the proposed policies improve on previous policies in terms of reducing estimation error, reducing misclassification probability, and increasing expected return. To account for sensors with different levels of agility, three sensor models are considered: global adaptive (GA), which can allocate different amounts of resource to each location in the space; global uniform (GU), which can allocate resources uniformly across the scene; and local adaptive (LA), which can allocate fixed units to a subset of locations. Policies that use a mixture of GU and LA sensors are shown to perform similarly to those that use GA sensors while being more easily implementable. △ Less

Submitted 27 September, 2014; originally announced September 2014.

Comments: 49 pages, 9 figures

Showing 1–21 of 21 results for author: Mu, B