arXiv:2401.04377v2 [cs.RO] 15 Jan 2024

Towards Real-World Aerial Vision Guidance with Categorical 6D Pose Tracker

Jingtao Sun,  Yaonan Wang,  Danwei Wang Jingtao Sun, Yaonan Wang are with the College of Electrical and Information Engineering and the National Engineering Research Centre for Robot Visual Perception and Control, Hunan University, Changsha 410082, China. E-mail: {jingtaosun, yaonan}@hnu.edu.cn. Danwei Wang is with the School of Electrical and Electronic Engineering, Nanyang Technological University, SG 639798, Singapore. E-mail: edwwang@ntu.edu.sg. (Corresponding author: Yaonan Wang.)Manuscript received January 9, 2024; revised January 9, 2024.
Abstract

Tracking an object's 6-DoF pose is crucial for various downstream robot tasks and real-world applications. In this paper, we investigate the real-world robot task of aerial vision guidance for aerial robotics manipulation, utilizing category-level 6-DoF pose tracking. Aerial conditions inevitably introduce special challenges, such as rapid viewpoint changes in pitch and roll and inter-frame differences. To address these challenges, we first introduce a robust category-level 6-DoF pose tracker (Robust6DoF). This tracker leverages shape and temporal prior knowledge to explore optimal inter-frame keypoint pairs, generated under a priori structural adaptive supervision in a coarse-to-fine manner. Notably, Robust6DoF employs a Spatial-Temporal Augmentation module to handle inter-frame differences and intra-class shape variations through both temporal dynamic filtering and shape-similarity filtering. We further present a Pose-Aware Discrete Servo strategy (PAD-Servo), serving as a decoupled approach to accomplish the final aerial vision guidance task. It contains two servo action policies to better accommodate the structural properties of aerial robotics manipulation. Extensive experiments on four well-known public benchmarks demonstrate the superiority of Robust6DoF. Real-world tests verify that Robust6DoF, together with PAD-Servo, can be readily used in real-world aerial robotic applications. The project homepage is released at Robust6DoF.

Index Terms:
6-DoF pose estimation and tracking, 3D Transformer, visual servo, embedded robotic system.

1 Introduction

Tracking an object's Six Degree-of-Freedom (6-DoF) pose is one of the most fundamental tasks in computer vision and robotic applications, such as manipulation [1], aerial tracking [2, 3, 4] and navigation. Pioneering works in object 6-DoF pose tracking mostly adopt the standard setting in which the 3D CAD model of the object instance is available, achieving remarkable accuracy; this is often referred to as instance-level 6-DoF pose tracking. However, acquiring a perfect 3D model is challenging in realistic settings. To this end, we focus on the more demanding problem of aerial category-level 6-DoF pose tracking. The objective is to estimate, in real time, the 6-DoF pose of novel object instances within any given category in an aerial down-looking scene, while assuming that the 3D geometry model of the instance is unavailable. Furthermore, visual tracking-based methods have drawn considerable attention for unmanned aerial vehicles (UAVs) in applications such as aerial cinematography, visual localization and geographical surveying. In this work, we also aim to develop such aerial category-level pose tracking approaches to tackle the visual guidance task in the field of UAVs, especially for aerial robotics manipulation. The goal of this task is to allow the aerial robot to self-guide toward a static object or to actively follow a moving target.

To date, most available category-level 6-DoF pose tracking methods [5, 6, 7, 8, 9] adopt either a heatmap-based pipeline or a tracking-by-detection framework, e.g., 6-PACK [5]. However, these methods neglect the strong correlations inherent to consecutive frames, such as inter-frame differences, making it challenging for them to capture changes in the camera's viewpoint over time. Consequently, these pose trackers or estimators do not work as well in aerial scenarios, where the captured object image data may exhibit severe perspective drift caused by complex conditions such as high-speed motion and occlusions in an aerial bird's-eye view. In summary, aerial category-level 6-DoF pose tracking faces several challenges: 1) Intra-class shape variations within the same category, a major challenge that remains insufficiently addressed. Canonical/normalized spaces were introduced in prior works [10, 11] to address this issue, and several other methods [12, 13] employed a shape prior to adapt to shape inconsistency within the same category. However, these methods lack an explicit temporal representation between consecutive frames, limiting their performance for aerial pose estimation. 2) Inter-frame differences. Aerial conditions inevitably introduce special challenges including motion blur, camera motion, occlusion and so on. In particular, fast-changing views in pitch and roll hinder pose tracking performance in aerial scenes. To our knowledge, existing methods do not account for this situation. 3) The limited computing power of aerial platforms restricts the deployment of time-consuming state-of-the-art methods. Hence, an ideal tracker for aerial 6-DoF pose tracking must be both robust and efficient.

((a)) The configuration of the guidance task for aerial robotics manipulation.
((b)) Comparison of Robust6DoF with representative SoTA baselines.
Figure 1: Overall introduction of the pipeline. a) By following the real-time 6-DoF pose tracking generated by our Robust6DoF, the aerial manipulator gradually self-guides to the desired position, where the targeted object's pose approaches the desired value. b) The proposed Robust6DoF achieves top performance on the $IoU_{25}$ metric with the best inference speed on the public NOCS-REAL275 dataset. We test competitive category-level track-based and track-free (single pose estimation) methods using their official checkpoints and code, respectively. All results are measured on the same device for fairness.

As for the subsequent vision guidance task for aerial robotics manipulation, a significant challenge arises from the inherent instability of UAVs. Mounting an onboard manipulator further increases the nonlinearity of the UAV system. This complexity renders traditional 2D vision-based technologies suboptimal for solving the aerial vision guidance task. Visual servoing technology is particularly appealing to us due to its scalability in applications and practicality in operation. However, previous related works [14, 15, 16] mainly focus on Image-Based Visual Servoing (IBVS) and do not address the problem of the object's 6-DoF pose, since the pose state of the targeted object is usually assumed to be known. Unlike these methods, we propose a robust and efficient pose tracker and utilize its real-time object 6-DoF pose information to achieve the vision guidance task via a decoupled servo strategy.

To address the aforementioned challenges, we propose a pose-driven technology to accomplish the aerial vision guidance task for aerial robotics manipulation. As shown on the right of Fig. 1 (a), to solve the primary problem in this task, namely aerial category-level 6-DoF pose tracking, we first introduce a robust category-level 6-DoF pose tracker named "Robust6DoF", employing a three-stage pipeline. At the first stage, to construct an object-aware aggregated descriptor with point-pixel information, Robust6DoF employs a 2D-3D Dense Fusion Transformer to learn dense per-point local correspondences for arbitrary objects in the current observation. At the second stage, unlike related category-level methods [5, 6, 7, 8, 9], we present a Shape-Based Spatial-Temporal Augmentation module to address the challenges of inter-frame differences and intra-class shape variations. This module employs both temporal dynamic filtering and shape-similarity filtering with an encoder-decoder structure. In this way, the aggregated descriptor is updated into a group of augmented embeddings. At the third and final stage, a Prior-Guided Keypoints Generation and Match module is proposed to seek the optimal inter-frame keypoint pairs based on these augmented embeddings. Specifically, we apply a priori structural adaptive supervision mechanism in a coarse-to-fine manner to enhance the robustness of the generated keypoints. The final 6-DoF pose is obtained using the Perspective-n-Point algorithm and RANSAC.

As displayed in the middle of Fig. 1 (a), to tackle the second problem in this task, namely visual guidance for the aerial robot, we further propose a Pose-Aware Discrete Servo policy called "PAD-Servo", which includes two servo action schemes: (1) the Rotational Action Loop generates the rotational action signal for the onboard manipulator in 3D Cartesian space; this signal is derived from the rotation matrix of the 6-DoF pose tracked by our Robust6DoF; (2) the Translational Action Loop produces the translational action signal for the aerial vehicle in 2D image space; this signal comes from the location vector of Robust6DoF's pose tracking results. This separated design can be perfectly adapted to the aerial robot's kinematic model to realize collaborative actions for both the aerial vehicle and the onboard manipulator.

We evaluate the performance of our Robust6DoF on four publicly available datasets and achieve state-of-the-art results. It is noteworthy that Robust6DoF achieves top performance on the $IoU_{25}$ metric along with the best tracking speed on the NOCS-REAL275 dataset, as depicted in Fig. 1 (b). Furthermore, we conduct real experiments on a real-world aerial robotic platform to validate the practicality of PAD-Servo using the trained Robust6DoF, and obtain robust real-world results. The original contributions of this paper can be summarized as follows:

  • To address the problem of aerial category-level 6-DoF pose tracking, we introduce a robust category-level 6-DoF pose tracker utilizing temporal and shape prior knowledge, along with a priori structural adaptive supervision mechanism for keypoint pair generation. To the best of our knowledge, we are the first to solve the problem of aerial category-level object 6-DoF pose tracking in high-mobility aerial scenarios.

  • To tackle the challenges of inter-frame differences and intra-class variations, we present a Shape-Based Spatial-Temporal Augmentation module that performs both temporal dynamic filtering and shape-similarity filtering. It improves the robustness of pose tracking for different instances in real-time aerial scenes.

  • From the robotic-system perspective, we design an efficient Pose-Aware Discrete Servo strategy to achieve the visual guidance task for aerial robotics manipulation, which is fully adapted to our Robust6DoF pose tracker.

  • We conduct a series of experiments on the NOCS-REAL275 [10], YCB-Video [17], YCBInEOAT [18] and Wild6D [19] datasets. Our Robust6DoF achieves new state-of-the-art performance. Moreover, real-world experiments show the feasibility of our techniques in realistic aerial robotics scenes.

The remainder of this article is organized as follows. In the next section, we discuss related works. Sec. 3 introduces the notation and task description. Sec. 4 describes the proposed approach and its core modules. The experiments are reported in Sec. 5. Finally, we summarize the proposed method's limitations, discuss future work and conclude the paper in Sec. 6.

2 Related Work

This section reviews related works on aerial visual object tracking, object 6-DoF pose estimation and tracking, and visual servoing for aerial robotics manipulation.

2.1 Aerial Visual Object Tracking

Visual object tracking can be broadly divided into three categories: Siamese-based, DCF-based and Transformer-based. Recently, several DCF-based methods have been deployed for aerial visual object tracking, including SARCT [20], ARCF [21] and MRCF [22]. In [20], Xue et al. presented a semantic-aware correlation approach with low computing cost to enhance the performance of DCF trackers. Other representative efforts include AutoTrack [23] and TCTrack [24]. Most of these methods continuously update the model from past historical information. In these aerial tracking methods, the targets are relatively small and are often in a state of fast motion. In this work, we aim to track tabletop objects that occupy a large portion of the camera's view. Additionally, aerial 6-DoF pose tracking is more challenging due to rapid viewpoint changes in pitch and roll.

2.2 Object 6-DoF Pose Estimation

Early advancements in 6-DoF pose estimation can be broadly categorized into two groups: instance-level [25, 26, 27, 28, 29, 30, 31, 32] and category-level [33, 34, 13, 10, 35, 36, 37, 11, 35, 38, 39, 40, 12, 41, 42, 43, 44]. Instance-level methods predict object pose using known 3D CAD models and can be further classified into template-based and feature-based methods. However, obtaining accurate CAD models of unseen objects is a challenge for this type of method. In contrast, category-level methods aim to predict the pose of instances without specific models. NOCS [10] pioneered the direct regression of canonical coordinates for each instance, while CASS [11] developed a variational autoencoder for reconstructing object models. Category-level methods still face challenges due to the sensitivity of RGB or RGB-D features to surface texture and the problem of intra-class shape variation.

2.3 Object 6-DoF Pose Tracking

Object 6-DoF pose tracking is an important task in robotics and computer vision. Researchers have primarily approached this task in two distinct manners: (1) instance-level pose tracking, which relies on complete 3D CAD models of the instances; notable efforts include PA-Pose [45], Deep-AC [46], BundleSDF [47], PoseRBPF [48] and others [18]; (2) category-level pose tracking, which operates without specific 3D model requirements. Wang et al. [5] first introduced a novel category-level tracking benchmark, constructing a set of 3D unsupervised keypoints for pose tracking, named 6-PACK. To address per-part pose tracking for articulated objects, CAPTRA [6] presented an end-to-end differentiable pipeline without any pre- or post-processing. ICK-Track [7] introduced an inter-frame consistent keypoints generation network to generate corresponding keypoint pairs for pose tracking. Lin et al. proposed CenterTrack [8], using the CenterNet framework to achieve categorical pose tracking. In [9], Yu et al. proposed CatTrack to solve this problem with single-stage keypoint-based registration.

Our proposed Robust6DoF is also a category-level 6-DoF pose tracking method. However, different from existing category-level track-based methods [5, 6, 7, 8, 9], we address the challenges of intra-class shape and inter-frame variations by leveraging both temporal prior and shape prior knowledge. Meanwhile, we also consider inter-frame keypoint generation under the supervision of canonical shape priors, facilitating real-time adaptation of the generated keypoints to inter-frame differences and to discrepancies between the observation and the shape prior. Notably, our method stands as the pioneering solution to the special aerial challenges in category-level 6-DoF pose tracking.

2.4 Visual Servoing for Aerial Robotics Manipulation

The standard solutions to the visual servoing task rely on Position-Based Visual Servoing (PBVS) or Image-Based Visual Servoing (IBVS). IBVS, being more robust than PBVS in handling uncertainties and disturbances that affect the robot's model, has proven to be a viable method for addressing aerial robotics manipulation tasks [14, 15, 49, 50, 16]. In [14], Chen et al. introduced a robust adaptive visual servoing method to achieve a compliant physical interaction task for aerial robotics manipulation. In [16], Oussama et al. proposed using deep neural networks (DNNs) for visual servoing applications of UAVs. In [50], a typical vision guidance system based on IBVS was integrated with passivity-based adaptive control for aerial robotics manipulation, showcasing promising results in simulation experiments and indicating potential real-world applications. Additionally, other methods have been developed to address vision-based tasks in aerial robotics manipulation, such as [51, 52]. Although IBVS-based techniques are well established, they exhibit insensitivity to manipulator calibration and susceptibility to local optima. In our work, we advance the field by leveraging both PBVS and IBVS and directly generating the movement actions of the aerial vehicle and manipulator through their respective servo action loops. Our proposed method is better adapted to the nonlinear nature of the aerial manipulator.


Figure 2: Establishment of the aerial robot frames. $\{W: O_W\text{-}X_W Y_W Z_W\}$ denotes the world coordinate frame. $\{B: O_B\text{-}X_B Y_B Z_B\}$ denotes the base coordinate frame of the aerial vehicle. $\{L_i: O_i\text{-}X_i Y_i Z_i\}$ denotes the body frame of the $i$-th link of the robotic manipulator ($i = 0,1,2,3,4$), where $i=0$ indicates the base frame of the manipulator. $\{T: O_T\text{-}X_T Y_T Z_T\}$ denotes the coordinate frame of the actuator. $\{C: O_C\text{-}X_C Y_C Z_C\}$ denotes the onboard camera frame. The blue dot represents the 3D workspace of the onboard manipulator.

3 Preliminary and Task Statement

3.1 Robot Frame and Velocity Transmission

In our work, we use a general aerial manipulator as the robotic platform, consisting of a multirotor UAV with a 4-link serial robotic manipulator and an RGB-D camera in an eye-in-hand configuration. The schematic diagram and the reference frames of this platform are shown in Fig. 2. We define the following representations: $\dot{p}=(v_{x},v_{y},v_{z})\in\mathbb{R}^{3}$ and $\omega=(\omega_{x},\omega_{y},\omega_{z})\in\mathbb{R}^{3}$ denote the linear and angular velocities of the aerial vehicle, respectively, and $\dot{\eta}=(\dot{\eta}_{1},\dot{\eta}_{2},\dot{\eta}_{3},\dot{\eta}_{4})\in\mathbb{R}^{4}$ denotes the joint angular velocities of the onboard manipulator. An expression that collects all generalized velocities is given by:

$$ q = (\dot{p}^{T},\, \omega^{T},\, \dot{\eta}^{T}). \tag{1} $$

Following the differential kinematic propagation, the velocity transmission between all generalized velocities and the velocity of the onboard camera can be derived as:

$$ V_{C} = J\,q^{T}, \tag{2} $$

where $V_{C}=[{}^{c}\dot{p}_{c}^{T},\,{}^{c}\omega_{c}^{T}]^{T}\in\mathbb{R}^{6\times 1}$ is the velocity vector of the camera frame expressed in its own frame, consisting of linear and angular velocities. The generalized Jacobian matrix $J\in\mathbb{R}^{6\times 10}$ is given by:

(4)

where $z=[\,0_{1\times 5}\;\;1\,]^{T}$, and $U_{\alpha}^{\beta}$ is the generalized transformation matrix between any two adjacent coordinate frames $\{\alpha\}$ and $\{\beta\}$, expressed as:

$$ U_{\alpha}^{\beta} = \begin{bmatrix} R_{\alpha}^{\beta} & 0_{3\times 3} \\ P(r_{\beta,\alpha}^{\beta})\,R_{\alpha}^{\beta} & R_{\alpha}^{\beta} \end{bmatrix}, \tag{7} $$

where $R_{\alpha}^{\beta}$ is the rotation matrix between frame $\{\alpha\}$ and frame $\{\beta\}$, and $r_{\beta,\alpha}^{\beta}=(r_{x},r_{y},r_{z})$ is the translation vector of frame $\{\alpha\}$ with respect to, and expressed in, frame $\{\beta\}$. Here, $\{\alpha,\beta\}$ denotes any pair of body-fixed frames in the system, i.e., $\{\alpha,\beta\}\in\{B,L_{1},\ldots,C\}$, as depicted in Fig. 2. After calculation, Eq. (2) can also be expressed as:

(15)

where $T_{3\times 4}$ is a matrix composed of the last row of $R_{\alpha}^{\beta}$.
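For clarity, the following sketch assembles the generalized transformation matrix $U_{\alpha}^{\beta}$ of Eq. (7) from a relative rotation and translation and applies it to a 6-D twist; chaining such matrices along $B \rightarrow L_1 \rightarrow \ldots \rightarrow C$ and stacking the per-joint contributions yields the Jacobian $J$ of Eq. (2). The numeric values and helper names (`skew`, `adjacent_transform`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def skew(r):
    """Skew-symmetric matrix P(r) such that P(r) @ v == np.cross(r, v)."""
    rx, ry, rz = r
    return np.array([[0.0, -rz,  ry],
                     [ rz, 0.0, -rx],
                     [-ry,  rx, 0.0]])

def adjacent_transform(R_ab, r_ba_b):
    """Generalized 6x6 transformation U_alpha^beta of Eq. (7),
    with the block layout [[R, 0], [P(r) R, R]] stated in the text."""
    U = np.zeros((6, 6))
    U[:3, :3] = R_ab
    U[3:, :3] = skew(r_ba_b) @ R_ab
    U[3:, 3:] = R_ab
    return U

# Hypothetical example: map a twist expressed in the vehicle base frame {B}
# into the camera frame {C} (rotation and translation values are assumed).
R_bc = np.eye(3)
r_cb_c = np.array([0.10, 0.00, 0.25])
twist_B = np.array([0.2, 0.0, 0.0, 0.0, 0.0, 0.1])   # (v, omega) in {B}
twist_C = adjacent_transform(R_bc, r_cb_c) @ twist_B
```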

Figure 3: Complete framework of our category-level 6-DoF pose tracker, termed Robust6DoF. It takes the RGB-D video stream captured by the onboard camera as input and tracks the 6-DoF pose $\mathcal{P}^{(t)}$ of an arbitrary object in the current observation. It mainly consists of three stages. Stage-1: 2D-3D dense fusion for aggregating the pixel-point local object descriptor $\tilde{F}_{obj}^{(t)}$ (shown in Fig. 4 (a)). Stage-2: shape-based spatial-temporal augmentation for comprehensive refinement to obtain a group of embeddings $\{\tilde{F}_{obj}^{(t)}, \tilde{F}_{temp}^{(t)}, \tilde{F}_{aug}^{(t)}\}$, taking advantage of both temporal prior and shape prior knowledge (shown in Fig. 4 (b) and (c)). Stage-3: prior-guided keypoints generation and matching to construct and accurately align $n$ inter-frame keypoint pairs $(k_i^{(t-1)}, k_j^{(t)})$ in a coarse-to-fine manner. Using these optimally matched keypoint pairs, we solve for the final object 6-DoF pose with the PnP and RANSAC algorithms.

3.2 Task Description

Eq. (5) clearly shows that the linear velocity of the onboard camera results from the linear velocity of the aerial vehicle. Similarly, considering the underactuation of the aerial vehicle in two degrees of freedom ($\omega_x$ and $\omega_y$), the angular velocity of the camera arises primarily from the motion of each joint of the manipulator ($\dot{\eta}^{T}$) and from $\omega_z$. In other words, hidden relationships exist between the linear velocity of the aerial vehicle and the camera's translation ($T$), as well as between the angular velocity of the manipulator and the camera's rotation ($R$). Thus, the visual guidance task can be decomposed into two processes: (1) 6-DoF pose tracking of the object, and (2) visual servoing of the aerial robot. Based on the above analysis and the formulation shown on the right of Fig. 1, we address this task as follows: our method first takes the RGB-D video stream captured by the onboard camera as input to track the object's 6-DoF pose in real time, and subsequently generates the servo action signals for the aerial vehicle and the onboard manipulator, respectively:

$$ \left\{ \begin{array}{l} \mathcal{P}^{(t)} = [R^{(t)}\,|\,T^{(t)}] = \Delta\mathcal{P}^{(t)} \cdot \ldots \cdot \mathcal{P}^{(0)} \\ \dot{\eta}^{(t)} = \partial_{rot} \mapsto \min\limits_{\varepsilon_{r}}[R^{(t)}, R^{*}] \\ v^{(t)} = \partial_{tra} \mapsto \min\limits_{\varepsilon_{p}}[T^{(t)}, T^{*}] \end{array} \right. \tag{16} $$

where $\partial_{rot}$ and $\partial_{tra}$ denote the servo action functions. The pose change $\Delta\mathcal{P}^{(t)}\in\mathbf{SE}(3)$ contains the change in rotation $\Delta R^{(t)}\in\mathbf{SO}(3)$ and the change in translation $\Delta T^{(t)}\in\mathbb{R}^{3}$, i.e., $\Delta\mathcal{P}^{(t)}=[\Delta R^{(t)}\,|\,\Delta T^{(t)}]$. The absolute 6-DoF pose $\mathcal{P}^{(t)}=[R^{(t)}\,|\,T^{(t)}]$ in the current observation can then be derived by recursing the previous tracking results over time. The initial pose $\mathcal{P}^{(0)}$ is set to the estimated pose at the beginning of the guidance task.
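To make the decoupling in Eq. (16) concrete, the sketch below derives a rotational action from the error between the tracked $R^{(t)}$ and the desired $R^{*}$, routed to the manipulator, and a translational action from the error between $T^{(t)}$ and $T^{*}$, routed to the vehicle. The proportional gains `k_rot`, `k_tra` and the use of a 3D translation error (rather than the 2D image-space error of the actual Translational Action Loop) are simplifying assumptions; the exact PAD-Servo laws are given in Sec. 4.2.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_error(R_t, R_star):
    """Axis-angle vector of the rotation taking R^(t) to the desired R*."""
    return Rotation.from_matrix(R_star @ R_t.T).as_rotvec()

def pad_servo_step(R_t, T_t, R_star, T_star, k_rot=0.8, k_tra=0.5):
    """One discrete servo step in the spirit of Eq. (16): the rotational
    action drives min_{eps_r}[R^(t), R*] (onboard manipulator), while the
    translational action drives min_{eps_p}[T^(t), T*] (aerial vehicle)."""
    rot_action = k_rot * rotation_error(R_t, R_star)   # manipulator command
    tra_action = k_tra * (T_star - T_t)                # vehicle command
    return rot_action, tra_action

# Example with an arbitrary tracked pose and a desired pose
R_t = Rotation.from_euler("xyz", [5.0, -3.0, 10.0], degrees=True).as_matrix()
T_t = np.array([0.40, -0.10, 1.20])
R_star, T_star = np.eye(3), np.array([0.0, 0.0, 1.0])
rot_action, tra_action = pad_servo_step(R_t, T_t, R_star, T_star)
```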

4 Approach

In this section, we first introduce our proposed category-level object 6-DoF pose tracker, Robust6DoF, in Sec. 4.1 and then present the detailed scheme of our Pose-Aware Discrete Servo policy, PAD-Servo, for the aerial robotics manipulator in Sec. 4.2.

4.1 Categorical 6-DoF Pose Tracker: Robust6DoF

4.1.1 Network Overview

In this subsection, we present an overview of our proposed category-level 6-DoF pose tracker and then introduce each component of the designed network in detail. As depicted in Fig. 3, our goal is to continuously estimate the change of the 6-DoF pose, denoted as $\Delta\mathcal{P}^{(t)}$, for the target object within an arbitrary known category. The core inputs of the network are the observed RGB-D video stream captured by the onboard camera and the corresponding categorical shape prior $P_r\in\mathbb{R}^{N_r\times 3}$, which is converted in advance into the same coordinate frame as the camera. For simplicity, the shape prior point set is uniformly sampled to the same size as $P^{(t)}$, i.e., $N_r=N$. Different from recent category-level pose tracking methods [5, 6, 7, 8, 9], we employ a three-stage pipeline, as displayed in Fig. 3. Stage-1: we first integrate the local pixel-point dense feature descriptor for each target object using the proposed 2D-3D Dense Fusion Transformer (Sec. 4.1.2). Stage-2: we then introduce a Shape-Based Spatial-Temporal Augmentation module with an encoder-decoder structure to dynamically enhance this object-aware descriptor using both temporal prior and shape prior knowledge, ensuring that the final augmented representations adapt to intra-class variations and inter-frame differences (Sec. 4.1.3). Stage-3: all enhanced embeddings are passed through the proposed Prior-Guided Keypoints Generation and Match module to build 3D-3D inter-frame keypoint pair correspondences in a coarse-to-fine manner (Sec. 4.1.4). The final pose is solved with the Perspective-n-Point (PnP) algorithm and RANSAC using the generated and aligned sets of keypoint pairs.
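As a minimal illustration of the last step of this pipeline, the sketch below recovers the inter-frame pose change from matched keypoint pairs with OpenCV's PnP + RANSAC. The helper name `solve_frame_pose`, the RANSAC parameters, and the assumption that previous-frame keypoints are lifted to 3D and matched to 2D pixel locations in the current frame are ours, not the paper's exact formulation.

```python
import cv2
import numpy as np

def solve_frame_pose(kpts_3d_prev, kpts_2d_curr, K):
    """Recover the inter-frame pose change from n matched keypoint pairs
    with PnP + RANSAC, as done at the end of stage-3.
    kpts_3d_prev : (n, 3) keypoints lifted to 3D in the previous frame
    kpts_2d_curr : (n, 2) matched pixel locations in the current frame
    K            : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        kpts_3d_prev.astype(np.float32), kpts_2d_curr.astype(np.float32),
        K.astype(np.float32), None,
        iterationsCount=100, reprojectionError=3.0)
    if not ok:
        raise RuntimeError("PnP + RANSAC did not converge")
    dR, _ = cv2.Rodrigues(rvec)        # axis-angle -> rotation matrix
    return dR, tvec.reshape(3)         # Delta P^(t) = [Delta R | Delta T]

# The absolute pose is then chained over time, P^(t) = Delta P^(t) . P^(t-1),
# starting from the initial estimate P^(0).
```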

4.1.2 2D-3D Dense Fusion Transformer

The objective of this module is to build a local aggregated descriptor for each object by establishing dense per-point feature correspondences between the 3D point patch and the 2D image crop; this descriptor serves as the base embedding for the subsequent augmentation. In earlier works such as [53] and [5], 2D image and 3D depth information were used separately as inputs without considering the combination of modality-wise features, which resulted in the loss of inter-modal correlation during feature extraction. In this regard, we present a pixel-point dense fusion module that exploits the similarity properties of the Transformer to enhance the selection of highly correlated feature pairs, as shown in Fig. 4 (a). Concretely, given the current image pixel crop $I^{(t)}\in\mathbb{R}^{H\times W\times 3}$ and the observable geometric point patch $P^{(t)}\in\mathbb{R}^{N\times 3}$, which are in one-to-one correspondence through back-projection, we first employ the proposed Weight-Shared Attention (WSA) to map each pixel in the image crop to a color feature embedding $F_c\in\mathbb{R}^{N\times d_{rgb}}$ and, in parallel, to map the corresponding point in the 3D point patch to a geometric feature embedding $F_g\in\mathbb{R}^{N\times d_{geo}}$. The WSA layer adopts an offset-attention structure:

(17)
(18)

where $\varphi$ represents the linear and ReLU layers applied to the output features and $\alpha$ denotes the softmax function. $\mathcal{F}_{i}$, $i=q,k,v$, represents the convolutional operation for the query, key and value, respectively.
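For readers unfamiliar with offset attention, the following PyTorch sketch shows one plausible instantiation of such a layer under the conventions above ($\mathcal{F}_q,\mathcal{F}_k,\mathcal{F}_v$ as 1x1 convolutions, $\alpha$ as a softmax, $\varphi$ as a linear + ReLU map). The layer widths and the residual form are illustrative assumptions, not the exact WSA definition of Eqs. (17)-(18).

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Minimal offset-attention layer, used here as a stand-in for the WSA
    block; dimensions and the residual connection are illustrative."""
    def __init__(self, d=128):
        super().__init__()
        self.f_q = nn.Conv1d(d, d // 4, 1, bias=False)   # F_q
        self.f_k = nn.Conv1d(d, d // 4, 1, bias=False)   # F_k
        self.f_v = nn.Conv1d(d, d, 1)                    # F_v
        self.phi = nn.Sequential(nn.Conv1d(d, d, 1), nn.ReLU())  # phi

    def forward(self, x):                        # x: (B, d, N) per-point features
        q = self.f_q(x).permute(0, 2, 1)         # (B, N, d/4)
        k = self.f_k(x)                          # (B, d/4, N)
        attn = torch.softmax(q @ k, dim=-1)      # alpha: (B, N, N)
        x_att = self.f_v(x) @ attn.transpose(1, 2)   # aggregated features
        return self.phi(x - x_att) + x           # offset (residual) output

# e.g. colour and geometric embeddings can each be produced by such a layer
wsa = OffsetAttention()
out = wsa(torch.randn(2, 128, 1024))
```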

After the initial dense fusion, we aggregate this base dense information and then encode the context-dependent local feature descriptor $\tilde{F}_{obj}^{(t)}$ for each object in the current frame. Inspired by the observation of Wang et al. [54] that the context mapping matrix in the self-attention mechanism is of low rank, we exploit this property not only to reduce the time complexity from $O(N^{2})$ to $O(N)$ but also to enhance the instance's pose representation in terms of local per-point fusion. Specifically, $F_c$ and $F_g$ undergo an MLP operation, and the color embedding $F_c$ is projected into two identical projection matrices $X_c, Y_c\in\mathbb{R}^{N\times k}$. As shown in Fig. 4 (a), we then incorporate them when computing the key and value vectors. This allows us to calculate an $(N\times k)$-dimensional context mapping matrix using scaled dot-product attention with multiple heads:

$$ \begin{split} f_{g}^{i} &= \mathrm{Attention}(\bar{F}_{g}^{(q)}, \bar{F}_{g}^{(k)}, \bar{F}_{g}^{(v)}) \\ &= \mathrm{Softmax}\!\left( \frac{\mathcal{F}(\bar{F}_{g}^{(q)})\,\big(X_{c}\cdot\mathcal{F}(\bar{F}_{g}^{(k)})\big)^{T}}{\sqrt{d}} \right) \cdot Y_{c} \cdot \mathcal{F}(\bar{F}_{g}^{(v)}), \end{split} \tag{19} $$

$$ \tilde{F}_{obj}^{(t)} = \mathrm{Cat}(f_{g}^{1}, f_{g}^{2}, \ldots, f_{g}^{h}), \tag{20} $$

where $d$ is the embedding dimension and $h$ is the number of heads. $\mathcal{F}$ denotes the linear layer.
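A minimal single-head sketch of Eqs. (19)-(20) is given below: the geometric embedding attends to itself through a $k$-dimensional projection derived from the colour embedding, so the attention cost scales with $N\cdot k$ rather than $N^{2}$. Reusing one projection for both keys and values reflects the "two identical projection matrices" above; layer names, sizes, and the omission of the multi-head concatenation of Eq. (20) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankFusionAttention(nn.Module):
    """Sketch of Eqs. (19)-(20) with a single head; the colour embedding
    supplies the low-rank projection X_c (= Y_c) of keys and values."""
    def __init__(self, d=128, k=256):
        super().__init__()
        self.f_q = nn.Linear(d, d, bias=False)      # F applied to F_g^(q)
        self.f_k = nn.Linear(d, d, bias=False)      # F applied to F_g^(k)
        self.f_v = nn.Linear(d, d, bias=False)      # F applied to F_g^(v)
        self.to_proj = nn.Linear(d, k, bias=False)  # colour -> projection X_c

    def forward(self, F_g, F_c):                    # both (B, N, d)
        Xc = self.to_proj(F_c)                      # (B, N, k), reused as Y_c
        q = self.f_q(F_g)                           # (B, N, d)
        k = Xc.transpose(1, 2) @ self.f_k(F_g)      # (B, k, d) projected keys
        v = Xc.transpose(1, 2) @ self.f_v(F_g)      # (B, k, d) projected values
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                             # (B, N, d) fused descriptor

# Example with illustrative sizes
layer = LowRankFusionAttention()
F_obj = layer(torch.randn(2, 1024, 128), torch.randn(2, 1024, 128))
```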

Figure 4: Detailed structure of the tracking workflow in the first two stages. a) 2D-3D Dense Fusion Transformer. The image crop and point patch serve as inputs to generate the fused local descriptor $\tilde{F}_{obj}^{(t)}$ for arbitrary instances in the current $t$-th frame. This component consists of two parts: i) the WSA layer for pixel-point dense fusion; ii) scaled dot-product attention for local feature aggregation. b) Spatial-Temporal Filtering Encoder. It exploits temporal knowledge from the previous $(t-1)$-th frame to the current one via the proposed temporal dynamic filtering. c) Augmentation Decoder with shape-similarity filtering. These blocks leverage the proposed shape-similarity filtering to augment the temporal embedding $\tilde{F}_{temp}^{(t)}$, effectively addressing the challenge of intra-category variability.

4.1.3 Shape-Based Spatial-Temporal Augmentation

Unlike common indoor tabletop scenes, the fast change of the onboard camera's view in the aerial bird's-eye perspective, such as in pitch or roll, may induce motion blur or significant inter-frame variations in spatial scale. These challenges are unavoidable in real-time aerial 6-DoF pose tracking. Additionally, intra-class shape variation within the same class can notably impact the performance of pose tracking for different instances. To the best of our knowledge, existing category-level pose tracking methods [5, 6, 7, 8, 9] have not completely solved these problems. To this end, we introduce a shape-based spatial-temporal augmentation strategy in this module. This strategy has an encoder-decoder structure that leverages both temporal dynamic filtering and shape-similarity filtering, as depicted in the middle of Fig. 3.

The spatial-temporal filtering encoder aims to address the challenge of inter-frame differences and to initially construct a base temporal embedding $\tilde{F}_{temp}^{(t)}$ by transferring prior knowledge from the previous frame to the current one. As presented in Fig. 4 (b), given the local descriptor $\tilde{F}_{obj}^{(t)}=\{f_{i}^{(t)}\in\mathbb{R}^{d}\}_{i=1}^{N}$, we first apply a multi-head attention layer to generate $\dot{F}^{(t)}=\{\dot{f}_{i}^{(t)}\in\mathbb{R}^{d}\}_{i=1}^{N}$:

$$ \dot{F}^{(t)} = \mathrm{Norm}\big(\tilde{F}_{obj}^{(t)} + \mathrm{MultiHead}(\tilde{F}_{obj}^{(t)}, \tilde{F}_{obj}^{(t)}, \tilde{F}_{obj}^{(t)})\big), \tag{21} $$

where "$\mathrm{Norm}$" indicates the normalization layer. Considering that a faster change in viewpoint results in fewer overlaps between consecutive frames, making it difficult to capture useful inter-frame information, we need to retain the observable shape features while filtering out irrelevant data. Therefore, based on $\tilde{F}_{obj}^{(t-1)}=\{f_{j}^{(t-1)}\in\mathbb{R}^{d}\}_{j=1}^{N}$ from the previous $(t-1)$-th frame, we compute a spatial-temporal similarity map matrix $S_{\Delta t}$ using the vector inner product:

$$ S_{\Delta t}(i,j) = \langle \dot{f}_{i}^{(t)},\, f_{j}^{(t-1)} \rangle \in \mathbb{R}^{N\times N}, \tag{22} $$

which is then row-wise normalized through a softmax function to constrain the elements of $S_{\Delta t}$ to the range $[0,1]$, i.e., $\bar{S}_{\Delta t}=\mathrm{Softmax}(S_{\Delta t}(i,\cdot))_{j}$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. Afterward, we employ a max-pooling operator along with a convolution layer to obtain the filtered descriptor $\ddot{F}^{(t)}=\{\ddot{f}_{i}^{(t)}\in\mathbb{R}^{d}\}_{i=1}^{N}$:

$$ \ddot{f}_{i}^{(t)} = \mathrm{MaxPool}\big(\mathcal{F}([\bar{S}_{\Delta t}(i,j)\odot f_{j}^{(t-1)};\, f_{i}^{(t)}])\big)_{j}, \tag{23} $$

where $\mathcal{F}$ denotes the convolution layer and $[\,;\,]$ indicates vector concatenation. In this process, we effectively assign weights according to the impact of the previous-frame features using the map matrix $\bar{S}_{\Delta t}$ and prioritize the most relevant shape points. With this, the current $t$-th temporal embedding can be obtained as follows:

$$ \tilde{F}_{temp}^{(t)} = \mathrm{Norm}\big(\ddot{F}^{(t)} + \mathrm{MultiHead}(\dot{F}^{(t)}, \ddot{F}^{(t)}, \ddot{F}^{(t)})\big). \tag{24} $$
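The temporal dynamic filtering of Eqs. (22)-(23) can be summarized with the sketch below, which weights previous-frame features by the row-normalized similarity map, fuses each weighted pair with the current descriptor through a shared MLP (a stand-in for the convolution layer $\mathcal{F}$), and max-pools over $j$; the result would then be combined with $\dot{F}^{(t)}$ by another multi-head attention layer as in Eq. (24). Tensor sizes and the exact choice of which current-frame descriptor enters the concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

def temporal_dynamic_filter(F_dot_t, F_obj_prev, F_obj_curr, mlp):
    """Sketch of Eqs. (22)-(23).
    F_dot_t    : (N, d) output of the multi-head attention in Eq. (21)
    F_obj_prev : (N, d) local descriptor of frame t-1
    F_obj_curr : (N, d) local descriptor of frame t
    mlp        : stand-in for the convolution layer F in Eq. (23)
    """
    S = F_dot_t @ F_obj_prev.t()                  # Eq. (22): (N, N) similarity
    S_bar = torch.softmax(S, dim=-1)              # row-wise normalization
    weighted = S_bar.unsqueeze(-1) * F_obj_prev.unsqueeze(0)      # (N, N, d)
    tiled = F_obj_curr.unsqueeze(1).expand(-1, F_obj_prev.shape[0], -1)
    pairs = torch.cat([weighted, tiled], dim=-1)  # [S_bar(i,j) * f_j^(t-1); f_i^(t)]
    return mlp(pairs).max(dim=1).values           # Eq. (23): (N, d), i.e. the filtered descriptor

# Illustrative sizes
N, d = 256, 128
mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
F_filtered = temporal_dynamic_filter(torch.randn(N, d), torch.randn(N, d),
                                     torch.randn(N, d), mlp)
```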

To address the further challenge of intra-category shape variability, we employ the canonical shape prior to augment the obtained temporal embedding in the subsequent augmentation decoder. To this end, a shape-similarity filtering block, depicted at the bottom of Fig. 4 (c), is adopted before the decoding process to adaptively enhance the local object descriptor based on the shape prior $P_{r}$. To jointly optimize the static prior model with the primary network, $P_{r}$ is first fed into a learnable layer to generate the shape-point representation $F_{pr}=\{f_{j}^{pr}\in\mathbb{R}^{d}\}_{j=1}^{N_{r}}$. Likewise, a shape-similarity map matrix $S_{pr}$ is computed as follows:

\[
S_{pr}(i,j)=\langle f_{i}^{(t)},\,f_{j}^{pr}\rangle\in\mathbb{R}^{N\times N_{r}}, \qquad (25)
\]

and then normalized as $\bar{S}_{pr}$ using the same operation as before. With this, the filtered features can be denoted by $\dddot{F}^{(t)}=\{\dddot{f}_{i}^{(t)}\in\mathbb{R}^{d}\}_{i=1}^{N}$:

\[
\dddot{f}_{i}^{(t)}=\mathcal{F}([\bar{S}_{pr}(i,j)\odot f_{j}^{pr};\,\rho_{j}]), \qquad (26)
\]

where $\rho_{j}$ is the coordinate of the $j$-th shape point. We assign weights according to the impact of the shape-point features using the map matrix $\bar{S}_{pr}$ to compensate for missing information in the current observation. The final augmented output $\tilde{F}_{aug}^{(t)}$ is then obtained with one multi-head attention layer followed by a feed-forward network:

\[
\begin{split}
F_{aug}^{(t)}&=\mathrm{Norm}\big(\dddot{F}^{(t)}+\mathrm{MultiHead}(\dddot{F}^{(t)},\tilde{F}_{temp}^{(t)},\tilde{F}_{temp}^{(t)})\big)\\
\tilde{F}_{aug}^{(t)}&=\mathrm{Norm}\big(F_{aug}^{(t)}+\mathrm{FFN}(F_{aug}^{(t)})\big).
\end{split} \qquad (27)
\]
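A minimal sketch of the shape-similarity filtering block and the augmentation decoder (Eqs. (25)-(27)) follows. The similarity-weighted aggregation over prior points is one plausible reading of Eq. (26), and all layer names and widths are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapePriorAugmentation(nn.Module):
    """Sketch of shape-similarity filtering plus the augmentation decoder."""

    def __init__(self, d=128, heads=4):
        super().__init__()
        self.fuse = nn.Linear(d + 3, d)     # F([weighted prior feature ; rho_j])
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, f_cur, f_prior, rho, f_temp):
        # f_cur: (B, N, d) observation features; f_prior: (B, Nr, d) prior features
        # rho:   (B, Nr, 3) prior shape-point coordinates; f_temp: (B, N, d) temporal embedding
        S = torch.einsum('bid,bjd->bij', f_cur, f_prior)          # shape-similarity map (Eq. 25)
        S_bar = F.softmax(S, dim=-1)
        weighted = torch.einsum('bij,bjd->bid', S_bar, f_prior)   # weighted prior features
        rho_agg = torch.einsum('bij,bjc->bic', S_bar, rho)        # and their coordinates
        f_tdot = self.fuse(torch.cat([weighted, rho_agg], dim=-1))  # Eq. 26
        x, _ = self.attn(f_tdot, f_temp, f_temp)                  # MultiHead(F_tdot, F_temp, F_temp)
        f_aug = self.norm1(f_tdot + x)
        return self.norm2(f_aug + self.ffn(f_aug))                # Eq. 27
```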

4.1.4 Prior-Guided Keypoint Generation and Matching

Given the group of augmented representations $\{\tilde{F}_{obj}^{(t)},\tilde{F}_{temp}^{(t)},\tilde{F}_{aug}^{(t)}\}$, we now generate the 3D keypoints used for the final 6-DoF pose tracking. Different from the prior work 6-PACK [5], where unsupervised keypoint generation may result in a local optimum, we dynamically adapt keypoint generation based on the structural similarity between the categorical prior $P_{r}$ and the observable point patch $P^{(t)}$ in the current frame. To this end, we introduce an auxiliary module that converts $P^{(t)}$ into $n$ object keypoints $[k_{1},\ldots,k_{n}]$. As shown on the right of Fig. 3, we apply a low-rank Transformer network with $\{\tilde{F}_{obj}^{(t)},\tilde{F}_{temp}^{(t)},\tilde{F}_{aug}^{(t)}\}$ as query, key and value to estimate a structure-regularized projection matrix $M\in\mathbb{R}^{n\times N}$ for each category. To encourage the keypoint transformation $M$ to adapt to the intra-class structural variation among different instances, we utilize the shape prior $P_{r}$ to optimize this auxiliary module by minimizing the loss $L_{aux}$ during training:

(28)

where $P_{r}^{K}=M\times P_{r}$ denotes the $n$ prior-based keypoints. This formulation effectively ensures that the 3D space of the keypoints is structurally consistent with the shape prior, regardless of the pose change over time.
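For illustration, the small sketch below shows how a predicted projection matrix $M$ maps a point set to keypoints; the low-rank Transformer that predicts $M$ and the exact form of $L_{aux}$ in Eq. (28) are not reproduced here, and all names are ours.

```python
import torch

def project_keypoints(M, points):
    """Map a point set (B, N, 3) to n keypoints with a per-sample projection
    matrix M of shape (B, n, N), i.e. k = M x P (Sec. 4.1.4)."""
    return torch.einsum('bnk,bkc->bnc', M, points)

# Usage sketch (names are illustrative):
#   M   = keypoint_transformer(F_obj, F_temp, F_aug)   # (B, n, N), predicted per frame
#   k_t = project_keypoints(M, P_t)                    # keypoints of the current frame
#   PrK = project_keypoints(M, P_r)                    # prior-based keypoints P_r^K = M x P_r
#   L_aux then compares PrK with the shape prior to enforce structural consistency.
```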

We then adopt a 3D tri-plane as a compact feature representation of the projected keypoints, based on the 2D formulation in [55]. We align the embedding group $\{\tilde{F}_{obj}^{(t)},\tilde{F}_{temp}^{(t)},\tilde{F}_{aug}^{(t)}\}$ along three axis-aligned orthogonal feature planes by projecting them onto the tri-plane $\{T_{XY},T_{YZ},T_{XZ}\}$ using the known camera intrinsics. In our implementation, each plane has dimensions $N\times d_{T}$. For any object keypoint, we project it onto each plane, query the corresponding point features $\{T_{xy},T_{yz},T_{xz}\}$ via nearest-neighbor point interpolation, and concatenate them into the final keypoint feature $\tilde{F}_{kp}^{(t)}$.
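The following sketch illustrates the tri-plane query for a batch of keypoints, assuming each plane stores one $d_{T}$-dimensional feature per projected observation point and that nearest-neighbor interpolation reduces to a nearest-point lookup; the axis conventions and names are illustrative assumptions.

```python
import torch

def query_triplane(kp, pts, plane_feats):
    """Nearest-neighbour tri-plane query.

    kp:          (B, n, 3)  keypoints to featurize
    pts:         (B, N, 3)  observed points that populate the planes
    plane_feats: dict 'xy' / 'yz' / 'xz' -> (B, N, dT) per-point plane features
    Returns the concatenated keypoint features of shape (B, n, 3 * dT).
    """
    axes = {'xy': (0, 1), 'yz': (1, 2), 'xz': (0, 2)}
    out = []
    for name, (a, b) in axes.items():
        q = kp[..., [a, b]]                    # (B, n, 2) keypoint projections
        p = pts[..., [a, b]]                   # (B, N, 2) stored point projections
        d = torch.cdist(q, p)                  # (B, n, N) pairwise 2D distances
        idx = d.argmin(dim=-1)                 # nearest stored point per keypoint
        dT = plane_feats[name].shape[-1]
        f = torch.gather(plane_feats[name], 1, idx.unsqueeze(-1).expand(-1, -1, dT))
        out.append(f)
    return torch.cat(out, dim=-1)              # final keypoint feature F_kp
```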

Since the projection matrix $M$ is identical across consecutive frames, a rough match between the keypoint pairs of the two frames is already established. We then perform a finer keypoint matching to filter possible outliers among these coarse matches. Following [56], a score matrix $S_{kp}$ is calculated from the similarity between the two sets of keypoint features $\tilde{F}_{kp}^{(t-1)}$ and $\tilde{F}_{kp}^{(t)}$ in the previous and current frames:

\[
S_{kp}(i,j)=\frac{1}{\tau}\cdot\langle\tilde{F}_{kp}^{(t-1)}[k_{i}],\,\tilde{F}_{kp}^{(t)}[k_{j}]\rangle,\quad i,j=1,\ldots,n, \qquad (29)
\]

where $\tau$ is a scale factor. We also apply a dual-softmax operator [57] on both dimensions of $S_{kp}$ to obtain the keypoint-pair matching probability:

\[
\mathcal{P}_{c}(i,j)=\mathrm{softmax}(S_{kp}(i,\cdot))_{j}\cdot\mathrm{softmax}(S_{kp}(\cdot,j))_{i}. \qquad (30)
\]

With the confidence matrix $\mathcal{P}_{c}$, we select finer keypoints whose confidence is higher than a threshold $\theta_{c}$, and further enforce the mutual nearest neighbor (MNN) criterion:

\[
\mathcal{M}_{c}=\{(i,j)\;|\;\forall(i,j)\in\mathrm{MNN}(\mathcal{P}_{c}),\;\mathcal{P}_{c}(i,j)\geq\theta_{c}\}. \qquad (31)
\]
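A compact sketch of the dual-softmax matching and the mutual-nearest-neighbor selection (Eqs. (29)-(31)) is given below; the value of the scale factor $\tau$ and the function names are assumptions, while $\theta_{c}=0.45$ follows the implementation details.

```python
import torch

def match_keypoints(f_prev, f_cur, tau=0.1, theta_c=0.45):
    """Sketch of the fine keypoint matching.
    f_prev, f_cur: (n, d) keypoint features of frames t-1 and t."""
    S = (f_prev @ f_cur.T) / tau                              # score matrix S_kp (Eq. 29)
    P = torch.softmax(S, dim=1) * torch.softmax(S, dim=0)     # dual-softmax (Eq. 30)
    row_best = P.argmax(dim=1)                                # best j for every i
    col_best = P.argmax(dim=0)                                # best i for every j
    matches = []
    for i in range(P.shape[0]):
        j = row_best[i].item()
        # keep (i, j) only if mutual nearest neighbours and confident enough (Eq. 31)
        if col_best[j].item() == i and P[i, j] >= theta_c:
            matches.append((i, j))
    return matches
```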

4.1.5 Training Supervision

To improve the performance of our keypoint generation module, we use the following multi-view consistency loss, which encourages the keypoints generated in two consecutive frames to match by placing each keypoint in the current view at its corresponding keypoint from the previous frame transformed by the ground-truth pose change:

\[
L_{\mathrm{mvc}}=\frac{1}{n}\sum\limits_{i}\big\|k_{i}^{(t)}-[\Delta R_{gt}^{(t)}|\Delta T_{gt}^{(t)}]\cdot k_{i}^{(t-1)}\big\|. \qquad (32)
\]

Meanwhile, to supervise the matching probability matrix $\mathcal{P}_{c}$, we follow LoFTR [56] and use the negative log-likelihood loss over the grids in $\mathcal{M}_{c}^{gt}$. We likewise use the camera poses and depth maps to compute the ground truth for $\mathcal{P}_{c}^{gt}$ and $\mathcal{M}_{c}^{gt}$:

\[
L_{\mathrm{c}}=-\frac{1}{|\mathcal{M}_{c}^{gt}|}\sum\limits_{(i,j)\in\mathcal{M}_{c}^{gt}}\log\mathcal{P}_{c}(i,j). \qquad (33)
\]

The above loss functions only guarantee that the generated keypoint pairs are robust to the change in pose; they do not ensure that these keypoints are optimal for estimating the final pose. In this regard, we use a differentiable pose tracking loss, which includes a translation loss and a rotation loss:

\[
L_{tra}=\Big\|\frac{1}{n}\sum\limits_{i}(k_{i}^{(t)}-k_{i}^{(t-1)})-\Delta T_{gt}^{(t)}\Big\|, \qquad (34)
\]
\[
L_{rot}=2\arcsin\Big(\frac{1}{2\sqrt{2}}\big\|\Delta R^{(t)}-\Delta R_{gt}^{(t)}\big\|\Big). \qquad (35)
\]

Therefore, the overall loss function can be determined as the weighted sum of all losses.
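For reference, the sketch below assembles the four supervision terms of Eqs. (32)-(35) under the assumption that the predicted inter-frame rotation $\Delta R^{(t)}$ has been recovered elsewhere from the matched keypoints; the loss weights of the overall sum are omitted, and the function signature is an illustrative assumption.

```python
import torch

def tracking_losses(k_prev, k_cur, dR_gt, dT_gt, P_c, gt_matches, dR_pred):
    """Supervision terms of Eqs. 32-35.
    k_prev, k_cur: (n, 3) keypoints of frames t-1 and t
    dR_gt, dT_gt:  ground-truth inter-frame rotation (3, 3) and translation (3,)
    P_c:           (n, n) matching probability matrix
    gt_matches:    list of ground-truth (i, j) pairs
    dR_pred:       (3, 3) rotation recovered from the matched keypoints."""
    # multi-view consistency (Eq. 32): current keypoints vs. transformed previous ones
    L_mvc = (k_cur - (k_prev @ dR_gt.T + dT_gt)).norm(dim=-1).mean()
    # matching supervision (Eq. 33): negative log-likelihood over ground-truth matches
    L_c = -torch.stack([torch.log(P_c[i, j]) for i, j in gt_matches]).mean()
    # pose tracking losses (Eqs. 34-35)
    L_tra = ((k_cur - k_prev).mean(dim=0) - dT_gt).norm()
    L_rot = 2.0 * torch.asin(torch.clamp(
        (dR_pred - dR_gt).norm() / (2.0 * 2.0 ** 0.5), -1.0, 1.0))
    return L_mvc, L_c, L_tra, L_rot
```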

Input: object 6-DoF pose $\mathcal{P}^{(t)}=[R^{(t)}|T^{(t)}]$ at time $t$; the desired object 6-DoF pose $P^{*}=[R^{*}|T^{*}]$.
Output: current servo action $\alpha^{(t)}\in\{\dot{\eta}^{(t)},v^{(t)}\}$.
// Output the low-level action to drive the aerial manipulator
$\Delta R\leftarrow(R^{(t)})^{T}\cdot R^{*}$; $\Delta T\leftarrow\mathrm{abs}(T^{*}-T^{(t)})$;
while $\Delta R\geq\delta_{R}$ and $\Delta T\geq\delta_{T}$ do
    // Obtain the rotational actions of the manipulator
    $u\theta\leftarrow\Delta R$; $\varepsilon_{r}\leftarrow\theta u^{T}-0$; compute the Jacobian matrix $L(u,\theta)$ with Eq. (38);
    if $\theta\to 0$ then
        $\dot{\eta}^{(t)}\leftarrow-\lambda_{r}J_{mr}^{+}\,[\,0_{3\times 3}\;\;I_{3\times 3}\,]^{+}\,\varepsilon_{r}$;
    else
        $\dot{\eta}^{(t)}\leftarrow-\lambda_{r}J_{mr}^{+}\,[\,0_{3\times 3}\;\;L(u,\theta)\,]^{+}\,\varepsilon_{r}$;
    end if
    // Obtain the translational actions of the aerial vehicle
    $\Delta m_{e}\leftarrow\Delta T$; $\varepsilon_{p}\leftarrow\Delta m_{e}$; compute the Jacobian matrix $J_{s}$ with Eq. (72);
    $\upsilon^{(t)}\leftarrow-J_{s}^{+}(\lambda_{p}L^{+}\varepsilon_{p}+\bar{J}_{s}\nabla)$;
end while
return $\alpha^{(t)}$.
Algorithm 1: Pose-Aware Discrete Servo Policy (PAD-Servo) for Aerial Manipulator

4.2 Pose-Aware Discrete Servo Policy: PAD-Servo

This module is designed to generate the action signals $\alpha^{(t)}\in\{\dot{\eta}^{(t)},v^{(t)}\}$ that accomplish the vision guidance task for the aerial manipulator, based on the targeted object's 6-DoF pose $\mathcal{P}^{(t)}=[R^{(t)}|T^{(t)}]$ in the current observation. As depicted in Fig. 5, we utilize the homography decomposition between the current and the desired observations to split the servo process into two parts: a rotational action loop for the onboard manipulator and a translational action loop for the aerial vehicle. Specifically, the rotational action signal is generated from the 3D rotation matrix $R^{(t)}$ in 3D Cartesian space, while the translational action signal is derived from the estimated 3D location $T^{(t)}$ in 2D image space. For the detailed algorithmic flow, please refer to Alg. 1. The desired observation refers to the image plane in which the onboard camera is positioned directly over the targeted object. It is worth emphasizing that, because no payload is carried and the manipulator experiences only minimal sway during the guidance process, we primarily focus on the robot's kinematic model rather than its dynamic model.

4.2.1 Rotational Action Loop for Onboard Manipulator

Given the current estimated 3D rotation $R^{(t)}$ and the corresponding desired value $R^{*}$, we can obtain the change of rotation, i.e., $\Delta R=(R^{(t)})^{T}\cdot R^{*}$. We express $\Delta R$ with the axis-angle vector $u\theta$, where $u$ is the rotation axis and $\theta$ is the rotation angle extracted from the rotation matrix $\Delta R$. The objective function for the rotational action is then defined as the error driving $\theta u^{T}$ toward zero, i.e., $\varepsilon_{r}=\theta u^{T}-0$, and its time derivative can be related to the camera velocity component generated by the onboard manipulator, $V_{C}^{(m)}$:

\[
\dot{\varepsilon}_{r}=[\begin{matrix}0&L(u,\theta)\end{matrix}]\,V_{C}^{(m)}, \qquad (37)
\]

where the Jacobian matrix $L(u,\theta)$ is

\[
L(u,\theta)=I_{3\times 3}-\frac{\theta}{2}L(u)+\Big(1-\frac{\mathrm{sinc}(\theta)}{\mathrm{sinc}^{2}(\theta/2)}\Big)L(u)^{2}, \qquad (38)
\]

where $\mathrm{sinc}(\theta)=\sin(\theta)/\theta$ and $L(u)$ is the antisymmetric matrix associated with $u$. The error $\varepsilon_{r}$ converges exponentially by imposing $\dot{\varepsilon}_{r}=-\lambda_{r}\varepsilon_{r}$, where $\lambda_{r}$ tunes the convergence rate. Meanwhile, owing to the special form of $L(u,\theta)$, we can set $L(u,\theta)=L(u,\theta)^{-1}=I_{3\times 3}$ for small values of $\theta$. We then compute the relationship between $V_{C}^{(m)}$ and the joint angular velocity vector $\dot{\eta}$ of the manipulator, which can be expressed as:

\[
\begin{split}
V_{C}^{(m)}&=\left[\begin{matrix}R_{B}^{C}&0\\0&R_{B}^{C}\end{matrix}\right]J_{m}\,\big[\begin{matrix}\dot{\eta}_{1}&\dot{\eta}_{2}&\dot{\eta}_{3}&\dot{\eta}_{4}\end{matrix}\big]^{T}\qquad(42)\\
&=\bar{R}_{B}^{C}J_{m}\,\dot{\eta}^{T},\qquad(43)
\end{split}
\]

where $J_{m}$ is the arm Jacobian matrix and $R_{B}^{C}$ is the rotation matrix of the base frame with respect to the camera frame. Finally, according to Eqs. (37) and (42), the rotational servo action law of the manipulator can be described as:

\[
\dot{\eta}^{(t)}=-\lambda_{r}J_{mr}^{+}\big[\begin{matrix}0_{3\times 3}&L(u,\theta)\end{matrix}\big]^{+}\varepsilon_{r}, \qquad (52)
\]

where $J_{mr}=\bar{R}_{B}^{C}J_{m}\in\mathbb{R}^{6\times 4}$.
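A numerical sketch of this rotational action loop is given below; it uses the small-angle simplification $L(u,\theta)\approx I_{3\times 3}$ discussed above rather than the full Eq. (38), and the helper names are our own. The gain $\lambda_{r}=0.25$ follows the implementation details.

```python
import numpy as np

def rotation_to_axis_angle(dR):
    """Axis-angle (u, theta) of a rotation matrix."""
    theta = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-6:
        return np.zeros(3), 0.0
    u = np.array([dR[2, 1] - dR[1, 2],
                  dR[0, 2] - dR[2, 0],
                  dR[1, 0] - dR[0, 1]]) / (2.0 * np.sin(theta))
    return u, theta

def rotational_action(R_t, R_star, J_mr, lam_r=0.25):
    """Sketch of the rotational servo law (Eq. 52) for the onboard manipulator.
    R_t, R_star: current / desired rotation (3, 3); J_mr: (6, 4) Jacobian R_B^C J_m."""
    dR = R_t.T @ R_star
    u, theta = rotation_to_axis_angle(dR)
    eps_r = theta * u                              # error theta * u^T driven to zero
    L_ut = np.eye(3)                               # small-angle approximation of L(u, theta)
    sel = np.hstack([np.zeros((3, 3)), L_ut])      # [0_{3x3}  L(u, theta)]
    return -lam_r * np.linalg.pinv(J_mr) @ np.linalg.pinv(sel) @ eps_r   # (4,) joint rates
```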

Figure 5: Complete flowchart of our proposed PAD-Servo. According to the object's 6-DoF pose $\mathcal{P}^{(t)}$ estimated by our Robust6DoF at the current $t$-th timestep, we introduce a decomposed policy to achieve comparable and robust aerial guidance for the aerial manipulator.

4.2.2 Translational Action Loop for Aerial Vehicle

For the translational action of the aerial vehicle, we define the corresponding objective function as driving $m_{e}$ toward the desired value $m_{e}^{*}$, i.e., $\varepsilon_{p}=(m_{e}-m_{e}^{*})^{T}$, where $m_{e}$ is the extended image coordinate:

\[
m_{e}=[\begin{matrix}x&y&z\end{matrix}]^{T}=[\begin{matrix}X/Z&Y/Z&\log Z\end{matrix}]^{T}, \qquad (55)
\]

where $T^{(t)}=[\begin{matrix}X&Y&Z\end{matrix}]^{T}$ is the 3D location of the targeted object in the current observation. Similarly, the time derivative of this error function can be related to the camera velocity component from the aerial vehicle, $V_{C}^{(a)}$:

\[
\dot{\varepsilon}_{p}=[\begin{matrix}L_{Z}&L(x,y)\end{matrix}]\,V_{C}^{(a)}=L\,V_{C}^{(a)}, \qquad (57)
\]

where $L(x,y)$ is:

\[
L(x,y)=\left[\begin{matrix}xy&-(1+x^{2})&y\\1+y^{2}&-xy&-x\\-y&x&0\end{matrix}\right], \qquad (61)
\]

and the upper-triangular matrix $L_{Z}$ is given by:

\[
L_{Z}=\frac{1}{Z}\left[\begin{matrix}-1&0&x\\0&-1&y\\0&0&-1\end{matrix}\right], \qquad (65)
\]

where the matrix $L_{Z}$ is obtained by decomposing the homography matrix between the current and desired observations. For additional details about the homography decomposition, we refer the reader to [58].

Similar to $\varepsilon_{r}$, we also impose exponential convergence on $\varepsilon_{p}$, i.e., $\dot{\varepsilon}_{p}=-\lambda_{p}\varepsilon_{p}$, with the convergence rate $\lambda_{p}$. In this way, the camera velocity component $V_{C}^{(a)}$ can be expressed as:

\[
V_{C}^{(a)}=-\lambda_{p}L^{+}\varepsilon_{p}, \qquad (66)
\]

where $L^{+}$ is the Moore-Penrose pseudo-inverse of $L$. Meanwhile, the relationship between the velocities of the aerial vehicle and the camera velocity component $V_{C}^{(a)}$ can be expressed as:

(71)

where $r_{C}^{B}$ is the distance vector between the base frame and the camera frame. Owing to the underactuation of the aerial vehicle, we remove the uncontrollable variables $\nabla=(\omega_{x},\omega_{y})^{T}$ from the translational and angular velocity vector of the aerial vehicle:

\[
V_{C}^{(a)}=J_{s}\upsilon+\bar{J}_{s}\nabla, \qquad (72)
\]

where $\bar{J}_{s}$ is the Jacobian formed by the columns of $\bar{r}_{C}^{B}$ corresponding to $\omega_{x}$ and $\omega_{y}$, and $J_{s}$ is the Jacobian formed by all other columns of $\bar{r}_{C}^{B}$, corresponding to $\upsilon=(v_{x},v_{y},v_{z},\omega_{z})^{T}$. According to Eqs. (66) and (72), the translational servo action law of the aerial vehicle can be formulated as:

\[
\upsilon^{(t)}=-J_{s}^{+}(\lambda_{p}L^{+}\varepsilon_{p}+\bar{J}_{s}\nabla). \qquad (73)
\]
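The translational loop can be sketched as follows; the Jacobians $J_{s}$ and $\bar{J}_{s}$ and the unactuated rates $\nabla=(\omega_{x},\omega_{y})^{T}$ are assumed to be provided by the vehicle model, and all names are illustrative. The gain $\lambda_{p}=0.27$ follows the implementation details.

```python
import numpy as np

def translational_action(T_t, m_e_star, J_s, J_s_bar, grad, lam_p=0.27):
    """Sketch of the translational servo law for the aerial vehicle (Eqs. 55-73).
    T_t:      (3,) estimated object location [X, Y, Z] in the camera frame
    m_e_star: (3,) desired extended image coordinates
    J_s:      (6, 4) Jacobian of the controllable velocities (vx, vy, vz, wz)
    J_s_bar:  (6, 2) Jacobian of the unactuated rates grad = (wx, wy)."""
    X, Y, Z = T_t
    x, y = X / Z, Y / Z
    m_e = np.array([x, y, np.log(Z)])                  # extended image coordinates (Eq. 55)
    eps_p = m_e - m_e_star
    L_Z = (1.0 / Z) * np.array([[-1, 0, x], [0, -1, y], [0, 0, -1]])
    L_xy = np.array([[x * y, -(1 + x ** 2), y],
                     [1 + y ** 2, -x * y, -x],
                     [-y, x, 0]])
    L = np.hstack([L_Z, L_xy])                         # (3, 6) interaction matrix (Eq. 57)
    # translational servo action law (Eq. 73)
    return -np.linalg.pinv(J_s) @ (lam_p * np.linalg.pinv(L) @ eps_p + J_s_bar @ grad)
```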

5 Experiments

In this section, we first present extensive quantitative comparisons on four widely used public datasets to evaluate the performance of the proposed category-level 6-DoF pose tracker, Robust6DoF, against currently available state-of-the-art baselines. We also perform numerous ablation studies and additional analyses to verify the advantages of each component of our method. In addition, to further test the effectiveness of the proposed complete pipeline, we implement a visual guidance experiment that directly uses our model trained on the public dataset, together with the proposed PAD-Servo, to control a real-world aerial robot platform, namely an aerial manipulator in our Robotic Laboratory.

5.1 Experimental Setup

5.1.1 Datasets

We evaluate Robust6DoF on four public datasets, i.e., the NOCS-REAL275 [10], YCB-Video [17], YCBInEOAT [18] and Wild6D [19] datasets. The NOCS-REAL275 dataset was proposed by Wang et al. [10] and contains six categories: bottle, bowl, camera, can, laptop and mug. It includes 13 real-world scenes, with 7 scenes (4.3K RGB-D images) for training and 6 scenes (2.7K RGB-D images) for testing. The training and testing sets include 18 real object instances across these six categories. The YCB-Video dataset was introduced in [17] and consists of both real-world and synthetic images. We use only its real-world data, which comprises 92 videos captured in various settings with an RGB-D camera; 80 of these videos are used for training and the remaining 12 for testing. The YCBInEOAT dataset [18] considers five YCB-Video objects: mustard bottle, tomato soup can, sugar box, bleach cleanser and cracker box. It contains 9 video sequences captured by a static RGB-D camera. Wild6D [19] is a large-scale RGB-D dataset consisting of 5,166 videos (over 1.1 million images) that feature 1,722 different object instances across five categories: bottle, bowl, camera, laptop and mug. Following the creators' instructions, we treat 486 videos of 162 instances as the test set.

5.1.2 Evaluation Metrics

We use the following four types of evaluation metrics:

  • IoU_x. It measures the average precision for various IoU-overlap thresholds, computed from the overlap between the 3D bounding boxes given by the predicted pose and the ground-truth pose.

  • a°b cm. It quantifies the pose estimation error, counting a prediction as correct when the rotation error is less than a° and the translation error is less than b cm. We adopt 5°2cm, 5°5cm, 10°2cm and 10°5cm for evaluation.

  • ADD(S). Used for instance-level 6-DoF pose tracking. ADD measures the average distance between the ground-truth 3D model and the model transformed by the predicted pose; ADD-S is its variant for symmetric objects.

  • R_err (T_err). These terms measure the average rotation error (degrees) and translation error (centimeters), and are used for category-level pose tracking. A minimal computational sketch of these metrics is given below.
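The sketch below shows how the a°b cm criterion and the ADD/ADD-S distances can be computed from a predicted and a ground-truth pose. It ignores the special handling of symmetric categories in the rotation error, and all names are illustrative rather than the authors' evaluation code.

```python
import numpy as np

def rot_trans_error(R_pred, t_pred, R_gt, t_gt):
    """Rotation error (degrees) and translation error (cm); translations in meters."""
    cos_angle = np.clip((np.trace(R_pred @ R_gt.T) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)), np.linalg.norm(t_pred - t_gt) * 100.0

def within_a_deg_b_cm(R_pred, t_pred, R_gt, t_gt, a=5.0, b=5.0):
    """True if the pose error satisfies the a°b cm criterion (e.g. 5°5cm)."""
    r_err, t_err = rot_trans_error(R_pred, t_pred, R_gt, t_gt)
    return r_err < a and t_err < b

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt, symmetric=False):
    """ADD / ADD-S: mean distance between the model under the predicted pose
    and under the ground-truth pose (closest-point distance for symmetric objects)."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    if not symmetric:
        return np.mean(np.linalg.norm(pred - gt, axis=1))
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return np.mean(dists.min(axis=1))
```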

5.1.3 Implementation Details

All the building blocks of the Robust6DoF network are trained using an ADAM optimizer with an initial learning rate of $10^{-3}$ and a batch size of 32. The number of training epochs is set to 50. The experiments on the public datasets were conducted on a desktop computer with an Intel Xeon Gold 6226R@2.90GHz processor and a single NVIDIA RTX A6000 GPU. We trained our model on the NOCS-REAL275 dataset and fine-tuned it on the YCB-Video dataset. The number of points in the partially visible point patch $P^{(t)}$ and in the prior shape points $P_r$ are both set to $N = N_r = 2048$. The number of generated keypoints is $n = 512$. The confidence threshold is set to $\theta_c = 0.45$. In the real-world experiments, we use Mask-RCNN for segmentation, as in [10]. The gains of our PAD-Servo in Eq. (52) and Eq. (73) are empirically set as $\lambda_r = 0.25$ and $\lambda_p = 0.27$, and the guidance-end thresholds are set to $\delta_R = 0.075$ and $\delta_T = 0.040$.
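For reference, the reported optimization setup (ADAM, learning rate $10^{-3}$, batch size 32, 50 epochs) corresponds to a training loop of the following form. The network, loss and data below are stand-ins, since the Robust6DoF modules themselves are not reproduced here.

```python
import torch
import torch.nn as nn

class DummyPoseHead(nn.Module):
    """Placeholder for the Robust6DoF network: maps N=2048 observed points to a pose."""
    def __init__(self, num_points=2048):
        super().__init__()
        self.fc = nn.Linear(num_points * 3, 7)  # e.g. quaternion (4) + translation (3)

    def forward(self, points):
        return self.fc(points.flatten(1))

model = DummyPoseHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):                       # 50 training epochs
    for _ in range(10):                       # stand-in for iterating a DataLoader
        points = torch.randn(32, 2048, 3)     # batch size 32, N = 2048 points
        target = torch.randn(32, 7)           # placeholder pose targets
        loss = nn.functional.mse_loss(model(points), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```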

TABLE I: Quantitative comparison of category-level 6-DoF pose estimation on the public NOCS-REAL275 dataset. The best and the second-best results are highlighted in bold and underlined, respectively. The results are averaged over all six categories. The results of current state-of-the-art baselines are taken from their original papers; a dash denotes that no result is reported in the original paper.
Method | Training Data | Shape Prior | IoU25↑ | IoU50↑ | IoU75↑ | 5°2cm↑ | 5°5cm↑ | 10°2cm↑ | 10°5cm↑ | #Params (M)↓
NOCS [10] [CVPR2019] | RGB | ✗ | 84.9 | 80.5 | 30.1 | 7.2 | 10 | 13.8 | 25.2 | -
SPD [12] [ECCV2020] | RGB-D | ✓ | 83.4 | 77.3 | 53.2 | 19.3 | 21.4 | 43.2 | 54.1 | 18.3
SGPA [13] [ICCV2021] | RGB-D | ✓ | - | 80.1 | 61.9 | 35.9 | 39.6 | 61.3 | 70.7 | 23.3
CR-Net [59] [IROS2021] | RGB-D | ✓ | - | 79.3 | 55.9 | 27.8 | 34.3 | 47.2 | 60.8 | 21.4
CenterSnap [60] [ICRA2022] | RGB-D | ✗ | - | 80.2 | - | - | 27.2 | - | 58.8 | -
ShAPO [61] [ECCV2022] | RGB-D | ✗ | - | 79.0 | - | - | 48.8 | - | 66.8 | -
TTA-COPE [43] [CVPR2023] | RGB-D | ✗ | - | 69.1 | 39.7 | 30.2 | 35.9 | 61.7 | 73.2 | -
IST-Net [62] [ICCV2023] | RGB-D | ✗ | 84.3 | 82.5 | 76.6 | 47.5 | 53.4 | 72.1 | 80.5 | -
FS-Net [37] [CVPR2021] | D | ✗ | 84.0 | 81.1 | 63.5 | 19.9 | 33.9 | - | 69.1 | 41.2
UDA-COPE [42] [CVPR2022] | D | ✗ | - | 79.6 | 57.8 | 21.2 | 29.1 | 48.7 | 65.9 | -
SAR-Net [39] [CVPR2022] | D | ✗ | - | 79.3 | 62.4 | 31.6 | 42.3 | 50.3 | 68.3 | 6.3
GPV-Pose [63] [CVPR2022] | D | ✗ | 84.1 | 83.0 | 64.4 | 32.0 | 42.9 | 55.0 | 73.3 | 8.6
HS-Pose [44] [CVPR2023] | D | ✗ | 84.2 | 82.1 | 74.7 | 46.5 | 55.2 | 68.6 | 82.7 | -
Query6DoF [40] [ICCV2023] | D | ✓ | - | 82.5 | 76.1 | 49.0 | 58.9 | 68.7 | 83.0 | -
GPT-COPE [33] [TCSVT2023] | D | ✓ | - | 82.0 | 70.4 | 45.9 | 53.8 | 63.1 | 77.7 | 7.1
Ours | RGB-D | ✓ | 89.8 | 87.0 | 82.5 | 57.1 | 70.6 | 75.2 | 84.5 | 6.0
TABLE II: Quantitative comparison of category-level 6-DoF pose tracking on the public NOCS-REAL275 dataset. The best and the second-best results are highlighted in bold and underlined, respectively. The results of the available baselines are taken from their original papers.
Method | ICP [5] | Oracle ICP [6] | 6-PACK [5] | 6-PACK w/o temporal [5] | CAPTRA [6] | CAPTRA +RGB seg. [6] | MaskFusion [64] | Ours
Input | Depth | Depth | RGB-D | RGB-D | Depth | RGB-D | RGB-D | RGB-D
Initialization | GT. | GT. | GT. | Pert. | Pert. | Pert. | GT. | GT.
5°5cm↑ | 16.9 | 0.65 | 28.9 | 22.1 | 62.2 | 63.6 | 26.5 | 70.6
IoU25↑ | 47.0 | 14.7 | 55.4 | 53.6 | 64.1 | 69.2 | 64.9 | 89.8
R_err↓ | 48.1 | 40.3 | 19.3 | 19.7 | 5.9 | 6.4 | 28.5 | 5.2
T_err↓ | 10.5 | 7.7 | 3.3 | 3.6 | 7.9 | 4.2 | 8.3 | 3.0

5.2 Quantitative Comparisons on the Public Datasets

5.2.1 Results on the NOCS-REAL275 Dataset

We first conduct both category-level 6-DoF pose tracking and estimation on the testing set of the NOCS-REAL275 dataset. The quantitative results are presented in TABLE I, TABLE II and Fig. 6. As shown in TABLE I, we compare our approach with 15 state-of-the-art single-estimation methods. These baselines either take RGB(-D) as inputs or use only point cloud features (D), and they can be divided into two groups: shape prior-based and prior-free methods. In detail, we outperform the pioneering work NOCS [10] by 52.4 in IoU75, 49.9 in 5°2cm and 60.6 in 5°5cm. Compared with prior-free methods, we also achieve better results than existing approaches. In particular, we outperform Query6DoF [40], currently the strongest method, by 57.1 vs. 49.0 on 5°2cm, 70.6 vs. 58.9 on 5°5cm and 75.2 vs. 68.7 on 10°2cm. As for prior-based methods, we also show significant improvements on nearly all evaluation metrics by large margins. For example, we reach 87.0, 82.5 and 84.5 in terms of IoU50, IoU75 and 10°5cm, outperforming the most competitive representative work SGPA [13] by 6.9%, 20.6% and 13.8%. Notably, our model has the fewest parameters among all baselines, indicating its low computational cost.

In addition, we summarize the quantitative results for category-level object 6-DoF pose tracking in TABLE II. We compare our method with the currently available state-of-the-art tracking methods: the classic ICP approach and its improved version Oracle ICP [6], 6-PACK [5] and CAPTRA [6] along with their variants, and MaskFusion [64]. It is worth noting that our method also achieves the best performance on all tracking-based evaluation metrics. The corresponding qualitative comparisons are presented in Fig. 6, arranged from left to right in time sequence. They further show that our tracking results match the ground truth more closely than those of CAPTRA [6] and 6-PACK [5].

Refer to caption
Figure 6: Visualization comparison on the NOCS-REAL275 dataset. We compare Robust6DoF with representative category-level 6-DoF pose tracking methods (6-PACK [5] and CAPTRA [6]). Yellow and green represent the predictions and the ground-truth labels, respectively.
TABLE III: Quantitative comparison of instance-level 6-DoF pose tracking on the public YCB-Video dataset. The best and the second-best results are highlighted in bold and underlined, respectively.
Objects | PoseCNN [17]: ADD / ADD-S | CatTrack [9]: ADD / ADD-S | Ours: ADD / ADD-S
002 master chef can 50.9 84.0 82.5 86.3 90.3 91.2
003 cracker box 51.7 76.9 86.2 91.7 92.1 91.7
004 sugar box 68.6 84.3 83.6 92.0 88.5 89.2
005 tomato soup can 66.0 80.9 84.3 88.6 86.5 90.3
006 mustard bottle 79.9 90.2 85.9 90.2 91.4 90.3
007 tuna fish can 70.4 87.9 84.7 91.5 88.7 89.2
008 pudding box 92.9 79.0 73.4 85.8 82.5 86.4
009 gelatin box 75.2 87.1 90.8 93.9 90.9 91.4
010 potted meat can 59.6 78.5 66.7 75.9 80.1 82.5
011 banana 72.3 85.9 76.8 82.4 81.2 83.5
019 pitcher base 52.5 76.8 84.1 92.8 86.4 88.8
021 bleach cleanser 50.5 71.9 73.4 80.5 79.8 80.9
024 bowl 6.5 69.7 33.6 89.8 85.4 86.5
025 mug 57.7 78.0 72.1 83.9 77.2 80.4
035 power drill 55.1 72.8 71.3 86.0 79.5 82.6
036 wood block 31.8 65.8 28.6 62.3 76.8 81.3
037 scissors 35.8 56.2 64.9 74.3 74.9 80.0
040 large marker 58.0 71.4 70.8 83.4 81.0 82.7
051 large clamp 25.0 49.9 66.8 78.1 76.9 79.4
052 extra large clamp 15.8 47.0 49.8 77.2 72.1 77.7
061 foam brick 40.4 87.8 86.0 93.4 89.9 92.1
Average 53.7 75.9 72.2 84.8 83.4 85.6
Refer to caption
Figure 7: Visualization comparison on the YCB-Video dataset. We compare Robust6DoF with a representative instance-level baseline (PoseCNN [17]). In line with PoseCNN, each object shape model is transformed with the predicted pose and then projected into the 2D images.
TABLE IV: Quantitative comparison on the public YCBInEOAT dataset. We measure using the ADD and ADD-S metrics. The best and the second-best results are highlighted in bold and underlined, respectively. The results of RGF [65], POT [66], MaskFusion [64] and TEASER [67] are all summarized from the literature [68].
Method | RGF [65] | POT [66] | MaskFusion [64] | TEASER [67] | Ours
Setting | 3D Model | 3D Model | No Model | No Model | No Model
003 cracker box ADD 34.78 79.00 79.74 63.24 80.51
ADD-S 55.44 88.13 88.28 81.35 86.32
021 bleach cleanser ADD 29.40 61.47 29.83 61.83 79.25
ADD-S 45.03 68.96 43.31 82.45 83.19
004 sugar box ADD 15.82 86.78 36.18 51.91 89.11
ADD-S 16.87 92.75 45.62 81.42 94.42
005 tomato soup can ADD 15.13 63.71 5.65 41.36 85.77
ADD-S 26.44 93.17 6.45 71.61 90.79
006 mustard bottle ADD 56.49 91.31 11.55 71.92 92.23
ADD-S 60.17 95.31 13.11 88.53 96.36
Average ADD 29.98 78.28 35.07 57.91 85.37
ADD-S 39.90 89.18 41.88 81.17 90.22
TABLE V: Quantitative comparison on the public Wild6D dataset. The results of the state-of-the-art methods are summarized from [19]. The best and the second-best results are highlighted in bold and underlined, respectively.
Method | Prior | IoU50 | 5°2cm | 5°5cm | 10°5cm
SPD [12] [ECCV2020] | ✓ | 32.5 | 2.6 | 3.5 | 13.9
SGPA [13] [ICCV2021] | ✓ | 63.6 | 26.2 | 29.2 | 39.5
DualPoseNet [36] [ICCV2021] | ✗ | 70.0 | 17.8 | 22.8 | 36.5
CR-Net [59] [IROS2021] | ✓ | 49.5 | 16.1 | 19.2 | 36.4
RePoNet [19] [NeurIPS2022] | ✓ | 70.3 | 29.5 | 34.4 | 42.5
GPV-Pose [63] [CVPR2022] | ✗ | 67.8 | 14.1 | 21.5 | 41.1
GPT-COPE [33] [TCSVT2023] | ✓ | 66.1 | 29.8 | 35.6 | 42.3
Ours | ✓ | 75.1 | 31.2 | 44.4 | 50.9
TABLE VI: Pose tracking speed in FPS. The best and the second-best results are highlighted in bold and underlined, respectively. All speeds are measured on a single NVIDIA RTX A6000 GPU.
Method | NOCS [10] | SPD [12] | SGPA [13] | 6-PACK [5] | CAPTRA [6] | Ours
Type | Track-free | Track-free | Track-free | Track-based | Track-based | Track-based
NOCS-REAL275 5.24 15.23 14.12 4.03 10.35 24.20
Wild6D 5.44 14.22 13.58 4.98 11.23 23.83
YCB-Video 6.39 15.74 14.52 5.01 12.44 23.27
Refer to caption
Figure 8: Visualization comparison on the YCBInEOAT dataset. We compare our proposed Robust6DoF with a representative baseline (6-PACK [5]). Yellow and green represent the predictions and the ground-truth labels, respectively.
Refer to caption
Figure 9: Visualization comparison on the Wild6D dataset. We compare our approach with representative baselines (SGPA [13] and SPD [12]). Yellow and green represent the predictions and the ground truth, respectively.

5.2.2 Results on the YCB-Video Dataset

To further verify the generalization ability of our method for instance-level pose tracking, we evaluate our model without fine-tuning on the testing set of the YCB-Video dataset. We compare our performance with relevant instance-level detection-based (PoseCNN [17]) and tracking-based (CatTrack [9]) baselines, presenting the average results across all 21 classes in TABLE III. It can be observed that our method performs well in improving object pose tracking. Specifically, our model without fine-tuning achieves the highest average accuracy of 83.4% and 85.6% in the ADD and ADD-S metrics, respectively. Visualization comparisons between our predictions and PoseCNN [17] are provided in Fig. 7. They further demonstrate that our proposed approach can predict higher-quality pose tracking results for unseen objects.

5.2.3 Results on the YCBInEOAT Dataset

To verify the effectiveness of 6-DoF pose tracking in tabletop robotic manipulation scenarios, and to evaluate performance when objects are moving in front of a static camera, we compare our Robust6DoF on the YCBInEOAT dataset with several available baselines, including 3D model-based methods (RGF [65] and POT [66]) and model-free methods (MaskFusion [64] and TEASER [67]). The corresponding quantitative and qualitative results are displayed in TABLE IV and Fig. 8. Overall, Robust6DoF achieves the best performance on both average metrics. Specifically, we outperform TEASER, the latest state-of-the-art baseline, by 27.5% on ADD and 9.1% on ADD-S. As shown in the figure, the pose tracking results of our Robust6DoF are closer to the ground truth than those of 6-PACK [5]. These analyses demonstrate that our proposed method not only achieves better performance for static objects but also generalizes well to dynamic instances captured by a fixed camera.

5.2.4 Results on the Wild6D Dataset

To assess the generalization ability of our method in handling crowded objects in real-world cluttered scenes, we conduct evaluations on the public Wild6D dataset. We directly test our trained model against existing works, as reported in TABLE V; their pre-trained models, trained on the NOCS-REAL275 and CAMERA75 datasets [10], are used for comparison. Our proposed method achieves 75.1%, 31.2%, 44.4% and 50.9% on IoU50, 5°2cm, 5°5cm and 10°5cm, respectively, outperforming the available baselines on almost all metrics. This significant improvement shows the superior generalization ability of our approach under crowded settings in the wild. Additionally, we perform a qualitative comparison of pose tracking between our method, SGPA [13] and SPD [12] on the Wild6D testing set; the results are displayed in Fig. 9. Our predictions match the ground truth more closely than those of the single-estimation methods SGPA and SPD. These analyses and results showcase the potential of our Robust6DoF.

5.2.5 Pose Tracking Speed in FPS

Beyond the performance comparison with the state of the art, we further evaluate the tracking speed (FPS) of five typical baselines: NOCS [10], SPD [12], SGPA [13], 6-PACK [5] and CAPTRA [6]. As summarized in TABLE VI, all methods are tested on the same device using their officially released code or checkpoints to ensure a fair evaluation. Our method achieves an average speed of 24.2 FPS on the NOCS-REAL275 dataset, 23.8 FPS on the Wild6D dataset and 23.3 FPS on the YCB-Video dataset, clearly outperforming the existing track-based and track-free approaches.
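The speeds in TABLE VI are per-frame throughputs; a simple way to measure them for any tracker is sketched below, assuming a per-frame callable and a short warm-up phase (the exact timing protocol used here is not spelled out in the text).

```python
import time

def measure_fps(track_fn, frames, warmup=10):
    """Average tracking speed in frames per second for a per-frame callable."""
    for frame in frames[:warmup]:              # warm up caches / GPU kernels
        track_fn(frame)
    start = time.perf_counter()
    for frame in frames[warmup:]:
        track_fn(frame)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```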

Refer to caption (a) SPD [12] results.
Refer to caption (b) Our results.
Figure 10: Comparison of mAP on the public NOCS-REAL275 dataset. Mean Average Precision (mAP) of our Robust6DoF and the representative baseline SPD [12] for various 3D IoU, rotation and translation error thresholds.

Refer to caption
Figure 11: Robustness evaluation of frame drops over time. Each point on the x-axis represents the number of consecutive frames dropped after the initial frame, and each point on the curve represents the mean success rate (5°5cm percentage) over the remaining, non-dropped interval of the sequence.

Refer to caption
Figure 12: Sensitivity evaluation of the number of keypoints. The x-axis indicates the number of generated keypoints. The left y-axis (blue) represents the 5°5cm and IoU25 percentages, and the right y-axis (red) represents the mean errors R_err and T_err. The results are averaged over all six categories.

Refer to caption
Figure 13: Tracking robustness evaluation against noisy pose inputs. The +n×Noise on the x-axis denotes adding n times the train-time noise. Orig. denotes the original setting, Init. denotes noise added only to the initial pose, and All. denotes noise added to every frame during training. The results of the comparison methods are summarized from [6].

5.3 Additional Analyses

To assess the pose tracking robustness of the proposed Robust6DoF, we conduct several additional experiments on the NOCS-REAL275 dataset; the detailed results are displayed in Fig. 10 to Fig. 13.

5.3.1 Comparison of Mean Average Precision (mAP)

To further analyze the performance of our method across instances within the same category, we also present detailed per-category results for 3D IoU, rotation accuracy and translation precision on the NOCS-REAL275 dataset. Meanwhile, to support our claim regarding the robustness of the proposed Robust6DoF to intra-class shape variations, we conduct a quantitative comparison with the related track-free method SPD [12]. As visualized in Fig. 10, we outperform SPD [12] in mean accuracy for almost all thresholds, especially for the 3D IoU and translation metrics.

5.3.2 Stability to Dropped Frames

Here, we examine how frame dropping affects tracking performance. We drop the N frames following the first frame and use the mean 5°5cm performance to evaluate the different baselines, as depicted in Fig. 11. The fewer frames dropped after the first frame, the easier it is to track. Note that the performance of almost all methods decreases as the number of dropped frames increases, except for NOCS, which is a track-free method and is therefore not influenced by dropped frames. Compared to the other track-based baselines, our method decreases by only 3.2% when dropping 75 frames, while the state-of-the-art CAPTRA [6] drops by 5%. Meanwhile, the performance of our method is about 10% higher than CAPTRA throughout the whole range and remains steady up to about 100 dropped frames.
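A sketch of this frame-drop protocol is given below: the N frames after the first one are removed and the mean 5°5cm success rate is computed on the remaining frames. The helper `within_5deg5cm` is assumed to implement the criterion from Section 5.1.2; both functions are illustrative, not the evaluation code used here.

```python
def drop_frames_after_first(sequence, n_dropped):
    """Keep the first frame and skip the following n_dropped frames."""
    return [sequence[0]] + list(sequence[1 + n_dropped:])

def mean_success_rate(pred_poses, gt_poses, within_5deg5cm):
    """Fraction of frames whose pose error is within 5 degrees and 5 cm."""
    hits = [within_5deg5cm(pred, gt) for pred, gt in zip(pred_poses, gt_poses)]
    return sum(hits) / max(len(hits), 1)
```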

5.3.3 Tracking Comparison to Pose Noise

This experiment validates the performance of our method under noisy pose inputs, where randomly sampled pose noise is added at training time. As shown in Fig. 13, we directly test our method under the following settings: (i) increasing the pose noise by 1 or 2 times, denoted as +n×Noise (n=1,2); (ii) adding the pose noise to the initial pose, denoted as Init.; and (iii) adding the pose noise to the pose prediction of every previous frame, denoted as All. We compare our method with the state-of-the-art methods 6-PACK [5] and CAPTRA [6] under the same settings and report the results under the 5°5cm metric in Fig. 13. Our method is clearly more robust to pose noise than the other baselines.
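A minimal way to generate such pose noise is to perturb each pose by a random rotation about a random axis and a random translation offset, scaled by the factor n. The noise magnitudes below are illustrative assumptions, not the values used in training.

```python
import numpy as np

def perturb_pose(R, t, n=1, rot_deg_std=5.0, trans_std=0.02,
                 rng=np.random.default_rng()):
    """Add n-times-scaled random rotation and translation noise to a pose (R, t)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(rng.normal(0.0, rot_deg_std * n))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R_noise = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)  # Rodrigues
    t_noise = rng.normal(0.0, trans_std * n, size=3)
    return R_noise @ R, t + t_noise
```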

5.3.4 Sensitivity to the Number of Generated Keypoints

We also evaluate the sensitivity of our proposed method to the number of generated keypoints, as shown in Fig. 12. We choose eight different settings, ranging from 100 to 800 keypoints. Our Robust6DoF achieves optimal performance with about n=500 keypoints (we set n=512 in the experiments), and the model does not appear to be very sensitive to this parameter.

5.4 Ablation Studies

TABLE VII: Ablation experiments on different components of our Robust6DoF. "WAS", "GDF", "STE", "SSF" and "AD" represent the weight-shared attention, global dense fusion, spatial-temporal filtering encoder, shape-similarity filtering and augmentation decoder, respectively.
# | WAS | GDF | STE | SSF | AD | NOCS-REAL275: IoU25 / 5°2cm / R_err / T_err | Wild6D: IoU50 / 5°2cm
M0 | ✗ | ✗ | ✗ | ✗ | ✗ | 63.5 / 30.1 / 19.6 / 14.3 | 57.6 / 20.7
M1 | ✓ | ✗ | ✗ | ✗ | ✗ | 70.0 / 33.2 / 14.3 / 9.9 | 62.0 / 22.7
M2 | ✓ | ✓ | ✗ | ✗ | ✗ | 71.2 / 33.5 / 14.8 / 10.6 | 62.3 / 23.0
M3 | ✓ | ✓ | ✓ | ✗ | ✗ | 80.8 / 50.4 / 8.9 / 7.3 | 60.0 / 25.8
M4 | ✓ | ✓ | ✓ | ✓ | ✗ | 90.2 / 55.4 / 5.5 / 3.7 | 74.8 / 30.0
M5 | ✓ | ✓ | ✓ | ✓ | ✓ | 89.8 / 57.1 / 5.2 / 3.0 | 75.1 / 31.2
TABLE VIII: Ablation study on keypoints generation manner of our Robust6DoF. ”unsupervised KP” means we use the unsupervised keypoints generation method introduced in 6-PACK [5].
Dataset | Settings | IoU50 | 5°2cm | 5°5cm | 10°5cm
NOCS-REAL275 Ours w/o prior guidance 58.6 33.3 35.9 62.2
Ours w/o finer matching 70.8 50.4 62.5 66.7
Ours + unsupervised KP 80.2 55.4 63.0 83.0
Ours 87.0 57.1 70.6 84.5
Wild6D Ours w/o prior guidance 65.2 26.1 29.6 35.0
Ours w/o finer matching 72.1 29.4 43.2 44.8
Ours + unsupervised KP 73.5 30.4 40.1 48.9
Ours 75.1 31.2 44.4 50.9
TABLE IX: Ablation study on the loss functions of our Robust6DoF. "L_base" contains both the rotation loss "L_rot" and the translation loss "L_tra".
# | L_base | L_aux | L_mvc | L_c | NOCS-REAL275: IoU25 / 5°2cm / R_err / T_err | Wild6D: IoU50 / 5°2cm
① | ✓ | ✗ | ✗ | ✗ | 82.2 / 48.7 / 10.2 / 9.3 | 68.6 / 25.4
② | ✓ | ✓ | ✗ | ✗ | 85.2 / 51.3 / 8.0 / 7.5 | 71.1 / 28.9
③ | ✓ | ✓ | ✓ | ✗ | 87.0 / 54.2 / 6.6 / 5.3 | 72.1 / 30.0
④ | ✓ | ✓ | ✓ | ✓ | 89.8 / 57.1 / 5.2 / 3.0 | 75.1 / 31.2
TABLE X: Ablation study on the robustness of our Robust6DoF to pose errors. "Init. ×n" and "All. ×n" mean adding n (n=1,2) times the train-time pose error to the initial pose and to all frames, respectively.
Dataset | Metric | Orig. | Init. ×1 | Init. ×2 | All. ×1 | All. ×2
NOCS-REAL275 | IoU25 | 89.9 | 88.8 | 87.1 | 88.2 | 87.9
NOCS-REAL275 | 5°5cm | 70.6 | 68.3 | 67.3 | 68.6 | 68.5
NOCS-REAL275 | R_err | 5.2 | 5.57 | 5.64 | 5.61 | 5.79
NOCS-REAL275 | T_err | 3.0 | 3.94 | 3.94 | 4.03 | 4.98
Wild6D | IoU50 | 75.1 | 73.4 | 72.7 | 74.0 | 72.8
Wild6D | 5°5cm | 44.4 | 43.0 | 42.1 | 42.8 | 42.6

5.4.1 Effectiveness Evaluation of Different Components

To evaluate the effectiveness of the individual components of our Robust6DoF, we conduct ablation studies on the public NOCS-REAL275 and Wild6D datasets and present the results in TABLE VII. We start with a base model and incrementally add each proposed component to this baseline. The base model, denoted as "M0", is built using the classical Scaled Dot-Product Multi-Head Attention and the Transformer in [69], along with the proposed keypoints generation and match module; its training strategy is consistent with Robust6DoF. First, the results of "M1" and "M2" in TABLE VII show that incorporating the weight-shared attention (WAS) layer into the base model yields a significant performance improvement, demonstrating the effectiveness of the proposed 2D-3D Dense Fusion module. Second, by comparing "M0", "M3" and "M5", we observe that the proposed Spatial-Temporal Filtering Encoder provides efficient dynamic enhancement that captures the temporal information and improves the inference ability. Our third experiment verifies the effectiveness of the proposed Augmentation Decoder with shape-similarity filtering, as shown by "M4" and "M5": without the "SSF" and "AD" blocks, the pose tracking performance is severely weakened. Our complete model "M5" outperforms all other variants in nearly all comparison experiments.

5.4.2 Comparison of Different Keypoints Generations

We also compare our proposed Prior-Guided Keypoints Generation and Match module with three alternative settings: (i) keypoint generation without prior guidance, (ii) our method using only the initial matching, and (iii) the unsupervised keypoint generation of 6-PACK [5]. As presented in TABLE VIII, settings (i) and (ii) perform the worst, while our proposed approach achieves the best performance. The unsupervised setting (iii) is slightly better than (i) and (ii), but its tracking robustness is noticeably worse due to the lack of shape-prior supervision. These results indicate that our proposed design is more effective in capturing the category-level relationships among different instances, making it more suitable for category-level pose tracking.

5.4.3 Impact of Different Loss Configurations

In TABLE IX, we compare the generalization capability under different loss combinations during the training stage. We start with the base losses, L_rot and L_tra, and incrementally add the other losses in order. The results in #① and #② demonstrate that the prior-guided auxiliary module is very important for keypoints generation. The results in #② and #③ indicate that supervising keypoint consistency is also critical for improving performance. Furthermore, we explore the impact of the proposed keypoints refine-matching block with the loss L_c: the results in #③ and #④ show its crucial role in capturing the more critical keypoints and the structural changes between the observable points and the prior points. Finally, our model #④ achieves the best performance under all loss supervisions.
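The ablation in TABLE IX corresponds to switching individual terms of a weighted sum on and off; a schematic combination is shown below, where the weights are placeholders rather than the values used for Robust6DoF.

```python
def total_loss(l_rot, l_tra, l_aux=0.0, l_mvc=0.0, l_c=0.0,
               w_aux=1.0, w_mvc=1.0, w_c=1.0):
    """Schematic training loss for the ablation: L_base (= L_rot + L_tra) plus
    optional auxiliary, consistency and refine-matching terms."""
    return (l_rot + l_tra) + w_aux * l_aux + w_mvc * l_mvc + w_c * l_c
```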

5.4.4 Robustness of Additional Pose Noises

TABLE X shows the detailed ablation experiments of our Robust6DoF with respect to added pose noise on the NOCS-REAL275 and Wild6D datasets. Following the same settings as in Section 5.3.3, we further examine the tracking robustness of Robust6DoF to extra pose noise, where we add one or two times the pose noise to the initial pose or to the pose prediction of every frame. As shown in TABLE X, the tracking performance of our model degrades only gradually, without a particularly severe drop, which further demonstrates the robustness of our proposed Robust6DoF. The visualization comparison with 6-PACK [5] and CAPTRA [6] is displayed in Fig. 13.

5.5 Real-World Experiments on an Aerial Robot

Refer to caption
Figure 14: Real-world experimental setup. All ground-truth information comes from measurements of every electronic unit.

In addition to the quantitative experiments on the proposed 6-DoF pose tracker, we further test the complete algorithm in a real-world experiment using the aerial robot developed in our Robotic Laboratory. As shown in Fig. 14, the experimental platform includes an OptiTrack indoor motion capture system and an aerial manipulator, which is mainly composed of a quadrotor, a 4-DoF robotic manipulator and a downward-looking RGB-D camera (RealSense D435i) that captures the real-time image data. The OptiTrack motion capture system communicates with the ground station through WiFi to record the ground-truth position of the quadrotor with respect to the global coordinate frame. A custom-made flight controller board (Pixhawk 4.0 with IMUs) provides the ground-truth angle and velocity information of the quadrotor, and an onboard computer (Jetson AGX Xavier) runs the closed-loop control of the whole system at 20 Hz. This onboard computer also records the ground-truth joint-angle velocity information of the robotic manipulator (four Dynamixel MX-28 motors). The proposed algorithm is implemented under the Robot Operating System on Ubuntu 20.04.

We consider two different aerial robotic scenarios, as displayed in Fig. 15. In the first case, the aerial manipulator autonomously guides itself to the neighborhood of fixed objects on a tabletop. In the second case, the aerial manipulator actively follows moving objects placed on a ground vehicle. We record the original RGB-D data flow online during the experiment and use an offline 3D labeling tool to obtain the required pose annotations. We compare our method with the representative tracking baseline 6-PACK [5], as shown on the right of Fig. 15. At the beginning, 6-PACK can detect and estimate each object's pose, but it gradually loses track when the camera's view changes drastically (as depicted in the red dotted box); a detailed video is provided on the project page. In contrast, our Robust6DoF achieves robust and effective tracking results. The same behavior is observed in both scene cases, which qualitatively demonstrates the robust performance of our proposed Robust6DoF in real-world aerial scenarios. We also record the action signals output during the first case. The time evolution of the quadrotor linear velocity of Eq. (73), $\upsilon=(v_x,v_y,v_z,\omega_z)^T$, and of the onboard manipulator joint-angle velocity of Eq. (52), $\dot{\eta}=(\dot{\eta}_1,\dot{\eta}_2,\dot{\eta}_3,\dot{\eta}_4)$, during the visual guidance process is shown in Fig. 16 and Fig. 17, respectively. It can be observed that the quadrotor and the manipulator successfully track the reference velocities using the real-time pose estimated by our Robust6DoF. The velocity errors converge to a neighborhood of zero when the whole experimental process ends at 28 s, at which point the current 6-DoF object pose is very close to the desired setting. The system converges quickly and successfully tracks all reference velocities. All these results show the good stability of our PAD-Servo scheme and the strong real-world performance of our Robust6DoF for guidance in aerial robotics manipulation.
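Conceptually, the guidance loop in this experiment runs at 20 Hz: it reads the latest pose from Robust6DoF, converts the pose error into the velocity commands of Eq. (73) and Eq. (52), and stops once the errors fall below the thresholds $\delta_R$ and $\delta_T$ given in Section 5.1.3. The sketch below reflects that structure only; every callable and the particular rotation-error measure are placeholders, not the PAD-Servo implementation.

```python
import time
import numpy as np

def guidance_loop(track_pose, desired_pose, translational_action, rotational_action,
                  send_quadrotor_vel, send_arm_vel, delta_R=0.075, delta_T=0.040,
                  rate_hz=20):
    """Closed-loop aerial guidance sketch: command v and eta_dot until the
    rotation / translation errors drop below delta_R / delta_T."""
    R_d, t_d = desired_pose
    period = 1.0 / rate_hz
    while True:
        R, t = track_pose()                          # current 6-DoF estimate (Robust6DoF)
        e_R = np.linalg.norm(R @ R_d.T - np.eye(3))  # one possible rotation-error measure
        e_T = np.linalg.norm(t - t_d)
        if e_R < delta_R and e_T < delta_T:          # guidance-end thresholds
            break
        send_quadrotor_vel(translational_action(R, t, R_d, t_d))  # v, Eq. (73)
        send_arm_vel(rotational_action(R, t, R_d, t_d))           # eta_dot, Eq. (52)
        time.sleep(period)
```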

6 Discussions and Future Works

In this paper, we focused on a practical robotics task, namely aerial vision guidance for aerial robotics manipulation. We first proposed a robust category-level 6-DoF pose tracker called Robust6DoF, which adopts a three-stage pipeline to track the pose of aerial objects by leveraging shape prior-guided keypoint alignment. Furthermore, we introduced a pose-aware discrete servo policy for the aerial manipulator, termed PAD-Servo, designed to handle the challenges of the real-time dynamic vision guidance task. Extensive experiments conducted on four public datasets demonstrate the effectiveness of our proposed Robust6DoF. Real-world experiments conducted on our aerial robotics platform further verify the practical significance of our method, including both the proposed Robust6DoF and PAD-Servo.

Although our method achieves effective and practical real-world performance, many challenges and limitations in this robotic vision field remain unresolved, such as how to deal with objects that suddenly appear in or disappear from the onboard camera's field of view, and how to use language, audio and other multi-modal information to achieve smarter and more autonomous behavior on unexplored tasks. These questions are worth exploring in our following works. In future work, we aim to establish a new dataset that provides researchers with a valuable resource for validating 6-DoF pose tracking in aerial situations. Meanwhile, aerial visual pose tracking under fast view changes remains an open problem. We believe that our work will contribute to the development and further advancement of the aerial robotic vision field.

Refer to caption
Figure 15: Real-world visualization comparison with 6-PACK [5]. Two scenes are considered: 1) tabletop fixed objects (upper part); 2) moving objects (bottom part). The results are estimated offline using the recorded real data flow. Yellow and green represent the estimations and the manually labeled annotations, respectively.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 16: Experimental results of the quadrotor velocity vector (v). The red curves represent the actual outputs generated by our PAD-Servo. The black curves represent the corresponding ground-truth historical states measured from the devices, where v_x, v_y, v_z are from OptiTrack and ω_z is from the Pixhawk 4.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 17: Experimental results of the onboard manipulator joint-angle velocity vector (η̇). The red curves represent the actual outputs generated by PAD-Servo. The black curves represent the corresponding ground-truth historical states measured from the devices, where η̇_1, η̇_2, η̇_3, η̇_4 are all from the four Dynamixel MX-28 motors.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant No.2022YFB3903800, Grant No.2021ZD0114503 and Grant No.2021YFB1714700. Jingtao Sun is supported by the China Scholarship Council under Grant No.202206130072.

References

  • [1] T. Zhu, R. Wu, J. Hang, X. Lin, and Y. Sun, “Toward human-like grasp: Functional grasp by dexterous robotic hand via object-hand semantic representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 4394–4408, 2023.
  • [2] B. Huang, J. Li, J. Chen, G. Wang, J. Zhao, and T. Xu, “Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild,” IEEE Trans. Pattern Anal. Mach. Intell., 2023.
  • [3] Z. Cao, Z. Huang, L. Pan, S. Zhang, Z. Liu, and C. Fu, “Towards real-world visual tracking with temporal contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 12, pp. 15834–15849, 2023.
  • [4] Y. Ma, J. He, D. Yang, T. Zhang, and F. Wu, “Adaptive part mining for robust visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 11443–11457, 2023.
  • [5] C. Wang, R. Martín-Martín, D. Xu, J. Lv, C. Lu, L. Fei-Fei, S. Savarese, and Y. Zhu, “6-pack: Category-level 6d pose tracker with anchor-based keypoints,” in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 10 059–10 066.
  • [6] Y. Weng, H. Wang, Q. Zhou, Y. Qin, Y. Duan, Q. Fan, B. Chen, H. Su, and L. J. Guibas, “Captra: Category-level pose tracking for rigid and articulated objects from point clouds,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13 209–13 218.
  • [7] J. Sun, Y. Wang, M. Feng, D. Wang, J. Zhao, C. Stachniss, and X. Chen, “Ick-track: A category-level 6-dof pose tracker using inter-frame consistent keypoints for aerial manipulation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2022, pp. 1556–1563.
  • [8] Y. Lin, J. Tremblay, S. Tyree, P. A. Vela, and S. Birchfield, “Keypoint-based category-level object pose tracking from an rgb sequence with uncertainty estimation,” in Proc. IEEE Int. Conf. Robot. Automat., 2022, pp. 1258–1264.
  • [9] S. Yu, D.-H. Zhai, Y. Xia, D. Li, and S. Zhao, “Cattrack: Single-stage category-level 6d object pose tracking via convolution and vision transformer,” IEEE Trans. Multimedia, 2023.
  • [10] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2642–2651.
  • [11] D. Chen, J. Li, Z. Wang, and K. Xu, “Learning canonical shape space for category-level 6d object pose and size estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11 973–11 982.
  • [12] M. Tian, M. H. Ang, and G. H. Lee, “Shape prior deformation for categorical 6d object pose and size estimation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 530–546.
  • [13] K. Chen and Q. Dou, “Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2773–2782.
  • [14] Y. Chen, L. Lan, X. Liu, G. Zeng, C. Shang, Z. Miao, H. Wang, Y. Wang, and Q. Shen, “Adaptive stiffness visual servoing for unmanned aerial manipulators with prescribed performance,” IEEE Trans. Ind. Electron., 2024.
  • [15] G. He, Y. Jangir, J. Geng, M. Mousaei, D. Bai, and S. Scherer, “Image-based visual servo control for aerial manipulation using a fully-actuated uav,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2023, pp. 5042–5049.
  • [16] O. A. Hay, M. Chehadeh, A. Ayyad, M. Wahbah, M. A. Humais, I. Boiko, L. Seneviratne, and Y. Zweiri, “Noise-tolerant identification and tuning approach using deep neural networks for visual servoing applications,” IEEE Trans. Robot., 2023.
  • [17] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” arXiv preprint arXiv:1711.00199, 2017.
  • [18] B. Wen, C. Mitash, B. Ren, and K. E. Bekris, “se (3)-tracknet: Data-driven 6d pose tracking by calibrating image residuals in synthetic domains,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2020, pp. 10 367–10 373.
  • [19] Y. Ze and X. Wang, “Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset,” Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 27 469–27 483, 2022.
  • [20] X. Xue, Y. Li, X. Yin, C. Shang, T. Peng, and Q. Shen, “Semantic-aware real-time correlation tracking framework for uav videos,” IEEE Trans. Cybern., vol. 52, no. 4, pp. 2418–2429, 2020.
  • [21] Z. Huang, C. Fu, Y. Li, F. Lin, and P. Lu, “Learning aberrance repressed correlation filters for real-time uav tracking,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 2891–2900.
  • [22] J. Ye, C. Fu, F. Lin, F. Ding, S. An, and G. Lu, “Multi-regularized correlation filter for uav tracking and self-localization,” IEEE Trans. Ind. Electron., vol. 69, no. 6, pp. 6004–6014, 2021.
  • [23] Y. Li, C. Fu, F. Ding, Z. Huang, and G. Lu, “Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regularization,” in Proc. IEEE Int. Conf. Comput. Vis., 2020, pp. 11 923–11 932.
  • [24] Z. Cao, Z. Huang, L. Pan, S. Zhang, Z. Liu, and C. Fu, “Tctrack: Temporal contexts for aerial tracking,” in Proc. IEEE Int. Conf. Comput. Vis., 2022, pp. 14 778–14 788.
  • [25] T. Cao, W. Zhang, Y. Fu, S. Zheng, F. Luo, and C. Xiao, “Dgecn++: A depth-guided edge convolutional network for end-to-end 6d pose estimation via attention mechanism,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [26] H. Jiang, Z. Dang, S. Gu, J. Xie, M. Salzmann, and J. Yang, “Center-based decoupled point-cloud registration for 6d object pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 3427–3437.
  • [27] H. Zhao, S. Wei, D. Shi, W. Tan, Z. Li, Y. Ren, X. Wei, Y. Yang, and S. Pu, “Learning symmetry-aware geometry correspondences for 6d object pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 14 045–14 054.
  • [28] G. Zhou, N. Gothoskar, L. Wang, J. B. Tenenbaum, D. Gutfreund, M. Lázaro-Gredilla, D. George, and V. K. Mansinghka, “3d neural embedding likelihood: Probabilistic inverse graphics for robust 6d pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 21 625–21 636.
  • [29] R. Chen, I. Liu, E. Yang, J. Tao, X. Zhang, Q. Ran, Z. Liu, J. Xu, and H. Su, “Activezero++: Mixed domain learning stereo and confidence-based depth completion with zero annotation,” IEEE Trans. Pattern Anal. Mach. Intell., 2023.
  • [30] G. Wang, F. Manhardt, X. Liu, X. Ji, and F. Tombari, “Occlusion-aware self-supervised monocular 6d object pose estimation,” IEEE Trans. Pattern Anal. Mach. Intell., 2021.
  • [31] D. Wang, G. Zhou, Y. Yan, H. Chen, and Q. Chen, “Geopose: Dense reconstruction guided 6d object pose estimation with geometric consistency,” IEEE Trans. Multimedia, vol. 24, pp. 4394–4408, 2021.
  • [32] I. Shugurov, S. Zakharov, and S. Ilic, “Dpodv2: Dense correspondence-based 6 dof pose estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7417–7435, 2021.
  • [33] L. Zou, Z. Huang, N. Gu, and G. Wang, “Gpt-cope: A graph-guided point transformer for category-level object pose estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, pp. 6907–6921, 2023.
  • [34] L. Zou, Z. Huang, N. Gu, and G. Wang, “6d-vit: Category-level 6d object pose estimation via transformer-based instance representation learning,” IEEE Trans. Image Processing, vol. 31, pp. 6907–6921, 2022.
  • [35] J. Lin, Z. Wei, Y. Zhang, and K. Jia, “Vi-net: Boosting category-level 6d object pose estimation via learning decoupled rotations on the spherical representations,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 14 001–14 011.
  • [36] J. Lin, Z. Wei, Z. Li, S. Xu, K. Jia, and Y. Li, “Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 3560–3569.
  • [37] W. Chen, X. Jia, H. J. Chang, J. Duan, L. Shen, and A. Leonardis, “Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1581–1590.
  • [38] L. Liu, H. Xue, W. Xu, H. Fu, and C. Lu, “Toward real-world category-level articulation pose estimation,” IEEE Trans. Image Processing, vol. 31, pp. 1072–1083, 2022.
  • [39] H. Lin, Z. Liu, C. Cheang, Y. Fu, G. Guo, and X. Xue, “Sar-net: Shape alignment and recovery network for category-level 6d object pose and size estimation,” in IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 6707–6717.
  • [40] R. Wang, X. Wang, T. Li, R. Yang, M. Wan, and W. Liu, “Query6dof: Learning sparse queries as implicit shape prior for category-level 6dof pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 14 055–14 064.
  • [41] S. Yu, D.-H. Zhai, Y. Guan, and Y. Xia, “Category-level 6-d object pose estimation with shape deformation for robotic grasp detection,” IEEE Trans. Neural Netw. Learn. Syst., 2023.
  • [42] T. Lee, B.-U. Lee, I. Shin, J. Choe, U. Shin, I. S. Kweon, and K.-J. Yoon, “Uda-cope: unsupervised domain adaptation for category-level object pose estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 14 891–14 900.
  • [43] T. Lee, J. Tremblay, V. Blukis, B. Wen, B.-U. Lee, I. Shin, S. Birchfield, I. S. Kweon, and K.-J. Yoon, “Tta-cope: Test-time adaptation for category-level object pose estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 21 285–21 295.
  • [44] L. Zheng, C. Wang, Y. Sun, E. Dasgupta, H. Chen, A. Leonardis, W. Zhang, and H. J. Chang, “Hs-pose: Hybrid scope feature extraction for category-level object pose estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 17 163–17 173.
  • [45] Z. Liu, Q. Wang, D. Liu, and J. Tan, “Pa-pose: Partial point cloud fusion based on reliable alignment for 6d pose tracking,” Pattern Recognit., p. 110151, 2023.
  • [46] L. Wang, S. Yan, J. Zhen, Y. Liu, M. Zhang, G. Zhang, and X. Zhou, “Deep active contours for real-time 6-dof object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 14 034–14 044.
  • [47] B. Wen, J. Tremblay, V. Blukis, S. Tyree, T. Müller, A. Evans, D. Fox, J. Kautz, and S. Birchfield, “Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 606–617.
  • [48] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox, “Poserbpf: A rao–blackwellized particle filter for 6-d object pose tracking,” IEEE Trans. Robot., vol. 37, no. 5, pp. 1328–1342, 2021.
  • [49] A. Santamaria-Navarro, P. Grosch, V. Lippiello, J. Solà, and J. Andrade-Cetto, “Uncalibrated visual servo for unmanned aerial manipulation,” IEEE/ASME Trans. Mechatron., vol. 22, no. 4, pp. 1610–1621, 2017.
  • [50] S. Kim, H. Seo, S. Choi, and H. J. Kim, “Vision-guided aerial manipulation using a multirotor with a robotic arm,” IEEE/ASME Trans. Mechatron., vol. 21, no. 4, pp. 1912–1923, 2016.
  • [51] C. Gabellieri, Y. S. Sarkisov, A. Coelho, L. Pallottino, K. Kondak, and M. J. Kim, “Compliance control of a cable-suspended aerial manipulator using hierarchical control framework,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2020, pp. 7196–7202.
  • [52] G. Zhang, Y. He, B. Dai, F. Gu, J. Han, and G. Liu, “Robust control of an aerial manipulator based on a variable inertia parameters model,” IEEE Trans. Ind. Electron., vol. 67, no. 11, pp. 9515–9525, 2019.
  • [53] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3343–3352.
  • [54] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.
  • [55] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis et al., “Efficient geometry-aware 3d generative adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16 123–16 133.
  • [56] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8922–8931.
  • [57] M. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 14 254–14 265, 2020.
  • [58] E. Malis, F. Chaumette, and S. Boudet, “2 1/2 d visual servoing,” IEEE Trans. Robot. Auto., vol. 15, no. 2, pp. 238–250, 1999.
  • [59] J. Wang, K. Chen, and Q. Dou, “Category-level 6d object pose estimation via cascaded relation and recurrent reconstruction networks,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2021, pp. 4807–4814.
  • [60] M. Z. Irshad, T. Kollar, M. Laskey, K. Stone, and Z. Kira, “Centersnap: Single-shot multi-object 3d shape reconstruction and categorical 6d pose and size estimation,” in Proc. IEEE Int. Conf. Robot. Automat., 2022, pp. 10 632–10 640.
  • [61] M. Z. Irshad, S. Zakharov, R. Ambrus, T. Kollar, Z. Kira, and A. Gaidon, “Shapo: Implicit representations for multi-object shape, appearance, and pose optimization,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 275–292.
  • [62] J. Liu, Y. Chen, X. Ye, and X. Qi, “Prior-free category-level pose estimation with implicit space transformation,” arXiv preprint arXiv:2303.13479, 2023.
  • [63] Y. Di, R. Zhang, Z. Lou, F. Manhardt, X. Ji, N. Navab, and F. Tombari, “Gpv-pose: Category-level object pose estimation via geometry-guided point-wise voting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 6781–6791.
  • [64] M. Runz, M. Buffier, and L. Agapito, “Maskfusion: Real-time recognition, tracking and reconstruction of multiple moving objects,” in IEEE ISMAR, 2018, pp. 10–20.
  • [65] J. Issac, M. Wüthrich, C. G. Cifuentes, J. Bohg, S. Trimpe, and S. Schaal, “Depth-based object tracking using a robust gaussian filter,” in Proc. IEEE Int. Conf. Robot. Automat., 2016, pp. 608–615.
  • [66] M. Wüthrich, P. Pastor, M. Kalakrishnan, J. Bohg, and S. Schaal, “Probabilistic object tracking using a range camera,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2013, pp. 3195–3202.
  • [67] H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Trans. Robot., vol. 37, no. 2, pp. 314–333, 2020.
  • [68] B. Wen and K. Bekris, “Bundletrack: 6d pose tracking for novel objects without instance or category-level 3d models,” in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2021, pp. 8067–8074.
  • [69] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[Uncaptioned image] Jingtao Sun received the BE and MSc degrees in College of Electrical and Information Engineering from Hunan University, Changsha, China. He is currently working toward the PhD degree with the college of Electrical and Information Engineering from Hunan University. His research interests include 3D computer vision, robotics and multi-modal.
[Uncaptioned image] Yaonan Wang received the BE degree in computer engineering from East China University of Science and Technology, Fuzhou, China, in 1981 and the MSc and PhD degrees in control engineering from Hunan University, Changsha, China, in 1990 and 1994, respectively. He was a Postdoctoral Research Fellow with the National University of Defense Technology, Changsha, from 1994 to 1995, a Senior Humboldt Fellow in Germany from 1998 to 2000, and a Visiting Professor with the University of Bremen, Bremen, Germany, from 2001 to 2004. He has been a Professor with Hunan University since 1995. His research interests include robotics and computer vision.
[Uncaptioned image] Danwei Wang (Life Fellow, IEEE) received the BE degree from the South China University of Technology, China, in 1982, and the MSc and PhD degrees from the University of Michigan, Ann Arbor, MI, USA, in 1984 and 1989, respectively. He was the Head of the Division of Control and Instrumentation, Nanyang Technological University (NTU), Singapore, from 2005 to 2011, the Director of the Centre for System Intelligence and Efficiency, NTU, from 2013 to 2016, and the Director of the ST Engineering-NTU Corporate Laboratory, NTU, from 2015 to 2021. He is currently a Professor with the School of Electrical and Electronic Engineering, NTU. His research interests include robotics, robotics vision, and applications.