Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset

Zixun Huang    Keling Yao    Seth Z. Zhao    Chuanyu Pan    Chenfeng Xu    Kathy Zhuang    Tianjian Xu    Weiyu Feng    Allen Y. Yang
Abstract

Robust 6DoF pose estimation with mobile devices is the foundation for applications in robotics, augmented reality, and digital twin localization. In this paper, we extensively investigate the robustness of existing RGBD-based 6DoF pose estimation methods against varying levels of depth sensor noise. We highlight that existing 6DoF pose estimation methods suffer significant performance discrepancies due to depth measurement inaccuracies. In response to this robustness issue, we present a simple and effective transformer-based 6DoF pose estimation approach called DTTDNet, featuring a novel geometric feature filtering module and a Chamfer distance loss for training. Moreover, we advance the field of robust 6DoF pose estimation and introduce a new dataset, Digital Twin Tracking Dataset Mobile (DTTD-Mobile), tailored for digital twin object tracking with noisy depth data from the mobile RGBD sensor suite of the Apple iPhone 14 Pro. Extensive experiments demonstrate that DTTDNet outperforms state-of-the-art methods by at least 4.32 and up to 60.74 points in ADD metrics on DTTD-Mobile. More importantly, our approach exhibits superior robustness to varying levels of measurement noise, setting a new benchmark for robustness to measurement noise.


1 Introduction

Six-degrees-of-freedom (6DoF) object pose estimation aims at determining the position and orientation of an object in 3D space. In contrast to the more mature technology of camera tracking in static settings, known as visual odometry or simultaneous localization and mapping (Campos et al., 2021; Schöps et al., 2019; Permozer & Orehovački, 2019; Keselman et al., 2017), identifying the relative position and orientation of one or more objects with respect to the user’s ego position is a core function that would ensure the quality of user experience in applications like augmented reality (AR). In the most general setting, each object may undergo independent rigid-body motion with respect to the ego position, and the combined effect of overlaying multiple objects in the scene may also cause parts of the objects to be occluded from the measurement of the ego position. In this paper, we study the 6DoF pose estimation problem under the most general motion, occlusion, color, and lighting conditions, with a particular focus on improving the accuracy and robustness of algorithms under novel sensor data properties. The dataset and proposed model are made publicly available as open-source code at: https://github.com/augcog/DTTD2.

Figure 1: Left: Shadow plot of the relation between the depth noise (depth-ADD) and the inference error (ADD) of considered state-of-the-art methods and proposed DTTDNet. Right: Visualization of pose estimation results of baseline methods and proposed DTTDNet.

Recent advancements in 6DoF pose estimation have primarily been driven by deep neural network (DNN) approaches that advocate end-to-end training to carry out crucial tasks such as image semantic segmentation, object classification, and object pose estimation. Notable studies (Wang et al., 2019; He et al., 2020, 2021c; Jiang et al., 2022; Mo et al., 2022) have demonstrated the effectiveness of these pose estimation algorithms on established real-world 6DoF pose estimation datasets (Hinterstoisser et al., 2013; Marion et al., 2018; Xiang et al., 2018; Liu et al., 2020; Hodaň et al., 2017; Liu et al., 2021). However, these datasets primarily focus on robotic grasping tasks, and applying these solutions to environments served by mobile devices introduces a fresh set of challenges. A previous work (Feng et al., 2023) first studied this gap in the context of 6DoF pose estimation and replicated real-world digital-twin scenarios with varying capturing distances, lighting conditions, and object occlusions. However, that dataset was collected using a Microsoft Azure Kinect, which may not be the most suitable camera platform for studying 3D localization in realistic mobile environments.

Alternatively, Apple has emerged as a strong proponent of utilizing RGB-D spatial sensors for mobile AR applications with the design of its iPhone Pro camera suite, such as on the latest iPhone 14 Pro model. This smartphone has been given a back-facing LiDAR depth sensor (Ilci & Toth, 2020; Li et al., 2016; Bu et al., 2019; Gu et al., 2021; You et al., 2019; Weng & Kitani, 2019), a critical component for achieving accurate and detailed 3D perception and spatial understanding. However, one distinguishing drawback of the iPhone LiDAR depth is the low resolution of the depth map provided by the iPhone ARKit (Permozer & Orehovački, 2019): a $256 \times 192$ resolution, compared to the $1280 \times 720$ depth map provided by the Microsoft Azure Kinect. This low resolution is exacerbated by large errors in the retrieved depth map. These errors also make it challenging to develop a pose estimator that relies heavily on the observed depth map and still predicts object poses correctly, a problem that has not been specifically addressed in previous works (Wang et al., 2019; Sun et al., 2022; Gu et al., 2021; You et al., 2019; Weng & Kitani, 2019; Mo et al., 2022).

To investigate the 6DoF pose estimation problem under the most popular mobile depth sensor, namely, the Apple iPhone 14 Pro LiDAR, we propose an RGBD-based transformer model for 6DoF object pose estimation, which is designed to effectively handle inaccurate depth measurements and noise. As shown in Fig. 1, our method shows robustness against noisy depth input, while other baselines failed in such conditions. Meanwhile, we introduce DTTD-Mobile, a novel RGB-D dataset captured by iPhone 14 Pro, to bridge the gap of digital-twin pose estimation with mobile devices, allowing research into extending algorithms to iPhone data and analyzing the unique nature of iPhone depth sensors. Our contributions are summarized into three parts:

  1. We propose a new transformer-based 6DoF pose estimator with depth-robust designs on modality fusion and training strategies, called DTTDNet. The new solution outperforms other state-of-the-art methods by a large margin, especially in noisy depth conditions.

  2. We introduce DTTD-Mobile as a novel digital-twin pose estimation dataset captured with mobile devices. We provide in-depth LiDAR depth analysis and evaluation metrics to illustrate the unique properties and complexities of mobile LiDAR data.

  3. We conduct extensive experiments and ablation studies to demonstrate the efficacy of DTTDNet and shed light on how the depth-robustifying module works.

Figure 2: Model Architecture Overview. DTTDNet pipeline starts with segmented depth maps and cropped RGB images. The point cloud from the depth map and RGB colors are encoded and integrated point-wise. Extracted features are then fed into an attention-based two-stage fusion. Finally, the pose predictor produces point-wise predictions with both rotation and translation.

2 Methods

In this section, we elaborate on the details of our method. The objective is to estimate the 3D location and orientation of a known object in camera coordinates from RGBD images. This pose can be represented as a homogeneous transformation matrix $p \in SE(3)$, which consists of a rotation matrix $R \in SO(3)$ and a translation vector $t \in \mathbb{R}^{3}$, $p = [R|t]$. Section 2.1 describes our transformer-based model architecture. Section 2.2 introduces two depth-robustifying modules for depth feature extraction, dedicated to geometric feature reconstruction and filtering. Section 2.3 presents our modality fusion design, which lets the model disregard significantly noisy depth features. Finally, Section 2.4 describes our final learning objective.

2.1 Architecture Overview

Fig. 2 illustrates the overall architecture of the proposed DTTDNet. The DTTDNet pipeline takes segmented depth maps and cropped RGB images as input. It then obtains feature embeddings for the RGB and depth inputs through separate encoders: a CNN on the cropped RGB image and a point-cloud encoder on the point cloud reconstructed from the corresponding cropped depth image. (Footnote 1: We preprocessed the RGB and depth images to guarantee pixel-level correspondence between the RGB image and the depth image. The preprocessing is detailed in Section 3.) Inspired by PSPNet (Zhao et al., 2017), the image embedding network comprises a ResNet-18 encoder followed by 4 up-sampling layers acting as the decoder. It translates an image of size $H \times W \times 3$ into an $H \times W \times d_{rgb}$ embedding space. For depth feature extraction, we take segmented depth pixels and transform them into 3D point clouds using the camera intrinsics.
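The back-projection from depth pixels to a point cloud can be sketched as follows. This is a minimal illustration under a standard pinhole model without lens distortion; the function name `backproject`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy values are assumptions made for this example and are not taken from the DTTDNet code.

```python
def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to 3D camera-frame points (pinhole model).

    depth: 2D list of depth values in metres (0 means "no reading").
    mask:  2D list of booleans, True where the pixel belongs to the object.
    fx, fy, cx, cy: pinhole intrinsics in pixels (hypothetical values below).
    """
    points = []
    for v, row in enumerate(depth):        # v is the row index (image y)
        for u, z in enumerate(row):        # u is the column index (image x)
            if mask[v][u] and z > 0:
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Toy 2x2 depth map: one missing reading and one pixel outside the mask.
depth = [[0.0, 1.0],
         [2.0, 0.0]]
mask = [[True, True],
        [True, False]]
cloud = backproject(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```

Only pixels that are inside the segmentation mask and carry a positive depth reading contribute a point, so the length of the resulting point list varies with the size of the segmented region.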

The 3D point clouds are initially processed using an auto-encoder inspired by PointNet (Qi et al., 2017). The PointNet-style encoding step aims to capture geometric representations in a latent space $\mathbb{R}^{d_1}$. In this context, the encoder component produces two sets of features: early-stage point-wise features in $\mathbb{R}^{N \times d_2}$ and global geometric features in $\mathbb{R}^{d_3}$. Subsequently, we add a decoder that is guided by a reference point set $P$ to generate the predicted point cloud $\hat{P}$. Features extracted from the encoder are then combined with the learned representations to create a new feature sequence in $\mathbb{R}^{N \times d_{geo}}$, where $d_{geo} = d_1 + d_2 + d_3$.

This results in a sequence of geometric tokens with a length equal to the number of points $N$. Extracted RGB and depth features are then fed into a two-stage attention-based fusion block, which consists of modality fusion and point-wise fusion. Finally, the pose predictor produces point-wise predictions with both rotation and translation. The final 6DoF pose estimate is then selected from these predictions by voting based on unsupervised confidence scores.

2.2 Design for Robustifying Depth Data

In this section, we will introduce two modules (Fig. 3), Chamfer Distance Loss (CDL) and Geometric Feature Filtering (GFF), that enable the point-cloud encoder in DTTDNet to handle noisy and low-resolution LiDAR data robustly.

Chamfer Distance Loss (CDL). Past methods either treated the depth information directly as image channels (Mo et al., 2022) or extracted features directly from a point cloud (Wang et al., 2019). These methods underestimated the corruption of the depth data caused by noise and error during the data collection process. To address this, we first introduce a downstream task for point-cloud reconstruction and utilize the Chamfer distance as a loss function to assist our feature embedding in filtering out noise. The Chamfer distance loss (CDL) is widely used for denoising 3D point clouds (Hermosilla et al., 2019; Duan et al., 2019), and it is defined between two point clouds $P \in \mathbb{R}^{N \times 3}$ and $\hat{P} \in \mathbb{R}^{N \times 3}$ as follows:

$$L_{CD}(\hat{P}, P) = \frac{1}{N}\left(\sum_{\hat{x}_i \in \hat{P}} \min_{x_j \in P} \left\|\hat{x}_i - x_j\right\|_2^2 + \sum_{x_i \in P} \min_{\hat{x}_j \in \hat{P}} \left\|x_i - \hat{x}_j\right\|_2^2\right) \tag{1}$$

where $\hat{P}$ denotes the decoded point set from the embedding, and $P$ denotes the reference point set employed to guide the decoder’s learning. For the reference point set, we use the point cloud sampled from the corresponding object CAD models, which are used only in the training process.
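As a worked illustration of Eq. (1), the following brute-force sketch computes the symmetric Chamfer distance between a decoded point set and a reference point set. The function name `chamfer_distance` and the toy point sets are hypothetical; the quadratic-time nearest-neighbour search is used only for clarity, not as a statement about how the loss is implemented in DTTDNet.

```python
def chamfer_distance(pred, ref):
    """Symmetric squared Chamfer distance of Eq. (1), normalized by N = len(ref).

    pred, ref: lists of (x, y, z) tuples, assumed to hold N points each.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Each decoded point is matched to its nearest reference point ...
    pred_to_ref = sum(min(sq_dist(p, r) for r in ref) for p in pred)
    # ... and each reference point to its nearest decoded point.
    ref_to_pred = sum(min(sq_dist(r, p) for p in pred) for r in ref)
    return (pred_to_ref + ref_to_pred) / len(ref)


ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]      # stand-in for CAD model samples
noisy = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0)]    # one point displaced along z
print(chamfer_distance(ref, ref))
print(chamfer_distance(noisy, ref))
```

A perfect reconstruction gives zero, and displacing one point by 0.5 along z contributes 0.25 to each direction of the sum.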

Geometric Feature Filtering (GFF). Because the noise in iPhone LiDAR data is non-Gaussian (Fig. 5), as should be assumed for most depth camera data, normal estimators might either be perturbed by such noisy features or infer wrong camera-object rotations. To deal with this sensor-level error, we integrate a Geometric Feature Filtering (GFF) module before the modality fusion module. Our approach incorporates the Fast Fourier Transform (FFT) into the geometric feature encoding. Specifically, the GFF module consists of an FFT, a subsequent single-layer MLP, and finally an inverse FFT. The FFT transforms the input sequence of geometric signals into the frequency domain, where significant features can be selected from the noisy input signals. The result is a more refined geometric embedding that is resilient to the non-Gaussian iPhone LiDAR noise.
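The transform, reweight, and inverse-transform pattern behind GFF can be sketched on a single feature channel. This is a simplified illustration rather than the GFF module itself: the learned single-layer MLP is replaced by a fixed per-frequency gain vector (`gains`, a hypothetical stand-in), and a naive discrete Fourier transform is used in place of an FFT routine so that the example has no dependencies.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, x in enumerate(signal))
        out.append(acc / n if inverse else acc)
    return out


def filter_channel(channel, gains):
    """Forward transform, per-frequency scaling, inverse transform.

    `gains` stands in for the learned layer that GFF applies in the
    frequency domain; here it is fixed by hand (assumption).
    """
    spectrum = dft(channel)
    weighted = [g * s for g, s in zip(gains, spectrum)]
    return [z.real for z in dft(weighted, inverse=True)]


# One feature channel over N = 8 point tokens: a constant signal of 1.0
# plus an alternating perturbation of amplitude 0.5.
channel = [1.0 + 0.5 * (-1) ** t for t in range(8)]
gains = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]  # suppress frequency bin 4
filtered = filter_channel(channel, gains)
print([round(v, 6) for v in filtered])
```

In this toy case the alternating perturbation occupies a single frequency bin, so zeroing that bin recovers the constant signal. In the actual module the frequency-domain weights are learned rather than chosen by hand.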

Figure 3: Chamfer Distance Loss (CDL) and Geometric Feature Filtering (GFF).

2.3 Attention-based RGBD Fusion

Previous papers have emphasized the importance of modality fusion (He et al., 2021c; Wang et al., 2019) and the benefits of gathering nearest points from the point cloud (Mo et al., 2022; He et al., 2021c) in RGBD-based pose estimation tasks. While the feature extractor widens each point’s receptive field, we aim for features to interact beyond their corresponding points (Wang et al., 2019) or neighboring points (He et al., 2021c). For example, when predicting the 6DoF pose of a cuboid from multiple feature descriptors, we want to attend to the various corner points rather than solely to those in close proximity to each other. To this end, inspired by recent transformer-based models used for modality fusion (Dosovitskiy et al., 2021; He et al., 2021a; Paul & Chen, 2021; Li et al., 2023b; Radford et al., 2021; Kim et al., 2021; Wang et al., 2022), we leverage the self-attention mechanism (Vaswani et al., 2023) to amplify and integrate important features while disregarding the significant LiDAR noise. Specifically, our fusion is divided into two stages: modality fusion and point-wise fusion (Fig. 2). Both fusion modules consist of a standard transformer encoder with linear projection, multi-head attention, and layer norm. The former module takes the embeddings from the single-modal encoders and feeds them into a transformer encoder in parallel for cross-modal fusion. The latter module relies on similarity scores among points: it merges all feature embeddings in a point-wise manner before feeding them into a transformer encoder. The detailed design and visual analysis of the two fusion stages are described in our supplemental materials.
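To make the role of self-attention concrete, the sketch below implements generic single-head scaled dot-product self-attention over a short token sequence. It is not the DTTDNet fusion module, whose design is given in the supplemental materials: the query, key, and value projections are taken to be the identity, and the function name `self_attention` and the toy tokens are assumptions for illustration.

```python
import math


def self_attention(tokens):
    """Single-head scaled dot-product self-attention, identity projections.

    tokens: list of equal-length feature vectors, one per point.
    Returns one fused vector per token, each a weighted mix of all tokens.
    """
    d = len(tokens[0])
    fused = []
    for q in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append([sum(w * v[j] for w, v in zip(weights, tokens))
                      for j in range(d)])
    return fused


# Two identical tokens and one dissimilar token.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fused = self_attention(tokens)
print(fused[0])
```

Every output token draws on all input tokens, with weights set by feature similarity rather than spatial proximity, which is the property the paragraph above relies on.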

2.4 Learning Objective

Based on the overall network structure, our learning objective is to perform 6DoF pose regression, which measures the disparity between points sampled on the object’s model in its ground truth pose and corresponding points on the same model transformed by the predicted pose. Specifically, the pose estimation loss is defined as:

$$(L_{ADD})_{i,p} = \frac{1}{m}\sum_{x \in M} \left\|(Rx + t) - (\hat{R}_i x + \hat{t}_i)\right\| \tag{2}$$

where $M \in \mathbb{R}^{m \times 3}$ represents the randomly sampled point set from the object’s 3D model, $p = [R|t]$ denotes the ground truth pose, and $\hat{p}_i = [\hat{R}_i|\hat{t}_i]$ denotes the predicted pose generated from the fused feature of the $i^{th}$ point. Our objective is to minimize the average of the losses over the fusion points, which can be expressed as $L_{ADD} = \frac{1}{N}\sum_{i=1}^{N}(L_{ADD})_{i,p}$, where $N$ is the number of randomly sampled points (the token sequence length in the point-wise fusion stage).

Meanwhile, we introduce a confidence regularization score ($c_i$) along with each prediction $\hat{p}_i = [\hat{R}_i|\hat{t}_i]$, which denotes the confidence in the prediction for each fusion point:

$$L_{ADD} = \frac{1}{N}\sum_{i=1}^{N}\left(c_i (L_{ADD})_{i,p} - w \log(c_i)\right) \tag{3}$$

Predictions with low confidence will lead to a low ADD loss, but this will be balanced by a high penalty from the second term, which is weighted by the hyper-parameter $w$. Finally, the CDL loss outlined in Section 2.2 is trained jointly throughout the training process, which gives our final learning objective:

$$L = L_{ADD} + \lambda L_{CD} \tag{4}$$

where $\lambda$ denotes the weight of the Chamfer distance loss.
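The interplay between the confidence term of Eq. (3) and the combined objective of Eq. (4) can be checked numerically with a short sketch. The function names and the values chosen for $w$, $\lambda$, the per-point losses, and the Chamfer term are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def confidence_weighted_loss(point_losses, confidences, w):
    """Eq. (3): mean over points of c_i * loss_i - w * log(c_i).

    point_losses: per-point pose losses (L_ADD)_{i,p} from Eq. (2).
    confidences:  per-point confidence scores c_i, assumed to be > 0.
    """
    terms = [c * loss - w * math.log(c)
             for loss, c in zip(point_losses, confidences)]
    return sum(terms) / len(terms)


def total_loss(point_losses, confidences, chamfer, w, lam):
    """Eq. (4): L = L_ADD + lambda * L_CD."""
    return confidence_weighted_loss(point_losses, confidences, w) + lam * chamfer


point_losses = [0.02, 0.10]   # the second point has a poor pose prediction
certain = confidence_weighted_loss(point_losses, [1.0, 1.0], w=0.015)
hedged = confidence_weighted_loss(point_losses, [1.0, 0.2], w=0.015)
print(certain, hedged)
print(total_loss(point_losses, [1.0, 1.0], chamfer=0.25, w=0.015, lam=0.5))
```

Lowering the confidence of the poorly predicted point reduces the overall loss here, because the saving in the first term outweighs the $-w \log(c_i)$ penalty. For a fixed per-point loss, the term $c_i (L_{ADD})_{i,p} - w \log(c_i)$ is minimized at $c_i = w / (L_{ADD})_{i,p}$, so larger errors are assigned lower confidence.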

Table 1: Features and statistics of different datasets.

| Dataset | Modality | iPhone camera | Texture | Occlusion | Light variation | # of frames | # of scenes | # of objects | # of annotations |
|---|---|---|---|---|---|---|---|---|---|
| StereoOBJ-1M (Liu et al., 2021) | RGB | × | ✓ | ✓ | ✓ | 393,612 | 182 | 18 | 1,508,327 |
| LINEMOD (Hinterstoisser et al., 2013) | RGBD | × | ✓ | ✓ | × | 18,000 | 15 | 15 | 15,784 |
| YCB-Video (Xiang et al., 2018) | RGBD | × | ✓ | ✓ | × | 133,936 | 92 | 21 | 613,917 |
| DTTD (Feng et al., 2023) | RGBD | × | ✓ | ✓ | ✓ | 55,691 | 103 | 10 | 136,226 |
| TOD (Liu et al., 2020) | RGBD | × | ✓ | × | × | 64,000 | 10 | 20 | 64,000 |
| LabelFusion (Marion et al., 2018) | RGBD | × | ✓ | ✓ | ✓ | 352,000 | 138 | 12 | 1,000,000 |
| T-LESS (Hodaň et al., 2017) | RGBD | × | × | ✓ | × | 47,762 | - | 30 | 47,762 |
| DTTD-Mobile (Ours) | RGBD | ✓ | ✓ | ✓ | ✓ | 47,668 | 100 | 18 | 114,143 |

3 Dataset Description

DTTD-Mobile contains 18 rigid objects along with their textured 3D models. The data are generated from 100 scenes, each of which features one or more of the objects in various orientations and degrees of occlusion. Following (Feng et al., 2023), our data generation pipeline consists of a professional OptiTrack motion capture system, which captures the camera pose throughout the scene, and Apple’s ARKit framework (https://developer.apple.com/documentation/arkit/), which captures RGB images from the iPhone camera and depth information from the LiDAR scanner, as illustrated in Fig. 4. After obtaining these data, we use the open-source data annotation pipeline provided by (Feng et al., 2023) to annotate ground-truth object poses. Through this pipeline, the dataset offers ground-truth labels for 3D object poses and per-pixel semantic segmentation. Additionally, it provides detailed camera specifications, pinhole camera projection matrices, and distortion coefficients. Detailed features and statistics are presented in Table 1. Because the DTTD-Mobile dataset includes multiple sets of geometrically similar objects that differ only in their color textures, it poses challenges to existing digital-twin localization solutions. To ensure compatibility with other existing datasets, some of the collected objects partially overlap with the YCB-Video (Xiang et al., 2018) and DTTD (Feng et al., 2023) datasets. Specific details on data acquisition, benchmarking, and evaluation are presented in the supplementary materials.

Figure 4: Left: Setup of our data acquisition pipeline. Right: 3D models of the 18 objects in DTTD-Mobile.
Figure 5: Visualization of an iPhone LiDAR depth scene that shows distortion and long-tail non-Gaussian noise (highlighted inside the red box). (a) Front view. (b) Left view. (c) Right view.

4 iPhone LiDAR data analysis

Compared to dedicated depth cameras such as the Microsoft Azure Kinect or Intel RealSense, the iPhone 14 Pro LiDAR exhibits more noise and a lower resolution, with $256 \times 192$ depth maps, which leads to high magnitudes of distortion on objects’ surfaces. Additionally, it introduces long-tail noise on the projection edges of objects when performing interpolation operations between RGB and depth features. Fig. 5 shows one such example of the iPhone 14 Pro’s noisy depth data.

To quantitatively assess the depth noise of each object from the iPhone’s LiDAR, we analyze the numerical difference between the LiDAR-measured depth map, which is acquired directly from the iPhone LiDAR, and the reference depth map, which is derived from the ground truth pose annotations. Specifically, to obtain the reference depth map, we use the ground truth annotated object poses to render the depth projection of each object. We then apply the segmentation mask associated with each object to filter out depth readings that might be compromised by occlusion. To measure the difference between the LiDAR-measured and reference depth maps, we introduce the depth-ADD metric, which calculates the average pixel-wise L1 distance between the LiDAR-measured depth map and the reference depth map in each frame. The depth-ADD value of each object at frame $n$ is calculated as follows:

$$\mathrm{depth{-}ADD}_{n} = \frac{1}{d}\sum_{i \in D}\left|\mathrm{depth_{LiDAR}}_{i} - \mathrm{depth_{ref}}_{i}\right|, \tag{5}$$

where $i$ indexes the pixels of the LiDAR depth map, and $\mathrm{depth_{LiDAR}}_{i}$ and $\mathrm{depth_{ref}}_{i}$ are the depth values at pixel $i$ in the LiDAR depth map and in the reference depth map, respectively. The set $D$ encompasses all indices $i$ under an object’s segmentation mask where both $\mathrm{depth_{LiDAR}}_{i}$ and $\mathrm{depth_{ref}}_{i}$ yield values greater than zero, and $d$ is the number of indices in $D$. The final depth-ADD value of each object is the average of this measurement across all $N$ frames:

$$\mathrm{depth{-}ADD} = \frac{1}{N}\sum_{n=1}^{N}\mathrm{depth{-}ADD}_{n} \tag{6}$$
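The depth-ADD computation of Eqs. (5) and (6) can be sketched as below. The function names and the toy depth maps are hypothetical. Skipping frames in which no pixel is valid is an assumption of this sketch, since the text does not say how such frames are handled.

```python
def depth_add_frame(lidar, ref, mask):
    """Eq. (5): mean absolute depth error over the valid masked pixels.

    lidar, ref: 2D lists of depth values in metres for one frame.
    mask:       2D list of booleans for the object's segmentation mask.
    Returns None when no pixel is valid (assumption of this sketch).
    """
    errors = [abs(l - r)
              for lrow, rrow, mrow in zip(lidar, ref, mask)
              for l, r, m in zip(lrow, rrow, mrow)
              if m and l > 0 and r > 0]
    return sum(errors) / len(errors) if errors else None


def depth_add(frames):
    """Eq. (6): average of the per-frame values over frames with valid pixels."""
    values = [v for v in (depth_add_frame(*f) for f in frames) if v is not None]
    return sum(values) / len(values)


lidar = [[1.0, 1.5],
         [0.0, 2.0]]          # 0.0 marks a missing LiDAR reading
ref = [[1.0, 1.0],
       [1.0, 2.5]]            # rendered from the annotated object pose
mask = [[True, False],
        [True, True]]
frame = (lidar, ref, mask)
print(depth_add_frame(*frame))
print(depth_add([frame, frame]))
```

Only two pixels are valid in this frame (one with zero error and one with an error of 0.5), so the per-frame value is 0.25.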

Table 2 lists the average depth-ADD error of each object in its second column. Greater depth-ADD values indicate increased distortion and the presence of long-tail noise in the depth data. Our analysis indicates that the mean depth-ADD across all objects is around 0.25 m. It is worth noting that the depth quality varies significantly across objects and could be affected by outliers. For example, three objects (black_marker, blue_marker, and pink_marker) exhibit greater errors than the other objects. A detailed depth analysis is reported in the supplementary materials.

5 Experiments

Table 2: Comparison with diverse 6DoF pose estimation baselines on the DTTD-Mobile dataset. We report AUC results of ADD-S and ADD on all 18 objects; higher is better. Compared with the four baselines considered, our model significantly improves the accuracy on most objects. Note that the depth-ADD column indicates the per-object depth-ADD error.
Methods: DenseFusion (Wang et al., 2019), MegaPose-RGBD (Labbé et al., 2022), ES6D* (Mo et al., 2022), BundleSDF (Wen et al., 2023a), and DTTDNet (Ours).

| Object | Average depth-ADD (m) | DenseFusion ADD AUC | DenseFusion ADD-S AUC | MegaPose-RGBD ADD AUC | MegaPose-RGBD ADD-S AUC | ES6D* ADD AUC | ES6D* ADD-S AUC | BundleSDF ADD AUC | BundleSDF ADD-S AUC | DTTDNet ADD AUC | DTTDNet ADD-S AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mac_cheese | 0.184 | 88.10 | 93.17 | 78.98 | 87.94 | 28.29 | 57.06 | 89.95 | 94.84 | 94.06 | 97.02 |
| tomato_can | 0.222 | 69.10 | 93.42 | 68.85 | 84.48 | 19.07 | 56.17 | 79.62 | 93.65 | 74.23 | 94.01 |
| tuna_can | 0.278 | 42.90 | 79.94 | 8.90 | 22.11 | 10.74 | 26.86 | 25.05 | 37.94 | 62.98 | 87.05 |
| cereal_box | 0.151 | 75.20 | 88.12 | 59.89 | 71.53 | 10.09 | 53.92 | 0.00 | 0.00 | 86.55 | 92.74 |
| clam_can | 0.157 | 90.49 | 96.32 | 74.11 | 90.45 | 17.75 | 35.92 | 75.22 | 96.05 | 88.15 | 96.92 |
| spam | 0.286 | 53.29 | 91.14 | 72.35 | 86.16 | 3.17 | 13.74 | 89.24 | 95.12 | 52.81 | 90.83 |
| cheez-it_box | 0.152 | 82.73 | 92.10 | 89.18 | 94.83 | 7.81 | 37.14 | 42.46 | 51.69 | 87.03 | 93.91 |
| mustard | 0.184 | 78.41 | 91.31 | 76.08 | 85.38 | 21.89 | 52.56 | 84.03 | 92.99 | 84.06 | 92.15 |
| pop-tarts_box | 0.139 | 82.94 | 92.58 | 44.36 | 58.97 | 3.44 | 35.26 | 82.24 | 92.01 | 84.55 | 92.65 |
| black_marker | 0.769 | 32.22 | 38.72 | 17.38 | 34.15 | 2.12 | 3.72 | 0.00 | 0.00 | 44.08 | 53.50 |
| blue_marker | 0.370 | 66.06 | 74.80 | 6.87 | 12.46 | 16.88 | 41.46 | 0.00 | 0.00 | 50.88 | 61.69 |
| pink_marker | 0.410 | 56.46 | 67.86 | 47.84 | 58.59 | 1.59 | 7.01 | 0.00 | 0.00 | 64.18 | 73.00 |
| green_tea | 0.265 | 64.37 | 93.10 | 48.43 | 70.50 | 8.80 | 32.86 | 60.24 | 87.29 | 64.59 | 92.31 |
| apple | 0.119 | 68.97 | 91.13 | 32.85 | 76.43 | 31.65 | 58.46 | 79.27 | 91.78 | 82.45 | 94.80 |
| pear | 0.085 | 65.66 | 91.31 | 35.80 | 56.73 | 16.93 | 32.57 | 80.12 | 93.72 | 47.83 | 88.11 |
| pink_pocky | 0.231 | 50.64 | 67.17 | 8.69 | 18.25 | 0.77 | 1.93 | 2.31 | 6.25 | 61.40 | 82.33 |
| red_pocky | 0.245 | 88.14 | 93.76 | 76.49 | 84.56 | 25.32 | 51.16 | 77.08 | 91.26 | 90.00 | 95.24 |
| white_pocky | 0.265 | 89.55 | 94.27 | 42.83 | 54.65 | 17.19 | 47.45 | 24.93 | 26.71 | 90.83 | 94.70 |
| Average | 0.239 | 69.67 | 85.88 | 49.02 | 62.44 | 13.25 | 37.38 | 46.86 | 55.74 | 73.99 | 88.10 |
Figure 6: Qualitative evaluation of different methods. To further validate our approach, we provide visual evidence of our model’s effectiveness in challenging occlusion scenarios and varying lighting conditions, where other models’ predictions fail but ours remain reliable. It should be noted that BundleSDF(Wen et al., 2023a) fails to reconstruct 3D objects in some scenes, resulting in the absence of annotations for such objects.

5.1 Evaluation Metrics

We evaluate baselines with the average distance metrics ADD and ADD-S according to previous protocols (Xiang et al., 2018; Feng et al., 2023). Suppose $R$ and $t$ are the ground truth rotation and translation, and $\tilde{R}$ and $\tilde{t}$ are the predicted counterparts. The ADD metric computes the mean of the pairwise distances between the 3D model points transformed by the ground truth pose $(R, t)$ and by the predicted pose $(\tilde{R}, \tilde{t})$:

ADD=1mxM(Rx+t)(R~x+t~),ADD1𝑚subscript𝑥𝑀norm𝑅𝑥𝑡~𝑅𝑥~𝑡\mathrm{ADD}=\frac{1}{m}\sum_{x\in M}\|(Rx+t)-(\tilde{R}x+\tilde{t})\|,roman_ADD = divide start_ARG 1 end_ARG start_ARG italic_m end_ARG ∑ start_POSTSUBSCRIPT italic_x ∈ italic_M end_POSTSUBSCRIPT ∥ ( italic_R italic_x + italic_t ) - ( over~ start_ARG italic_R end_ARG italic_x + over~ start_ARG italic_t end_ARG ) ∥ , (7)

where M𝑀Mitalic_M denotes the point set sampled from the object’s 3D model and x𝑥xitalic_x denotes the point sampled from M𝑀Mitalic_M.

The ADD-S metric is designed for symmetric objects when the matching between points could be ambiguous:

$$\mathrm{ADD{-}S}=\frac{1}{m}\sum_{x_{1}\in M}\min_{x_{2}\in M}\|(Rx_{1}+t)-(\tilde{R}x_{2}+\tilde{t})\|. \tag{8}$$
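The two metrics can be illustrated with a minimal NumPy sketch. The function names, the brute-force nearest-neighbor search, and the toy cube are assumptions made for illustration only; they are not taken from the released code.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (m, 3) array of model points."""
    return points @ R.T + t

def add_metric(points, R, t, R_pred, t_pred):
    """ADD (Eq. 7): mean distance between corresponding transformed points."""
    gt = transform(points, R, t)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(points, R, t, R_pred, t_pred):
    """ADD-S (Eq. 8): mean distance from each ground-truth point to its
    closest predicted point (brute force, O(m^2) memory)."""
    gt = transform(points, R, t)
    pred = transform(points, R_pred, t_pred)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

# Hypothetical toy object: the 8 corners of a 10 cm cube centred at the origin.
cube = 0.05 * np.array(
    [[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float
)
R_gt, t_gt = np.eye(3), np.zeros(3)
# Predicted pose rotated by 90 degrees about the z-axis.
R_hat = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_hat = np.zeros(3)

add_err = add_metric(cube, R_gt, t_gt, R_hat, t_hat)
add_s_err = add_s_metric(cube, R_gt, t_gt, R_hat, t_hat)
```

In this toy case the rotation maps the corner set onto itself, so ADD-S is zero while ADD is 0.1 m. This is the kind of point-matching ambiguity that ADD-S is meant to absorb for symmetric objects.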

Following previous protocols (Xiang et al., 2018; Wang et al., 2019; Liu et al., 2021, 2023; Labbé et al., 2022), a 3D pose estimate is deemed accurate if its average distance error falls below a predefined threshold. We employ two widely used metrics, namely ADD/ADD-S AUC and ADD/ADD-S(1cm). For the commonly used ADD/ADD-S AUC, we calculate the area under the curve (AUC) of the success-threshold curve over different distance thresholds, where the threshold values are normalized between 0 and 1. ADD/ADD-S(1cm), on the other hand, is defined as the percentage of pose predictions whose error is smaller than a 1 cm threshold.
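As a rough sketch of how these two summary numbers can be computed from per-frame ADD or ADD-S errors (in metres): the maximum threshold of 0.1 m, the number of integration steps, and the function names below are assumptions, since the text only states that thresholds are normalized to [0, 1].

```python
import numpy as np

def accuracy_at(errors, threshold):
    """Fraction of poses whose error (in metres) is below the threshold."""
    return float((np.asarray(errors, dtype=float) < threshold).mean())

def auc(errors, max_threshold=0.1, steps=1000):
    """Area under the accuracy-vs-threshold curve, with thresholds
    normalised to [0, 1] by dividing by max_threshold (assumed 0.1 m)."""
    thresholds = np.linspace(0.0, max_threshold, steps + 1)
    acc = np.array([accuracy_at(errors, th) for th in thresholds])
    x = thresholds / max_threshold
    # Trapezoidal rule over the normalised threshold axis.
    return float(np.sum((acc[1:] + acc[:-1]) * np.diff(x)) / 2.0)

# Hypothetical per-frame errors: 5 mm, 2 cm, and 20 cm.
errors = np.array([0.005, 0.02, 0.2])
acc_1cm = accuracy_at(errors, 0.01)   # share of frames under 1 cm
auc_score = auc(errors)               # value in [0, 1]
```

Multiplying either value by 100 gives a percentage on the same scale as the scores reported in the tables.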

Table 3: Comparison between 2 datasets with diverse 6DoF pose estimation baselines. We evaluate the results as the ADD-S AUC on the overlapping objects between the YCB video dataset and the DTTD-Mobile (DTTD-M) dataset, higher is better.
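| Object | depth-ADD, YCB | depth-ADD, DTTD-M | DenseFusion (Wang et al., 2019), YCB | DenseFusion (Wang et al., 2019), DTTD-M | MegaPose-RGBD (Labbé et al., 2022), YCB | MegaPose-RGBD (Labbé et al., 2022), DTTD-M | ES6D (Mo et al., 2022), YCB | ES6D (Mo et al., 2022), DTTD-M | BundleSDF (Wen et al., 2023a), YCB | BundleSDF (Wen et al., 2023a), DTTD-M | DTTDNet (Ours), YCB | DTTDNet (Ours), DTTD-M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomato_can | 0.011 | 0.222 | 93.70 | 93.42 | 86.11 | 84.48 | 89.02 | 56.17 | 68.27 | 93.65 | 96.69 | 94.01 |
| mustard_bottle | 0.005 | 0.184 | 95.90 | 91.31 | 87.41 | 85.38 | 93.13 | 52.56 | 98.21 | 92.99 | 97.39 | 92.15 |
| tuna_can | 0.013 | 0.278 | 94.90 | 79.94 | 91.03 | 22.11 | 74.86 | 26.86 | 91.11 | 37.94 | 95.78 | 87.05 |
| average | 0.009 | 0.228 | 94.83 | 88.22 | 88.18 | 63.99 | 85.67 | 45.20 | 85.86 | 74.86 | 96.62 | 91.07 |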

5.2 Experimental Results

In this section, we compare the performance of our method DTTDNet with four other 6DoF pose estimators, namely BundleSDF (Wen et al., 2023a), MegaPose (Labbé et al., 2022), ES6D (Mo et al., 2022), and DenseFusion (Wang et al., 2019). While all four methods leverage multimodal data from both RGB and depth sources, they differ in how much emphasis they place on the depth data processing module. Quantitative experimental results are shown in Table 2. Qualitative examples are shown in Fig. 6.

BundleSDF (Wen et al., 2023a) learns a multi-view consistent shape and appearance of a 3D object using an object-centric neural signed distance field, leveraging frame poses captured in-flight. However, the approach struggles with 3D object reconstruction when faced with large distances, low resolution, or insufficient frame sequences per viewpoint. BundleSDF achieves an ADD AUC of 46.86 and an ADD-S AUC of 55.74. It failed to reconstruct 8 out of 26 object-scene combinations in DTTD-Mobile. As shown in Fig. 6, a failure in 3D object reconstruction results in the absence of object pose tracking in the corresponding scenes. As shown in Figure 1, BundleSDF excludes objects in scenes lacking associated pose tracking.

MegaPose (Labbé et al., 2022) employs a coarse-to-fine process for pose estimation. The initial “coarse” module leverages both RGB and depth data to identify the most probable pose hypothesis. Subsequently, a more precise pose is inferred through the “render-and-compare” technique. Because the method disregards noise in the depth data, the effectiveness of its coarse module can be impaired, which in turn leads to failure in the refinement process. Even with the assistance of a refiner, MegaPose-RGBD (Labbé et al., 2022) only attains an ADD AUC of 49.02 and an ADD-S AUC of 62.44. Its susceptibility to depth noise falls between that of DenseFusion (Wang et al., 2019) and that of ES6D (Mo et al., 2022).

DenseFusion(Wang et al., 2019) treats both modalities equally and lacks a specific design for the depth module, whereas ES6D(Mo et al., 2022) heavily relies on depth data during training, using grouped primitives to prevent point-pair mismatch. However, due to potential interpolation errors in the depth data, this additional supervision can introduce erroneous signals to the estimator, resulting in inferior performance compared to DenseFusion(Wang et al., 2019). DenseFusion(Wang et al., 2019) achieves 69.67 ADD AUC and 85.88 ADD-S AUC, whereas ES6D(Mo et al., 2022) only achieves 13.25 ADD AUC and 37.38 ADD-S AUC.

In contrast, our approach harnesses the strengths of both RGB and depth modalities while explicitly designing robust depth feature extraction and selection. In comparison with the above baselines, our method achieves 73.31 ADD AUC and 87.82 ADD-S AUC, surpassing the strongest baseline, DenseFusion (Wang et al., 2019), by 3.64 and 1.94 points in terms of ADD AUC and ADD-S AUC, respectively.

Figure 7: Shadow plot of the relation between the depth noise (depth-ADD) and the inference error (ADD) of considered state-of-the-art methods and proposed DTTDNet.

5.3 Ablation Studies

In this section, we further delve into a detailed analysis of our own model, highlighting the utility of our depth robustifying module in handling challenging scenarios with significant LiDAR noise.

Evaluation of DTTDNet on other datasets. To show that our proposed pose estimator also generalizes to other domains, we evaluate our method on the YCB-Video dataset (Xiang et al., 2018), where it outperforms MegaPose (Labbé et al., 2022) and ES6D (Mo et al., 2022) by 6.87 and 7.90 points on ADD(S), respectively, under a fair comparison (i.e., without any data pre-cleaning or iterative refinement).

Three objects overlap between the YCB-Video dataset and our DTTD-Mobile dataset. We report the models’ performance on these objects to examine the performance drop of each baseline when shifting between the datasets, as shown in Table 3. DTTDNet achieves the highest average performance on both datasets and shows the smallest performance drop when encountering iPhone LiDAR noise. Detailed per-object evaluation is appended in the supplementary materials.
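The performance drop can be read directly off the “average” row of Table 3. The short script below simply subtracts the DTTD-Mobile score from the YCB score for each method; the variable names are illustrative.

```python
# Average ADD-S AUC from the "average" row of Table 3: (YCB, DTTD-M).
avg_add_s_auc = {
    "DenseFusion": (94.83, 88.22),
    "MegaPose-RGBD": (88.18, 63.99),
    "ES6D": (85.67, 45.20),
    "BundleSDF": (85.86, 74.86),
    "DTTDNet": (96.62, 91.07),
}

# Drop in points when moving from YCB depth to iPhone LiDAR depth.
drops = {name: round(ycb - dttd, 2) for name, (ycb, dttd) in avg_add_s_auc.items()}

for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name:14s} drop = {drop:5.2f}")

smallest_drop = min(drops, key=drops.get)
```

The resulting drops are 5.55 points for DTTDNet, 6.61 for DenseFusion, 11.00 for BundleSDF, 24.19 for MegaPose-RGBD, and 40.47 for ES6D.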

Robustness to LiDAR Depth Error. To answer the question of whether our method exhibits robustness in the presence of significant LiDAR sensor noise when compared to other approaches, we further assess the depth-ADD metric, as discussed in Section 4, on DTTDNet versus the four baseline algorithms. Fig. 7 illustrates the correlation between the model performance (ADD) of four methods and the quality of depth information (depth-ADD) across various scenes, frames, and 1239 pose prediction outcomes for the 18 objects. Our approach ensures a stable pose prediction performance, even when the depth quality deteriorates, maintaining consistently low levels of ADD error overall.

Figure 8: Probability Distribution of Reduced Geometric Features. Left: before the GFF module. Right: after the GFF module.
Table 4: Effect of Depth Feature Filtering. M8P4 denotes our model with a fusion stage consisting of 8-layer modality fusion and 4-layer point-wise fusion modules. This table shows the improvement of M8P4 with further incorporation of geometric feature filtering (GFF).
| Methods | ADD AUC | ADD-S AUC | ADD(1cm) | ADD-S(1cm) |
| --- | --- | --- | --- | --- |
| M8P4 | 72.03 | 86.44 | 19.86 | 70.50 |
| + GFF | 73.31 | 87.82 | 24.35 | 66.16 |
Table 5: Effect of Object Geometry Augmented CDL. This table depicts the enhancement in model performance when switching the reference point set from being reliant on the depth map to being augmented by the object model.
| Methods | CDL supervised by | ADD AUC | ADD-S AUC | ADD(1cm) | ADD-S(1cm) |
| --- | --- | --- | --- | --- | --- |
| M8P4+GFF | LiDAR depth | 73.31 | 87.82 | 24.35 | 66.16 |
| M8P4+GFF | CAD model | 73.99 | 88.10 | 25.85 | 67.75 |
Table 6: Effect of Layer Number in 2 Fusion Stages. It shows DTTDNet with different layer number combinations in the fusion stages, with one ∘ denoting one layer. For all combinations, CDL is used in the geometric feature extraction stage.
| Modality fusion layers | Point-wise fusion layers | ADD AUC | ADD-S AUC | ADD(1cm) | ADD-S(1cm) |
| --- | --- | --- | --- | --- | --- |
| ∘∘ | ∘ | 70.73 | 85.42 | 22.91 | 67.75 |
| ∘∘ | ∘∘ | 71.37 | 86.69 | 15.74 | 64.44 |
| ∘∘ | ∘∘∘∘ | 72.06 | 86.37 | 19.57 | 68.37 |
| ∘∘ | ∘ | 70.73 | 85.42 | 22.91 | 67.75 |
| ∘∘∘∘ | ∘∘ | 71.76 | 88.23 | 20.01 | 69.03 |
| ∘∘∘∘∘∘∘∘ | ∘∘∘∘ | 72.03 | 86.44 | 19.86 | 70.50 |

Effect of Depth Feature Filtering module. Table 4 illustrates the improvement in ADD AUC metrics achieved by our method when integrating the geometric feature filtering (GFF) module. To provide detailed insight into the impact of the GFF module, we conducted principal component analysis (PCA) on both the initial geometric tokens encoded by the PointNet and the filtered version after applying the GFF module, i.e., we projected each embedding onto its dominant principal component to obtain a 1-D array. We visualize the geometric embedding before and after the application of the GFF module by generating histograms of the dimensionally reduced geometric tokens, as shown in Fig. 8. The distribution of these tokens, as shown in the right subplot, becomes more balanced and uniform after learning-based filtering through the GFF module. The enhanced ADD AUC performance can be attributed to the balanced distribution achieved through the use of the depth robustifying module.
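A minimal sketch of this kind of 1-D PCA projection and histogram is given below. The random tokens are placeholder data, and the token shape (1000 tokens of dimension 64), the bin count, and the function name are all assumptions; the real inputs would be the geometric tokens before and after the GFF module.

```python
import numpy as np

def project_to_first_pc(tokens):
    """Project (n, d) tokens onto their first principal component."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

# Placeholder tokens standing in for PointNet embeddings (hypothetical shape).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1000, 64))
tokens[:, 0] *= 10.0  # give one direction a dominant variance

scores = project_to_first_pc(tokens)

# Normalised histogram of the 1-D scores, i.e. an empirical density.
counts, edges = np.histogram(scores, bins=50, density=True)
```

Running the same projection on the tokens before and after filtering, and comparing the two histograms, gives a before/after view in the spirit of Fig. 8.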

Effect of Geometry Augmented CDL. As a baseline, we replaced the reference point set for CDL with measured LiDAR depth data, to demonstrate the effectiveness of our design choice of supervising with the point cloud sampled from 3D object models. Table 5 compares the performance of our approach under these two reference point choices. Our design choice achieves higher ADD AUC and ADD-S AUC, as well as higher performance on the more stringent ADD/ADD-S(1cm) metrics.
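To make the role of the reference point set concrete, the sketch below computes a symmetric Chamfer distance between a predicted point set and two candidate references. This is one common form of the Chamfer distance and is an assumption here; the exact CDL formulation is defined elsewhere in the paper, and the point sets and noise levels are synthetic placeholders.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

rng = np.random.default_rng(0)
# Hypothetical reference sampled from an object model (10 cm cube volume).
model_points = rng.uniform(-0.05, 0.05, size=(256, 3))
# Hypothetical network output that is close to the true geometry.
predicted = model_points + rng.normal(scale=0.001, size=model_points.shape)
# Hypothetical depth-derived reference with 1 cm measurement noise.
noisy_depth = model_points + rng.normal(scale=0.01, size=model_points.shape)

cd_to_model = chamfer_distance(predicted, model_points)
cd_to_depth = chamfer_distance(predicted, noisy_depth)
```

In this synthetic setting a geometrically accurate prediction is still penalized when the reference itself is noisy, which is the intuition behind supervising with model-sampled points rather than measured depth.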

Effect of Layer Number Variation in Fusion Stages. Table 6 displays the variations brought about by increasing the number of layers at different fusion stages. Overall, increasing the number of layers improves the model’s performance in terms of ADD AUC. As we proportionally increase the total number of layers in the modality fusion or point-wise fusion stage, we observe a sustained improvement in ADD AUC.

6 Conclusion

We have presented DTTDNet, a novel digital-twin localization algorithm that bridges the performance gap for 3D object tracking in mobile environments with critical accuracy requirements. At the algorithm level, DTTDNet is a transformer-based 6DoF pose estimator, specifically designed to navigate the complexities introduced by noisy depth data. At the experiment level, we introduced a new RGBD dataset captured using the iPhone 14 Pro, expanding our approach to iPhone sensor data. Through extensive experiments and ablation analysis, we have examined the effectiveness of our method in being robust to erroneous depth data. Additionally, our research has brought to light new complexities associated with object tracking in dynamic AR environments.

References

  • Baruch et al. (2021) Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021.
  • Besl & McKay (1992) Besl, P. J. and McKay, N. D. Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, volume 1611, pp.  586–606. Spie, 1992.
  • Bu et al. (2019) Bu, F., Le, T., Du, X., Vasudevan, R., and Johnson-Roberson, M. Pedestrian planar lidar pose (pplp) network for oriented pedestrian detection based on planar lidar and monocular images. IEEE Robotics and Automation Letters, 5(2):1626–1633, 2019.
  • Cai et al. (2022) Cai, D., Heikkilä, J., and Rahtu, E. Ove6d: Object viewpoint encoding for depth-based 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6803–6813, June 2022.
  • Calli et al. (2015a) Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., and Dollar, A. M. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pp.  510–517, 2015a. doi: 10.1109/ICAR.2015.7251504.
  • Calli et al. (2015b) Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., and Dollar, A. M. Benchmarking in manipulation research: Using the yale-cmu-berkeley object and model set. IEEE Robotics & Automation Magazine, 22(3):36–52, 2015b. doi: 10.1109/MRA.2015.2448951.
  • Calli et al. (2017) Calli, B., Singh, A., Bruce, J., Walsman, A., Konolige, K., Srinivasa, S., Abbeel, P., and Dollar, A. M. Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research, 36(3):261–268, 2017. doi: 10.1177/0278364917700714. URL https://doi.org/10.1177/0278364917700714.
  • Campos et al. (2021) Campos, C., Elvira, R., Gómez, J. J., Montiel, J. M. M., and Tardós, J. D. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Duan et al. (2019) Duan, C., Chen, S., and Kovacevic, J. 3d point cloud denoising via deep neural network based local surface estimation, 2019.
  • Feng et al. (2023) Feng, W., Zhao, S. Z., Pan, C., Chang, A., Chen, Y., Wang, Z., and Yang, A. Y. Digital twin tracking dataset (dttd): A new rgb+depth 3d dataset for longer-range object tracking applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.  3288–3297, June 2023.
  • Gu et al. (2021) Gu, B., Liu, J., Xiong, H., Li, T., and Pan, Y. Ecpc-icp: A 6d vehicle pose estimation method by fusing the roadside lidar point cloud and road feature. Sensors, 21(10):3489, 2021.
  • Guo et al. (2023) Guo, Y., Stutz, D., and Schiele, B. Robustifying token attention for vision transformers, 2023.
  • Haggag et al. (2013) Haggag, H., Hossny, M., Filippidis, D., Creighton, D. C., Nahavandi, S., and Puri, V. Measuring depth accuracy in rgbd cameras. 2013, 7th International Conference on Signal Processing and Communication Systems (ICSPCS), pp.  1–7, 2013.
  • He et al. (2021a) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners, 2021a.
  • He et al. (2021b) He, L., Zhu, H., Li, F., Bai, H., Cong, R., Zhang, C., Lin, C., Liu, M., and Zhao, Y. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. arXiv preprint arXiv:2104.06174, 2021b.
  • He et al. (2020) He, Y., Sun, W., Huang, H., Liu, J., Fan, H., and Sun, J. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • He et al. (2021c) He, Y., Huang, H., Fan, H., Chen, Q., and Sun, J. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021c.
  • He et al. (2022) He, Y., Wang, Y., Fan, H., Sun, J., and Chen, Q. Fs6d: Few-shot 6d pose estimation of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  6814–6824, June 2022.
  • Hermosilla et al. (2019) Hermosilla, P., Ritschel, T., and Ropinski, T. Total denoising: Unsupervised learning of 3d point cloud cleaning. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  52–60, 2019.
  • Hinterstoisser et al. (2013) Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., and Navab, N. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Lee, K. M., Matsushita, Y., Rehg, J. M., and Hu, Z. (eds.), Computer Vision – ACCV 2012, pp.  548–562, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-37331-2.
  • Hodaň et al. (2017) Hodaň, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., and Zabulis, X. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
  • Ilci & Toth (2020) Ilci, V. and Toth, C. High definition 3d map creation using gnss/imu/lidar sensor integration to support autonomous vehicle navigation. Sensors, 20(3):899, 2020.
  • Jiang et al. (2022) Jiang, X., Li, D., Chen, H., Zheng, Y., Zhao, R., and Wu, L. Uni6d: A unified cnn framework without projection breakdown for 6d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  11174–11184, June 2022.
  • Keselman et al. (2017) Keselman, L., Woodfill, J. I., Grunnet-Jepsen, A., and Bhowmik, A. Intel realsense stereoscopic depth cameras, 2017.
  • Kim et al. (2021) Kim, W., Son, B., and Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision, 2021.
  • Labbe et al. (2020) Labbe, Y., Carpentier, J., Aubry, M., and Sivic, J. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Labbé et al. (2022) Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D., and Sivic, J. Megapose: 6d pose estimation of novel objects via render compare, 2022.
  • Li et al. (2016) Li, B., Zhang, T., and Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv preprint arXiv:1608.07916, 2016.
  • Li et al. (2023a) Li, K., Bian, J.-W., Castle, R., Torr, P. H., and Prisacariu, V. A. Mobilebrick: Building lego for 3d reconstruction on mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4892–4901, 2023a.
  • Li et al. (2023b) Li, T., Foo, L. G., Hu, P., Shang, X., Rahmani, H., Yuan, Z., and Liu, J. Token boosting for robust self-supervised visual transformer pre-training, 2023b.
  • Liu et al. (2020) Liu, X., Jonschkowski, R., Angelova, A., and Konolige, K. Keypose: Multi-view 3d labeling and keypoint estimation for transparent objects. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020.
  • Liu et al. (2021) Liu, X., Iwase, S., and Kitani, K. M. Stereobj-1m: Large-scale stereo image dataset for 6d object pose estimation. In ICCV, 2021.
  • Liu et al. (2023) Liu, Y., Wen, Y., Peng, S., Lin, C., Long, X., Komura, T., and Wang, W. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images, 2023.
  • Marion et al. (2018) Marion, P., Florence, P. R., Manuelli, L., and Tedrake, R. Label fusion: A pipeline for generating ground truth labels for real rgbd data of cluttered scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  3325–3242. IEEE, 2018.
  • Mo et al. (2022) Mo, N., Gan, W., Yokoya, N., and Chen, S. Es6d: A computation efficient and symmetry-aware 6d pose regression framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6718–6727, 2022.
  • Nguyen et al. (2022) Nguyen, V. N., Hu, Y., Xiao, Y., Salzmann, M., and Lepetit, V. Templates for 3d object pose estimation revisited: Generalization to new objects and robustness to occlusions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6771–6780, 2022.
  • Paul & Chen (2021) Paul, S. and Chen, P.-Y. Vision transformers are robust learners, 2021.
  • Peng et al. (2019) Peng, S., Liu, Y., Huang, Q., Zhou, X., and Bao, H. Pvnet: Pixel-wise voting network for 6dof pose estimation. In CVPR, 2019.
  • Permozer & Orehovački (2019) Permozer, I. and Orehovački, T. Utilizing apple’s arkit 2.0 for augmented reality application development. In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp.  1629–1634, 2019. doi: 10.23919/MIPRO.2019.8756928.
  • Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2017.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.
  • Schöps et al. (2019) Schöps, T., Sattler, T., and Pollefeys, M. Bad slam: Bundle adjusted direct rgb-d slam. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  134–144, 2019. doi: 10.1109/CVPR.2019.00022.
  • Sun et al. (2022) Sun, J., Wang, Z., Zhang, S., He, X., Zhao, H., Zhang, G., and Zhou, X. OnePose: One-shot object pose estimation without CAD models. CVPR, 2022.
  • Trockman & Kolter (2023) Trockman, A. and Kolter, J. Z. Mimetic initialization of self-attention layers, 2023.
  • Umeyama (1991) Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(04):376–380, 1991.
  • Vaswani et al. (2023) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023.
  • Wang et al. (2019) Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L., and Savarese, S. Densefusion: 6d object pose estimation by iterative dense fusion. 2019.
  • Wang et al. (2022) Wang, K., Zhao, S. Z., Chan, D., Zakhor, A., and Canny, J. Multimodal semantic mismatch detection in social media posts. In Proceedings of IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), 2022.
  • Wen et al. (2023a) Wen, B., Tremblay, J., Blukis, V., Tyree, S., Muller, T., Evans, A., Fox, D., Kautz, J., and Birchfield, S. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. CVPR, 2023a.
  • Wen et al. (2023b) Wen, B., Yang, W., Kautz, J., and Birchfield, S. Foundationpose: Unified 6d pose estimation and tracking of novel objects. arXiv preprint arXiv:2312.08344, 2023b.
  • Weng & Kitani (2019) Weng, X. and Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp.  0–0, 2019.
  • Xiang et al. (2018) Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. 2018.
  • You et al. (2019) You, Y., Wang, Y., Chao, W.-L., Garg, D., Pleiss, G., Hariharan, B., Campbell, M., and Weinberger, K. Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv preprint arXiv:1906.06310, 2019.
  • Zakharov et al. (2019) Zakharov, S., Shugurov, I., and Ilic, S. Dpod: 6d pose object detector and refiner. pp.  1941–1950, 10 2019. doi: 10.1109/ICCV.2019.00203.
  • Zhao et al. (2017) Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network, 2017.
  • Zhao et al. (2023) Zhao, H., Chen, J., Wang, L., and Lu, H. Arkittrack: A new diverse dataset for tracking using mobile rgb-d data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5126–5135, 2023.

Appendix A Related Work

6DoF Pose Estimation Algorithms. The majority of data-driven approaches for object pose estimation revolve around utilizing either RGB images (Labbe et al., 2020; Peng et al., 2019; Xiang et al., 2018; Zakharov et al., 2019) or RGBD images (He et al., 2021c, 2020; Wang et al., 2019; Mo et al., 2022; Jiang et al., 2022) as their input source. RGBD remains mainstream in industrial environments requiring higher precision. However, due to the high cost of accurate depth sensors, finding more robust solutions compatible with inexpensive and widely used sensors is a problem we hope to address.

Methods(He et al., 2021c, 2020; Wang et al., 2019; Mo et al., 2022; Jiang et al., 2022) that relied on depth maps advocated for the modality fusion of depth and RGB data to enhance their inference capabilities. To effectively fuse multi-modalities, Wang et al. (Wang et al., 2019) introduced a network architecture capable of extracting and integrating dense feature embedding from both RGB and depth sources. Due to its simplicity, this method achieved high efficiency in predicting object poses. In more recent works (He et al., 2020, 2021c, 2022), performance improvements were achieved through more sophisticated network architectures. For instance, He et al. (He et al., 2021c) proposed an enhanced bidirectional fusion network for key-point matching, resulting in high accuracy on benchmarks such as YCB-Video (Xiang et al., 2018) and LINEMOD (Hinterstoisser et al., 2013). However, these methods exhibited reduced efficiency due to the complex hybrid network structures and processing stages. Addressing symmetric objects, Mo et al. (Mo et al., 2022) proposed a symmetry-invariant pose distance metric to mitigate issues related to local minima. On the other hand, Jiang et al. (Jiang et al., 2022) proposed an L1-regularization loss named abc loss, which enhanced pose estimation accuracy for non-symmetric objects.

Besides the RGBD approach, studies following the RGB-only approach often rely on incorporating additional prior information and inductive biases during the inference process. These requirements impose additional constraints on the application of 3D object tracking on mobile devices. Their inference process can involve utilizing more viewpoints for similarity matching(Liu et al., 2023; Labbé et al., 2022) or geometry reconstruction(Sun et al., 2022; Wen et al., 2023b), employing rendering techniques(Labbé et al., 2022; Nguyen et al., 2022; Cai et al., 2022) based on precise 3D model or leveraging an additional database for viewpoint encoding retrieval(Cai et al., 2022). During the training phase, these approaches typically draw upon more extensive datasets, such as synthetic datasets, to facilitate effective generalization within open-set scenarios. However, when confronted with a limited set of data samples, their performance does not surpass that of closed-set algorithms in cases where there is a surplus of prior information available and depth map loss.

Figure 9: Sample visualizations of our dataset. First row: Annotations for 3D bounding boxes. Second row: Corresponding semantic segmentation labels. Third row: Zoomed-in LiDAR depth visualizations.

3D Object Tracking Datasets. Existing object pose estimation algorithms are predominantly tested on a limited set of real-world 3D object tracking datasets (Hinterstoisser et al., 2013; Marion et al., 2018; Xiang et al., 2018; Liu et al., 2020; Hodaň et al., 2017; Liu et al., 2021; Feng et al., 2023; Calli et al., 2015a, b, 2017), which often employ depth-from-stereo sensors or time-of-flight (ToF) sensors for data collection. Datasets like YCB-Video (Xiang et al., 2018), LINEMOD (Hinterstoisser et al., 2013), StereoOBJ-1M (Liu et al., 2021), and TOD (Liu et al., 2020) utilize depth-from-stereo sensors, while T-LESS (Hodaň et al., 2017) and DTTD (Feng et al., 2023) deploy ToF sensors, specifically the Microsoft Azure Kinect, to capture meter-scale RGBD data. However, cameras with depth-from-stereo sensors may not be an optimal platform for deploying AR software, because stereo depth may degrade rapidly at longer distances (Haggag et al., 2013) and may leave holes in the depth map when stereo matching fails. To address the limitations of existing datasets and to obtain a more realistic dataset captured with mobile devices, we opt to collect RGBD data using the iPhone 14 Pro.

iPhone-based Datasets for 3D Applications. Several datasets utilize the iPhone as their data collection device for 3D applications, such as ARKitScenes (Baruch et al., 2021), MobileBrick (Li et al., 2023a), ARKitTrack (Zhao et al., 2023), and RGBD Dataset (He et al., 2021b). These datasets were constructed to target applications ranging from 3D indoor scene reconstruction, 3D ground-truth annotation, and depth-map pairing from different sensors to RGBD tracking in both static and dynamic scenes. However, most of these datasets did not specifically target the task of 6DoF object pose estimation. Our dataset provides a distinct focus on this task, offering per-pixel segmentation and pose labels. This enables researchers to delve into the 3D localization of objects with a dataset specifically designed for this purpose. The most relevant work is OnePose (Sun et al., 2022), which provides an RGBD 3D dataset collected with an iPhone. However, their dataset did not provide 3D models for closed-set settings, and they utilized the automatic localization provided by ARKit for pose annotation, which introduces non-trivial error for high-accuracy 6DoF pose estimation. In contrast, we achieve higher localization accuracy by using the OptiTrack professional motion capture system to track the iPhone camera’s real-time position as it moves in 3D.

Appendix B More Dataset Description

B.1 Data Acquisition

Apple’s ARKit framework (https://developer.apple.com/documentation/arkit/) enables us to capture RGB images from the iPhone camera and scene depth information from the LiDAR scanner synchronously. We leverage ARKit APIs to retrieve $1920\times 1440$ RGB images and $256\times 192$ depth maps at a capturing rate of 30 frames per second. Despite the resolution difference, the captured RGB images and depth maps have the same aspect ratio and describe the same scene. Alongside each captured frame, DTTD-Mobile stores the camera intrinsic matrix and lens distortion coefficients, as well as a 2D confidence map describing how confident the iPhone depth sensor is about the captured depth at the pixel level. In practice, we disabled the auto-focus functionality of the iPhone camera during data collection to avoid drastic changes in the camera’s intrinsics between frames, and we resized the depth map to the RGB resolution using nearest neighbor interpolation to avoid depth map artifacts.
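The nearest neighbor resizing step can be sketched in NumPy as follows. The array layout (height first, i.e. a 192-by-256 array upsampled to 1440-by-1920), the floor-based index mapping, and the random placeholder depth values are assumptions; the actual pipeline may rely on a library routine with a slightly different pixel-centre convention.

```python
import numpy as np

def resize_nearest(depth, out_h, out_w):
    """Upsample a 2D depth map by copying the nearest source pixel."""
    in_h, in_w = depth.shape
    # Map every output row/column back to a source index (floor convention).
    rows = (np.arange(out_h) * in_h) // out_h
    cols = (np.arange(out_w) * in_w) // out_w
    return depth[rows[:, None], cols[None, :]]

# Placeholder LiDAR depth map in metres (hypothetical values).
rng = np.random.default_rng(0)
depth_lowres = rng.uniform(0.3, 5.0, size=(192, 256)).astype(np.float32)

# Resize to the RGB resolution (1920 x 1440, stored as rows x columns).
depth_fullres = resize_nearest(depth_lowres, 1440, 1920)
```

Because every output pixel is a copy of a measured value, no intermediate depths are synthesized across object boundaries, which is the kind of artifact that bilinear or bicubic interpolation would introduce.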

To track the iPhone’s 6DoF movement, we did not use the iPhone’s own world tracking SDK. Instead, we follow the same procedure as in (Feng et al., 2023) and use the professional OptiTrack motion capture system for higher accuracy. For label generation, we also use the open-sourced data annotation pipeline provided by (Feng et al., 2023) to annotate and refine ground-truth poses for objects in the scenes along with per-pixel semantic segmentation. Some visualizations of data samples are illustrated in Fig. 9. Notice that the scenes cover various real-world occlusion and lighting conditions with high-quality annotations. Following previous dataset protocols (Feng et al., 2023; Xiang et al., 2018), we also provide synthetic data for scene augmentations used for training.

The dataset also provides 3D models of the 18 objects as illustrated in the main paper. These models are reconstructed using the iOS Polycam app via access to the iPhone camera and LiDAR sensors. To enhance the models, Blender (https://www.blender.org/) is employed to repair surface holes and correct inaccurately scanned texture pixels.

B.2 Train/Test Split

DTTD-Mobile offers a suggested train/test partition as follows. The training set contains 8622 keyframes extracted from 88 video sequences, while the testing set contains 1239 keyframes from 12 video sequences. To ensure a representative distribution of scenes with occluded objects and varying lighting conditions, we randomly allocate them across both the training and testing sets. Furthermore, for training purposes of scene augmentations, we provide 20,000 synthetic images by randomly placing objects in scenes using the data synthesizer provided in (Feng et al., 2023).

Appendix C More Implementation Details

C.1 Details on RGBD Feature Fusion

Attention Mechanism. In both the modality fusion and point-wise fusion stages, scaled dot-product attention is used in the self-attention layers:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})_{i}=\sum_{j}\dfrac{\exp(\mathbf{q}_{i}^{T}\mathbf{k}_{j}/\sqrt{d_{\text{head}}})}{\sum_{k}\exp(\mathbf{q}_{i}^{T}\mathbf{k}_{k}/\sqrt{d_{\text{head}}})}\mathbf{v}_{j}, \tag{9}$$

where the query, key, value, and similarity score are denoted as $q$, $k$, $v$, and $s$. The two fusion stages differ in how tokens are prepared before the linear projection layer, which changes the information carried by the query, key, and value.
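As a concrete reading of Eq. (9), the following minimal sketch computes single-head scaled dot-product attention over plain Python lists. It is an illustration only: the function name `attention` and the toy inputs are assumptions, and the multi-head projections used in the actual model are omitted.

```python
import math


def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors.

    Q, K: N vectors of length d_head; V: N value vectors.
    Returns the N output vectors and the N x N attention weights.
    """
    d_head = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_head).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K]
        # Softmax; subtracting the max leaves the result unchanged but avoids overflow.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Output i is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[d] for w, v in zip(row, V)) for d in range(len(V[0]))])
    return outputs, weights


# Toy example with two tokens: each query matches its own key best.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = attention(Q, K, V)
```

Each row of `attn` sums to one, and a query places more weight on the key it is most similar to, which is the behavior the fusion stages rely on.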

The key idea in the first fusion stage is to perform local per-point fusion in a cross-modality manner so that we can make predictions based on each fused feature. Each key or query carries only one type of modal information before fusion, allowing different modalities to interact with each other on equal terms through dot-product operations. The attention exerts a stronger influence when the RGB and geometric representations are more similar.

In the second stage, where we integrate the two original single-modal features with the first-stage feature at each point, we calculate similarities solely among different points. The key idea is to encourage the attention layers to further capture potential relationships among multiple local features. A skip connection, implemented as a concatenation, is employed between the two fusion outputs so that we can make predictions based on per-point features generated in both the first and second stages.

Modality Fusion. The objective of this module is to combine the geometric embedding $g$ and the RGB embedding $c$ produced by the single-modal encoders in a cross-modal fashion. Drawing inspiration from ViLP (Kim et al., 2021), both types of embedding are linearly transformed into a token sequence ($\in\mathbb{R}^{N\times d_{emb}}$). Before entering the modality fusion module $E_{1}$, these features are combined along the sequence length direction, i.e., all feature embeddings are concatenated into a single combined sequence, where the dimension remains $d_{emb}$ and the sequence length becomes twice the original length.

$$f_{1}=E_{1}\left[c\oplus g\right]\in\mathbb{R}^{d_{f_{1}}\times 2N} \tag{10}$$

where the symbol $\oplus$ denotes concatenation along the row direction. The output is then reshaped into the sequence $f_{1}^{\prime}$ with length $N$ and dimension $2d_{f_{1}}$ to fit the point-wise transformer encoder in the next fusion stage. This step enables the model’s attention mechanism to effectively perform cross-modal fusion.

Point-Wise Fusion. The goal of this stage is to enhance the integration of information among different points. The primary advantage of our method over previous work (He et al., 2021c) is that our model can calculate similarity scores not only with the nearest point but also with all other points, allowing for more comprehensive interactions. To enable the point-wise fusion to effectively capture the similarities between different points, we merge the original RGB token sequence $c$ and the geometric token sequence $g$ together with the output embedding sequence $f_{1}^{\prime}$ from the modality fusion module along the feature dimension. The combined input sequence $\left[c^{T}\oplus g^{T}\oplus(f_{1}^{\prime})^{T}\right]^{T}\in\mathbb{R}^{(2d_{emb}+2d_{f_{1}})\times N}$ is then fed into the point-wise transformer encoder $E_{2}$ to obtain the final fusion:

$$f_{2}=E_{2}\left[c^{T}\oplus g^{T}\oplus(f_{1}^{\prime})^{T}\right]^{T}\in\mathbb{R}^{d_{f_{2}}\times N} \tag{11}$$
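To make the bookkeeping of sequence lengths and feature widths in Eqs. (10) and (11) easier to follow, here is a small shape-only sketch. It rests on several assumptions: the encoders $E_1$ and $E_2$ are replaced by a hypothetical identity placeholder (so $d_{f_1}=d_{emb}$ here), tokens are stored token-major as lists of feature vectors rather than in the feature-major layout of the equations, and the reshape pairs the $i$-th RGB-half token with the $i$-th geometric-half token, which the text does not spell out.

```python
def modality_fusion_input(c, g):
    """Concatenate RGB and geometric tokens along the sequence axis (2N tokens)."""
    return c + g


def regroup_per_point(f1, n):
    """Turn 2N fused tokens into N tokens of doubled width.

    Assumption: token i (RGB half) is paired with token n + i (geometric half).
    """
    return [f1[i] + f1[n + i] for i in range(n)]


def pointwise_fusion_input(c, g, f1_prime):
    """Concatenate c, g, and f1' along the feature axis for each point."""
    return [ci + gi + fi for ci, gi, fi in zip(c, g, f1_prime)]


def identity_encoder(tokens):
    """Placeholder standing in for a transformer encoder (keeps shapes unchanged)."""
    return tokens


N, d_emb = 4, 3
c = [[float(i)] * d_emb for i in range(N)]    # RGB tokens
g = [[float(-i)] * d_emb for i in range(N)]   # geometric tokens

seq = modality_fusion_input(c, g)                          # 2N tokens of width d_emb
f1_prime = regroup_per_point(identity_encoder(seq), N)     # N tokens of width 2 * d_f1
stage2_in = pointwise_fusion_input(c, g, f1_prime)         # N tokens of width 2*d_emb + 2*d_f1
f2 = identity_encoder(stage2_in)                           # stand-in for E2
```

With $N=4$ and $d_{emb}=3$, the first stage attends over 8 tokens, while the second stage attends over 4 tokens of width 12, mirroring the doubling of sequence length in the first stage and of feature width in the second.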

C.2 Hyper Parameters

Details on Fusion Stages’ Hyper Parameters. We extracted 1000 pixels from the decoded RGB representation, corresponding to the same number of points in the LiDAR point set. Both the extracted RGB and geometric features are linearly projected to 256-D before being fused. In the final experiments, we used an 8-layer transformer encoder with 4 attention heads for the modality fusion stage and a 4-layer transformer encoder with 8 attention heads for the point-wise fusion stage.

Training Strategies. For DTTDNet, a learning rate warm-up schedule is used so that our transformer-based model can escape local minima in the early stage and be trained more effectively. Based on empirical evaluation, in the first epoch the learning rate $lr$ increases linearly from $0$ to $10^{-5}$. In the subsequent epochs, it is decreased with a cosine scheduler to the final learning rate $min\_lr=10^{-6}$. Additionally, following DenseFusion (Wang et al., 2019), we also decay the learning rate by a certain ratio when the average error falls below a certain threshold during training. Detailed code and parameters will be publicly available in our code repository. Moreover, we set the importance factor $\lambda$ of the CDL to $0.3$ and the initial balancing weight $w$ to $0.015$ by empirical testing.
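The warm-up and cosine phases of this schedule can be sketched as a single function of the training step. This is a minimal illustration under stated assumptions: the learning rate is updated per step, the arguments `steps_per_epoch` and `total_epochs` are hypothetical, and the additional threshold-based decay borrowed from DenseFusion is not modeled.

```python
import math


def learning_rate(step, steps_per_epoch, total_epochs, peak_lr=1e-5, min_lr=1e-6):
    """Linear warm-up over the first epoch, then cosine decay to min_lr."""
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * total_epochs
    if step < warmup_steps:
        # Warm-up: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warm-up schedule completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule once per epoch for a hypothetical 10-epoch run.
schedule = [learning_rate(s, 100, 10) for s in range(0, 1001, 100)]
```

The sampled schedule starts at zero, peaks at $10^{-5}$ at the end of the first epoch, and decays smoothly to $10^{-6}$ by the final step.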

Appendix D More Experimental Details

D.1 Baseline Implementation Details

For all the baseline methods(Wang et al., 2019; Labbé et al., 2022; Mo et al., 2022; Wen et al., 2023a) that we adopted, we did not integrate any additional iterative refinement processes (e.g., ICP(Besl & McKay, 1992)) for fair comparison.

ES6D (Mo et al., 2022). We preprocessed the training datasets, including both DTTD-Mobile and YCB video, according to the original paper and official codebase of ES6D (Mo et al., 2022). This preprocessing includes removing some high-noise data, normalizing 3D translation, averaging the xyz map, and filtering out outliers in the point cloud. For the test set, to ensure a fair comparison with other baselines, we did not adopt noise reduction methods that require prior knowledge based on ground-truth data, nor did we exclude any high-noise data samples. We ensured that all test samples were retained and that their errors were calculated in the same way as for the other baselines.

BundleSDF(Wen et al., 2023a). The object-centric camera pose coordinates outputted by BundleSDF(Wen et al., 2023a) are based on its own embedded geometric reconstruction. In order to compute metrics in the same coordinate system and with the same CAD models as other baselines, we aligned the camera trajectory computed for each scene-object combination with the ground truth camera pose through trajectory-wise alignment based on the Umeyama algorithm (Umeyama, 1991).
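For readers unfamiliar with the alignment step, the sketch below shows a standard closed-form Umeyama similarity alignment between two sets of corresponding 3D points (for example, camera positions along an estimated and a ground-truth trajectory). It is not the evaluation code used for BundleSDF: the function name `umeyama`, the use of NumPy, and the choice to estimate scale are assumptions made for illustration.

```python
import numpy as np


def umeyama(src, dst, with_scale=True):
    """Least-squares similarity transform mapping src points onto dst.

    src, dst: (N, 3) arrays of corresponding points.
    Returns scale s, rotation R (3x3), translation t with dst ~= s * R @ src + t.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: recover a known rotation about z, scale, and translation.
rng = np.random.default_rng(0)
src_pts = rng.normal(size=(50, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst_pts = 2.0 * src_pts @ R_true.T + t_true
s_est, R_est, t_est = umeyama(src_pts, dst_pts)
```

Once the transform is estimated per scene-object combination, the estimated poses can be expressed in the ground-truth coordinate system before computing the metrics.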

D.2 More Ablation Studies and Analysis

Attention Map Visualization. To visualize what our fusion module learns during the training process, we draw on previous studies (Trockman & Kolter, 2023; Guo et al., 2023) and represent our attention map as $a_{i,j}$ described in Section C.1. Taking two objects (itoen_green_tea and black_marker) as examples, Fig. 10 displays the attention maps produced by different attention heads in the two fusion stages. We showcase the attention maps generated by the modality fusion and point-wise fusion at their respective final layers. The modality fusion part reveals distinct quadrant-like patterns, reflecting differences in how the two modalities fuse. The lower-left and upper-right quadrants offer insights into the degree of RGB and geometric feature fusion. The point-wise fusion part exhibits a striped pattern and shows that it attends to the significance of specific tokens during training.

Figure 10: Visualization of attention map outputs from the modality fusion stage (the larger maps in the first row) and the point-wise fusion stage (the smaller ones in the second row) on two objects (itoen_green_tea and black_marker). Because features are concatenated differently in the two fusion stages, the token sequence length in modality fusion is twice that in point-wise fusion. The attention maps produced in the final layers of modality fusion and point-wise fusion are of sizes $2000\times 2000$ and $1000\times 1000$, respectively.

D.3 More Results on YCB video Dataset

Due to the lower depth noise in the YCB video dataset (YCB’s average depth-ADD is 0.009, while DTTD-Mobile’s average depth-ADD is 0.239), we adopted a simplified DTTDNet model structure, omitting the GFF module, to seek faster training convergence. Additionally, the reference point set used for computing Chamfer distance loss was directly extracted from the depth map. Furthermore, regarding hyper-parameters, we chose 0 layers for modality fusion and 6 self-attention layers for point-wise fusion.

On the YCB video dataset, as shown in Table 7, DTTDNet achieves the highest average performance, with an ADD-S AUC of 94.19 and an ADD-S (2cm) of 96.14. DenseFusion (Wang et al., 2019) is the second best, with an ADD-S AUC of 91.20 and an ADD-S (2cm) of 95.30. Although BundleSDF (Wen et al., 2023a) shows strong performance across many object classes, it struggles with pose estimation for some objects, primarily due to its inability to reconstruct 3D models in the presence of occlusions; its ADD-S AUC is 86.31, and its ADD-S (2cm) is 87.64. In contrast, MegaPose-RGBD (Labbé et al., 2022) and ES6D (Mo et al., 2022) lag behind in performance, particularly in the number of object classes on which they perform best, with ADD-S AUC scores of 82.64 and 78.14 and ADD-S (2cm) scores of 84.81 and 75.35, respectively.

Table 7: Comparison with diverse 6DoF pose estimation baselines on the YCB video dataset. Following prior work (Wang et al., 2019), we report ADD-S AUC and ADD-S (2cm) on all 21 objects; higher is better. Note that the left-most numeric column indicates the per-object average depth-ADD error. Objects with bold names are symmetric.
| Object | Average depth-ADD | DenseFusion (Wang et al., 2019): ADD-S AUC | DenseFusion: ADD-S (2cm) | MegaPose-RGBD (Labbé et al., 2022): ADD-S AUC | MegaPose-RGBD: ADD-S (2cm) | ES6D (Mo et al., 2022): ADD-S AUC | ES6D: ADD-S (2cm) | BundleSDF (Wen et al., 2023a): ADD-S AUC | BundleSDF: ADD-S (2cm) | DTTDNet (Ours): ADD-S AUC | DTTDNet (Ours): ADD-S (2cm) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| master_chef_can | 0.005 | 95.20 | 100.00 | 79.11 | 69.88 | 82.47 | 73.56 | 97.05 | 100.00 | 96.32 | 100.00 |
| cracker_box | 0.005 | 92.50 | 99.30 | 74.98 | 80.65 | 81.09 | 84.68 | 90.69 | 87.67 | 92.92 | 98.04 |
| sugar_box | 0.009 | 95.10 | 100.00 | 81.42 | 90.95 | 95.97 | 97.80 | 97.79 | 100.00 | 96.76 | 100.00 |
| tomato_can | 0.011 | 93.70 | 96.90 | 86.11 | 94.79 | 89.02 | 92.71 | 68.27 | 70.00 | 96.69 | 99.17 |
| mustard_bottle | 0.005 | 95.90 | 100.00 | 87.41 | 99.72 | 93.13 | 87.11 | 98.21 | 100.00 | 97.39 | 100.00 |
| tuna_can | 0.013 | 94.90 | 100.00 | 91.03 | 100.00 | 74.86 | 74.22 | 91.11 | 100.00 | 95.78 | 100.00 |
| pudding_box | 0.005 | 94.70 | 100.00 | 89.65 | 100.00 | 90.13 | 98.60 | 97.67 | 100.00 | 93.24 | 95.33 |
| gelatin_box | 0.011 | 95.80 | 100.00 | 87.17 | 99.07 | 97.39 | 100.00 | 98.46 | 100.00 | 97.97 | 100.00 |
| potted_meat | 0.008 | 90.10 | 93.10 | 77.88 | 80.81 | 78.56 | 75.46 | 62.00 | 58.22 | 93.56 | 92.04 |
| banana | 0.007 | 91.50 | 93.90 | 76.18 | 71.77 | 92.83 | 84.70 | 97.72 | 100.00 | 94.52 | 100.00 |
| pitcher_base | 0.007 | 94.60 | 100.00 | 91.26 | 100.00 | 93.67 | 90.18 | 96.53 | 100.00 | 95.76 | 100.00 |
| bleach_cleanser | 0.007 | 94.30 | 99.80 | 82.97 | 74.05 | 88.12 | 87.76 | 69.67 | 71.72 | 93.30 | 99.61 |
| **bowl** | 0.021 | 86.60 | 69.50 | 83.80 | 59.61 | 2.68 | 0.00 | 97.57 | 100.00 | 84.33 | 51.23 |
| mug | 0.010 | 95.50 | 100.00 | 86.63 | 97.64 | 88.58 | 89.15 | 97.09 | 100.00 | 97.00 | 99.53 |
| power_drill | 0.010 | 92.40 | 97.10 | 88.86 | 97.73 | 85.43 | 78.62 | 97.17 | 99.81 | 95.57 | 99.81 |
| **wood_block** | 0.007 | 85.50 | 93.40 | 35.55 | 0.41 | 29.56 | 1.24 | 19.57 | 0.00 | 87.63 | 90.50 |
| scissors | 0.010 | 96.40 | 100.00 | 26.52 | 7.18 | 33.38 | 27.07 | 93.25 | 97.24 | 71.68 | 20.99 |
| large_marker | 0.011 | 94.70 | 99.20 | 83.02 | 67.13 | 90.08 | 87.96 | 95.08 | 93.83 | 95.66 | 97.22 |
| **large_clamp** | 0.011 | 71.60 | 78.50 | 85.93 | 90.03 | 43.74 | 17.84 | 96.77 | 99.16 | 90.99 | 99.44 |
| **extra_large_clamp** | 0.012 | 69.00 | 69.50 | 76.49 | 88.32 | 66.77 | 69.35 | 95.15 | 100.00 | 89.70 | 93.11 |
| **foam_brick** | 0.007 | 92.40 | 100.00 | 84.29 | 92.36 | 26.11 | 27.08 | 0.00 | 0.00 | 95.63 | 100.00 |
| Average | 0.009 | 91.20 | 95.30 | 82.64 | 84.81 | 78.14 | 75.35 | 86.31 | 87.64 | 94.19 | 96.14 |