DeCoTR: Enhancing Depth Completion With 2D and 3D Attentions
Abstract

In this paper, we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically, we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest, complex transformer-based models. Leveraging the initial depths and features from this network, we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it, allowing the model to explicitly learn and exploit 3D geometric features. In addition, we propose normalization techniques to process the point cloud, which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore, we incorporate global attention on downsampled point cloud features, which enables long-range context while still being computationally feasible. We evaluate our method, DeCoTR, on established depth completion benchmarks, including NYU Depth V2 and KITTI, showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches.

1. Introduction

Depth is crucial for 3D perception in various downstream applications, such as autonomous driving, augmented and virtual reality, and robotics [1, 2, 8, 9, 11, 33–35, 43, 50, 51]. However, sensor-based depth measurement is far from perfect. Such measurements often exhibit sparsity, low resolution, noise interference, and incompleteness. Various factors, including environmental conditions, motion, sensor power constraints, and the presence of specular, transparent, wet, or non-reflective surfaces, contribute to these limitations. Consequently, the task of depth completion, aimed at generating dense and accurate depth maps from sparse measurements alongside aligned camera images, has emerged as a pivotal research area [4, 5, 24, 26, 27, 29, 41, 45].

Thanks to the advances in deep learning, there has been significant progress in depth completion. Earlier papers leverage convolutional neural networks to perform depth completion with image guidance and achieve promising results [3, 27, 37]. In order to improve accuracy, researchers have studied various spatial propagation methods [4, 24, 25, 29], which perform further iterative processing on top of depth maps and features computed by an initial network. Most existing solutions build on this in the last stage of their depth completion pipeline to improve performance [17, 45]. These propagation algorithms, however, focus on 2D feature processing and do not fully exploit the 3D nature of the problem. A few recent papers utilize transformers for depth completion [32, 45]. However, they apply transformer operations mainly to improve feature learning on the 2D image plane and fail to achieve acceptable accuracy without employing spatial propagation.

Several studies have looked into harnessing 3D representation more comprehensively. For instance, [18, 49] construct a point cloud from the input sparse depth, yet coping with extreme sparsity poses challenges in effective feature learning. Another approach, as seen in [26], uplifts 2D features to 3D by using the initial dense depth predicted by a simple convolutional network, but it is impeded by the poor accuracy of the initial network and requires dynamic propagations to attain acceptable accuracy. Very recently, researchers have proposed employing transformers for 3D feature learning in depth completion [44]; however, this work applies transformer layers to extremely sparse points, which is ineffective for learning informative 3D features.

Here, we introduce DeCoTR to perform feature learning in full 3D. It accomplishes this by constructing a dense feature point cloud derived from completed depth values obtained from an initial network and subsequently applying transformer processing to these 3D points. To do this properly, it is essential to have reasonably accurate initial depths.

∗ Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Figure 1. Example depth completion results on NYU Depth v2 dataset [36]. We first upgrade S2D (c) to S2D-TR with efficient attention
on 2D features, which significantly improves the initial depth completion accuracy (d). Based on the more accurate initial depths, DeCoTR
uplifts 2D features to form a 3D point cloud and leverages cross-attention on 3D points, which leads to highly accurate depth completion,
with sharp details and close-to-GT quality (e). We highlight sample regions where we can clearly see progressively improving depths by
using our proposed designs.
As such, we first enhance a commonly used convolution-based initial depth network, S2D [27], by integrating transformer layers on bottleneck and skip connection features. This upgraded model, termed S2D-TR, achieves significantly improved depth accuracy, on par with state-of-the-art models, without requiring any iterative spatial propagation.

Given the initial depth map, we uplift 2D features to 3D to form a point cloud, which is subsequently processed by transformer layers, to which we refer as 3D-TR layers. Prior to feeding the points to the transformer layers, we normalize them, which regularizes the 3D feature learning and leads to better accuracy. In each 3D-TR layer, we follow standard practice [40, 46] to perform neighborhood-based attention, as global attention would be computationally intractable when the number of points is large. To facilitate long-range contextual understanding, we additionally incorporate global attention on lower-scale versions of the point cloud. Finally, the 3D features are projected back to the 2D image plane and consumed by a decoder to produce the final depth prediction. As we shall see in the paper, our proposed transformer-based learning in full 3D provides considerably improved accuracy and generalizability for depth completion; see Fig. 1 for a visual example.

In summary, our main contributions are as follows:
• We present DeCoTR, a novel transformer-based approach to perform full 3D feature learning for depth completion. This enables high-quality depth estimation without requiring iterative processing steps.
• In order to do this properly, we upgrade the commonly used initial network, S2D, by enhancing its bottleneck and skip connection features using transformers. The resulting model, S2D-TR, performs on par with SOTA and provides more accurate depths for subsequent 3D learning.
• We devise useful techniques to normalize the uplifted 3D feature point cloud, which improves the model learning. We additionally apply low-resolution global attention to 3D points, which enhances long-range understanding without making computation infeasible.
• Through extensive evaluations on standard benchmarks, NYU Depth v2 [36] and KITTI [14], we demonstrate the efficacy of DeCoTR and show that it sets the new SOTA, e.g., the new best result on NYU Depth v2. Our zero-shot testing on ScanNet [7] and DDAD [15] further showcases the better generalizability of our model as compared to existing methods.

2. Related Works

Depth completion: Early depth completion approaches [16, 22, 38] rely solely on the sparse depth measurements to estimate the dense depth. Since these methods do not utilize the image, they usually suffer from artifacts like blurriness, especially at object boundaries. Later, image-guided depth completion alleviates these issues by incorporating the image. S2D [27], one of the first papers on this, leverages a convolutional network to consume both the image and the sparse depth map. Subsequent papers design more sophisticated convolutional models for depth completion [3, 19, 31, 37, 48]. In order to enhance depth quality, researchers have studied various spatial propagation algorithms [4, 5, 24, 29]. These solutions utilize depth values and features given by an initial network (usually S2D) and perform iterative steps to mix and aggregate features on the 2D image plane. In many recent papers, it has become common practice to use spatial propagation on top of the proposed depth completion network in order to achieve state-of-the-art accuracy [17, 28, 45]. Some recent works more tightly integrate iterative processing into the network, using architectures like recurrent networks [39] and repetitive hourglass networks [42].

While existing solutions predominantly propose architectures to process features in 2D, several works explore 3D representations. For instance, [18, 44, 49] consider the sparse depth as a point cloud and learn features from it. However, the extremely sparse points present a challenge to feature learning.
Figure 2. Overview of our proposed DeCoTR. The input RGB image and sparse depth map are first processed by our S2D-TR, which
upgrades S2D with efficient 2D attentions. The learned 2D guidance features from S2D-TR are then uplifted to form a 3D feature point
cloud based on the initial completed depth map. We normalize the point cloud and feed it through multiple 3D cross-attention layers (3D-
TR) to enable geometry-aware feature learning and processing. We also introduce efficient global attention to capture long-range scene
context. The attended 3D features from 3D-TR are projected back to 2D and given to a decoder to output the final completed depth map.
One of these works, GraphCSPN [26], employs S2D as an initial network to generate the full depth map, before creating a denser point cloud and performing feature learning on it. However, this is limited by the insufficient accuracy of the initial depths from S2D and still needs iterative processing to achieve good accuracy.

Vision transformer: Ever since its introduction [10], vision transformers have been extensively studied and utilized for various computer vision tasks, including classification, detection, segmentation, depth estimation, tracking, 3D reconstruction, and more. We refer readers to existing surveys for a more comprehensive coverage of these works. More related to our paper are those works that leverage vision transformers for depth completion, such as CompletionFormer [45] and GuideFormer [32]. While they demonstrate the effectiveness of using vision transformers for depth completion, their feature learning is only performed on the 2D image plane. A very recent paper, PointDC [44], proposes to apply transformers to the 3D point cloud in the depth completion pipeline. However, PointDC operates on very sparse points, which makes it challenging to learn 3D features.

3. Method

In this section, we present our proposed approach, DeCoTR, powered by efficient 2D and powerful 3D attention learning. The overall pipeline of DeCoTR is shown in Fig. 2.

3.1. Problem Setup

Given an aligned sparse depth map S ∈ R^{H×W} and an RGB image I ∈ R^{H×W×3}, the goal of image-guided depth completion is to recover a dense depth map D ∈ R^{H×W} based on S and with semantic guidance from I. The underlying reasoning is that visually similar adjacent regions are likely to have similar depth values. Formally, we have

D = H(S, I),   (1)

where H is a depth completion model to be learned.

It is a common approach to employ early fusion between the depth and RGB modalities [26, 27, 29]. This has the advantage of enabling features to contain both RGB and depth information early on, so that the model can learn to rectify incorrect depth values by leveraging neighboring, similar pixels that have correct depths. We follow the same practice, first encoding the RGB image I and the sparse depth S with two separate convolutions to obtain image and depth features f_I and f_S, respectively:

f_{I} = \text{conv}_{rgb}(I), \quad f_{S} = \text{conv}_{dep}(S)   (2)

which are then concatenated channel-wise to generate the initial fused feature f_1 ∈ R^{C_1×H_1×W_1}.
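For illustration, below is a minimal PyTorch sketch of this early-fusion step (Eq. 2); the channel widths and kernel size are assumptions for the example, not the exact configuration used in DeCoTR.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Encode RGB and sparse depth separately, then concatenate channel-wise (Eq. 2)."""

    def __init__(self, rgb_channels=48, depth_channels=16):  # channel widths are assumptions
        super().__init__()
        self.conv_rgb = nn.Conv2d(3, rgb_channels, kernel_size=3, padding=1)
        self.conv_dep = nn.Conv2d(1, depth_channels, kernel_size=3, padding=1)

    def forward(self, image, sparse_depth):
        f_i = self.conv_rgb(image)            # f_I = conv_rgb(I)
        f_s = self.conv_dep(sparse_depth)     # f_S = conv_dep(S)
        return torch.cat([f_i, f_s], dim=1)   # fused feature f_1

# usage (hypothetical sizes):
# f1 = EarlyFusion()(torch.rand(1, 3, 228, 304), torch.rand(1, 1, 228, 304))
```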
3.2. Enhancing Baseline with Efficient 2D Attention

The early-fusion architecture of S2D [27] has been commonly used by researchers as a base network to predict an initial completed depth map (e.g., [26, 29]). Given the initial fused feature f_1, S2D continues to encode f_1 and generates multi-scale features f_m ∈ R^{C_m×H_m×W_m} for m = {2, ..., 5}, where C_m, H_m, W_m are the number of channels, height, and width of the feature map. A conventional decoder with convolutional and upsampling layers is used to consume these features, where the smallest feature map is directly fed to the decoder and larger ones are fed via skip connections. The decoder has two prediction branches, one producing a completed depth map and the other generating a guidance feature g.

This architecture, however, has limited accuracy and may provide erroneous depth values for subsequent operations, such as 2D spatial propagation or 3D representation learning. As such, we propose to leverage self-attention to enhance the S2D features. More specifically, we apply Multi-Headed Self-Attention (MHSA) to each f_m. Since MHSA incurs quadratic complexity in both time and memory w.r.t. the input resolution, in order to avoid the intractable costs of processing high-resolution feature maps, we first employ depth-separable convolutions [6] to summarize large feature maps and reduce their resolutions to the same size (in terms of height, width, and channel) as the smallest feature map. The downsized feature maps are denoted as f̃_m ∈ R^{C_k×H_k×W_k}, ∀m > 1. Three linear layers are used to derive the query q̃_m ∈ R^{N_k×C_k}, key k̃_m ∈ R^{N_k×C_k}, and value ṽ_m ∈ R^{N_k×C_k} for each f̃_m, where N_k = H_k · W_k. Next, we apply self-attention for each f̃_m:

\tilde{f}^{A}_m = \text{softmax}\Big(\frac{\tilde{q}_m \tilde{k}_m^{\top}}{\sqrt{C_k}}\Big)\tilde{v}_m,   (3)

where f̃^A_m denotes the attended features. These features are restored to their original resolutions by using depth-separable de-convolutions, and we denote the restored versions as f^A_m. Finally, we apply a residual addition to obtain the enhanced feature map for each scale:

f^E_m = f^{A}_m + f_m, \quad \forall m = \{2, ..., 5\}.   (4)

The enhanced feature maps are then consumed by the decoder to predict the initial completed depth map and guidance feature map. Our upgraded version of S2D with efficient attention enhancement, denoted as S2D-TR, provides significantly improved accuracy while having better efficiency than the latest transformer-based depth completion models. For instance, S2D-TR achieves a lower RMSE (0.094 vs. 0.099) with ∼50% less computation as compared to CompletionFormer without spatial propagation [45].
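The sketch below illustrates this efficient 2D attention scheme under simplified assumptions: a single depth-separable downsampling step, PyTorch's built-in nn.MultiheadAttention as a stand-in for the MHSA of Eq. (3), and a transposed convolution for the restoration step, followed by the residual addition of Eq. (4). It assumes even spatial dimensions; the module names, head count, and channel sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DepthSeparableDown(nn.Module):
    """Depthwise-separable convolution that halves the spatial resolution."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class Efficient2DAttention(nn.Module):
    """Downsize f_m, apply MHSA on the tokens (Eq. 3), restore the resolution, add residual (Eq. 4)."""

    def __init__(self, c_in, c_k, num_heads=4):  # head count is an assumption
        super().__init__()
        self.down = DepthSeparableDown(c_in, c_k)
        self.attn = nn.MultiheadAttention(c_k, num_heads, batch_first=True)
        self.up = nn.ConvTranspose2d(c_k, c_in, kernel_size=2, stride=2)

    def forward(self, f_m):
        x = self.down(f_m)                        # downsized map, C_k x H_k x W_k
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # N_k x C_k tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        x = attended.transpose(1, 2).reshape(b, c, h, w)
        f_a = self.up(x)                          # restore to the original resolution
        return f_a + f_m                          # f^E_m = f^A_m + f_m

# usage (hypothetical sizes; H and W must be even):
# out = Efficient2DAttention(c_in=64, c_k=32)(torch.rand(1, 64, 56, 76))
```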
3.3. Feature Cross-Attention in 3D

Considering the 3D nature of depth completion, it is important for the model to properly exploit 3D geometric information when processing the features. To enable this, we first un-project the 2D guidance feature from S2D-TR, based on the initial completed depth, to form a 3D point cloud. This is done as follows, assuming a pinhole camera model with known intrinsic parameters:

\begin{bmatrix} x\\ y\\ z\\ 1 \end{bmatrix} = d \begin{bmatrix} 1/\gamma_u & 0 & -c_u/\gamma_u & 0\\ 0 & 1/\gamma_v & -c_v/\gamma_v & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u\\ v\\ 1\\ 1/d \end{bmatrix}   (5)

where γ_u, γ_v are focal lengths, (c_u, c_v) is the principal point, and (u, v) and (x, y, z) are the 2D pixel coordinates and 3D coordinates, respectively.
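A minimal sketch of this un-projection (Eq. 5) is given below; the function name and the example intrinsics in the usage line are hypothetical.

```python
import torch

def backproject_to_points(depth, fx, fy, cx, cy):
    """Un-project a dense depth map (H x W) into an (H*W) x 3 point cloud following Eq. (5).

    fx, fy play the role of the focal lengths (gamma_u, gamma_v); (cx, cy) is the principal point.
    """
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    z = depth
    x = (u - cx) / fx * z   # x = d * (u - c_u) / gamma_u
    y = (v - cy) / fy * z   # y = d * (v - c_v) / gamma_v
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

# usage with hypothetical intrinsics:
# points = backproject_to_points(torch.rand(228, 304) * 10.0, fx=500.0, fy=500.0, cx=152.0, cy=114.0)
```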
Given the large number of 3D points uplifted from the 2D feature map, it is computationally intractable to perform attention on all the points simultaneously. As such, we adopt a neighborhood-based attention, by finding the K-Nearest-Neighboring (KNN) points for each point in the point cloud, p_i ∈ R^3, which we denote as N(i).

To bake 3D geometric relationships into the feature learning process, we perform cross-attention between the feature of each point and the features of its neighboring points. Concretely, we modify the original point transformer [40, 46] to implement this. For each point p_i, linear projections are first applied to transform its guidance feature g_i to query q_i, key k_i, and value v_i. Following [46], we use vector attention, which creates attention weights to modulate individual feature channels. More specifically, the 3D cross-attention is performed as follows:

a_{ij} = w(\phi(q_i, k_j)),   (6)
g_i^{a} = \sum_{j\in\mathcal{N}(i)} \text{softmax}(A_i)_j \odot v_j,   (7)

where ϕ is a relation function to capture the similarity between a pair of input point features (we use subtraction here), w is a learnable encoding function that computes attention scores to re-weight the channels of the value, A is the attention weight matrix whose entries are a_{ij} for points p_i and p_j, g_i^a denotes the output feature after cross-attention for p_i, and ⊙ denotes the Hadamard product. We perform such 3D cross-attention in multiple transformer layers, to which we refer as 3D-TR layers.
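Below is a minimal sketch of this KNN-based vector cross-attention (Eqs. 6–7). It uses subtraction as the relation function ϕ and a small MLP as the learnable encoding w; the positional embedding is omitted for brevity, and the neighborhood size and layer names are assumptions.

```python
import torch
import torch.nn as nn

class NeighborhoodVectorAttention(nn.Module):
    """Vector cross-attention over KNN neighborhoods (Eqs. 6-7); positional encoding omitted for brevity."""

    def __init__(self, dim, k=16):  # neighborhood size is an assumption
        super().__init__()
        self.k = k
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # learnable encoding function w(.) that produces per-channel attention scores
        self.weight_fn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points, feats):
        # points: (N, 3) coordinates, feats: (N, C) guidance features
        dists = torch.cdist(points, points)                   # pairwise distances
        knn_idx = dists.topk(self.k, largest=False).indices   # indices of N(i), including the point itself
        q = self.to_q(feats)                                  # (N, C)
        k = self.to_k(feats)[knn_idx]                         # (N, k, C)
        v = self.to_v(feats)[knn_idx]                         # (N, k, C)
        rel = q.unsqueeze(1) - k                              # phi(q_i, k_j): subtraction relation
        attn = torch.softmax(self.weight_fn(rel), dim=1)      # softmax over the neighborhood
        return (attn * v).sum(dim=1)                          # g_i^a: channel-wise re-weighted aggregation

# usage: out = NeighborhoodVectorAttention(64)(torch.rand(1024, 3), torch.rand(1024, 64))
```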
While it is possible to directly use existing point transformers off the shelf, we find that this is not optimal for depth completion. Specifically, we incorporate the following technical designs to improve the 3D feature learning process.

Point cloud normalization: We normalize the constructed point cloud from the S2D-TR outputs into a unit ball, before proceeding to the 3D attention layers. We find this technique effectively improves depth completion, as we shall show in the experiments.

Positional embedding: Instead of the positional embedding multiplier proposed in [40], we adopt the conventional one based on relative position difference. We find the more complex positional embedding multiplier does not benefit the learning and incurs additional computational cost.
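A possible implementation of the unit-ball normalization is sketched below; the paper does not spell out the exact centering and scaling, so mean-centering followed by max-radius scaling is an assumption.

```python
import torch

def normalize_to_unit_ball(points):
    """Center the point cloud and scale it into a unit ball before the 3D attention layers."""
    centered = points - points.mean(dim=0, keepdim=True)
    radius = centered.norm(dim=-1).max().clamp(min=1e-8)
    return centered / radius

# usage: normalized = normalize_to_unit_ball(torch.rand(1024, 3) * 10.0)
```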
3.4. Capturing Global Context in 3D

The 3D cross-attention discussed previously updates each point feature only based on the point's estimated 3D neighborhood, in order to maintain computational tractability given the quadratic complexity of attention w.r.t. the number of points. However, global or long-range scene context is also important for the model to develop accurate 3D understanding. To enable global understanding while keeping computation costs under control, we propose to perform global 3D cross-attention only on a downsampled point set, at the last encoding stage of the point transformer. In this case, we use the scalar attention as follows:

g_i^{ga} = \sum_{j\neq i}\text{softmax}\Big(\frac{\langle q_i, k_j\rangle}{\sqrt{C_g}}\Big) v_j,   (8)

where ⟨·⟩ denotes the dot product and C_g is the embedding dimension. We apply the global attention after the local neighborhood-based attentions.
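The following sketch illustrates the scalar global attention of Eq. (8) on an already-downsampled point set; for simplicity it keeps the diagonal (j = i) term, and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class GlobalScalarAttention(nn.Module):
    """Scaled dot-product (scalar) attention over a downsampled point set (Eq. 8)."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (M, C_g) features of the downsampled points
        q, k, v = self.to_q(feats), self.to_k(feats), self.to_v(feats)
        attn = torch.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)  # <q_i, k_j> / sqrt(C_g)
        return attn @ v                                                 # g_i^{ga}

# usage: g = GlobalScalarAttention(64)(torch.rand(256, 64))
```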
3.5. Training

We train DeCoTR with a masked ℓ1 loss between the final completed depth maps and the ground-truth depth maps, following standard practice as in [26, 29]. More formally, the loss is given by

\mathcal{L}(D^{gt}, D^{pred}) = \frac{1}{N}\sum_{i,j}\mathbb{I}_{\{d_{i,j}^{gt}>0\}}\Big|d_{i,j}^{gt} - d_{i,j}^{pred}\Big|,   (9)

where \mathbb{I} is the indicator function, d_{i,j}^{gt} ∈ D^{gt} and d_{i,j}^{pred} ∈ D^{pred} represent pixel-wise depths in the ground-truth and predicted depth maps, respectively, and N is the total number of valid pixels.
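A minimal sketch of the masked ℓ1 loss of Eq. (9); the function name is illustrative.

```python
import torch

def masked_l1_loss(pred, gt):
    """Masked L1 loss (Eq. 9): average absolute error over pixels with valid ground truth (d_gt > 0)."""
    valid = gt > 0
    n_valid = valid.sum().clamp(min=1)
    return (pred[valid] - gt[valid]).abs().sum() / n_valid

# usage: loss = masked_l1_loss(torch.rand(2, 1, 228, 304), torch.rand(2, 1, 228, 304))
```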
4. Experiments

We conduct extensive experiments to evaluate our proposed DeCoTR on standard depth completion benchmarks and compare with the latest state-of-the-art (SOTA) solutions. We further perform zero-shot evaluation to assess model generalizability and carry out ablation studies to analyze different parts of our proposed approach.

4.1. Experimental Setup

Datasets: We perform standard depth completion evaluations on NYU Depth v2 (NYUD-v2) [36] and KITTI Depth Completion (KITTI-DC) [13, 14], and generalization tests on ScanNet-v2 [7] and DDAD [15]. These datasets cover a variety of indoor and outdoor scenes. We follow the sampling settings from existing works to create the input sparse depth [26, 29].

NYUD-v2 provides RGB images and depth maps captured by a Kinect device from 464 different indoor scenes. We use the official split: 249 scenes for training and the remaining 215 for testing. Following the common practice [26, 29, 45], we sample ∼50,000 images from the training set and resize the images from 480 × 640 first to half size and then to 228 × 304 with center cropping. We use the official test set of 654 images for evaluation.

KITTI is a large real-world dataset in the autonomous driving domain, with over 90,000 paired RGB images and LiDAR depth measurements. There are two versions of the KITTI dataset used for depth completion. One is from [27], which consists of 46,000 images from the training sequences for training and a random subset of 3,200 images from the test sequences for evaluation. The other is the KITTI Depth Completion (KITTI-DC) dataset, which provides 86,000 training, 6,900 validation, and 1,000 testing samples with corresponding raw LiDAR scans and reference images. We use KITTI-DC to train and test our model on the official splits.

ScanNet-v2 contains 1,513 room scans reconstructed from RGB-D frames. The dataset is divided into 1,201 scenes for training and 312 for validation, and provides an additional 100 scenes for testing. For sparse input depths, we sample point clouds from the vertices of the reconstructed meshes. We use the 100 test scenes to evaluate depth completion performance, with 20 frames randomly selected per scene. We remove samples where more than 10% of the ground-truth depth values are missing, resulting in 745 test frames across all 100 test scenes.

DDAD is an autonomous driving dataset collected in the U.S. and Japan using a synchronized 6-camera array, featuring long-range (up to 250m) and diverse urban driving scenarios. Following [15], we downsample the images from the original resolution of 1216 × 1936 to 384 × 640. We use the official 3,950 validation samples for evaluation. Since less than 5% of the ground-truth depth values remain valid after downsampling, for our method and all compared methods we sample all available valid depth points so that reasonable results are generated.

Implementation Details: We implement our proposed approach using PyTorch [30]. We use the Adam [21] optimizer with an initial learning rate of 5 × 10^{-4}, β1 = 0.9, β2 = 0.999, and no weight decay. The per-GPU batch size is set to 8 for NYUD-v2 and 4 for KITTI-DC. All experiments are conducted on 8 NVIDIA A100 GPUs.
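As a reference, the reported optimizer configuration can be set up as follows; the model below is only a placeholder, and the learning-rate schedule, data loading, and training loop are omitted.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder for the DeCoTR network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=0.0)
```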
Evaluation: We use standard metrics to evaluate depth completion performance [12], including Root Mean Squared Error (RMSE), Absolute Relative Error (Abs Rel), δ < 1.25, δ < 1.25², and δ < 1.25³. On the KITTI-DC test set, we use the official metrics: RMSE, MAE, iRMSE, and iMAE. We refer readers to the supplementary file for detailed mathematical definitions of these metrics. The depth values are evaluated with maximum distances of 80 meters and 200 meters for KITTI and DDAD, respectively, and 10 meters for NYUD-v2 and ScanNet.
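For reference, a sketch of how these metrics are typically computed over valid pixels is given below; the exact definitions used in the paper are in its supplementary file, so this is only an approximation.

```python
import torch

def depth_metrics(pred, gt, max_depth=10.0):
    """Compute RMSE, Abs Rel, and the delta < 1.25^i accuracies over valid pixels."""
    valid = (gt > 0) & (gt <= max_depth)
    pred, gt = pred[valid], gt[valid]
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    abs_rel = ((pred - gt).abs() / gt).mean()
    ratio = torch.max(pred / gt, gt / pred)
    deltas = [(ratio < 1.25 ** i).float().mean() for i in (1, 2, 3)]
    return rmse, abs_rel, deltas

# usage: rmse, abs_rel, deltas = depth_metrics(torch.rand(1, 228, 304) * 10 + 0.1,
#                                              torch.rand(1, 228, 304) * 10 + 0.1)
```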
4.2. Results on NYUD-v2 and KITTI

On NYUD-v2: Table 1 summarizes the quantitative evaluation results on NYUD-v2. Our proposed DeCoTR approach sets the new SOTA performance, with the lowest RMSE of 0.086, outperforming all existing solutions. Even without 3D global attention, DeCoTR already provides the best accuracy, and global attention further improves it. Specifically, our DeCoTR considerably outperforms the latest SOTA methods that also leverage 3D representation and/or transformers, such as GraphCSPN, PointDC, and CompletionFormer. Note that although PointDC uses both a 3D representation and transformers, it obtains only slightly lower RMSE than methods that use neither (e.g., CompletionFormer, GraphCSPN). This indicates that the PointDC approach is suboptimal, potentially due to the extremely sparse 3D points.

Fig. 3 provides sample qualitative results on NYUD-v2. We see that DeCoTR generates highly accurate dense depth maps that are very close to the ground truth. The depth maps produced by DeCoTR capture much finer details as compared to existing SOTA methods.
Method RMSE ↓ Abs Rel ↓ δ < 1.25 ↑ δ < 1.252 ↑ δ < 1.253 ↑
S2D [27] 0.204 0.043 97.8 99.6 99.9
DeepLiDAR [31] 0.115 0.022 99.3 99.9 100.0
CSPN [4] 0.117 0.016 99.2 99.9 100.0
DepthNormal [41] 0.112 0.018 99.5 99.9 100.0
ACMNet [47] 0.105 0.015 99.4 99.9 100.0
GuideNet [37] 0.101 0.015 99.5 99.9 100.0
TWISE [19] 0.097 0.013 99.6 99.9 100.0
NLSPN [29] 0.092 0.012 99.6 99.9 100.0
RigNet [42] 0.090 0.013 99.6 99.9 100.0
DySPN [24] 0.090 0.012 99.6 99.9 100.0
CompletionFormer [45] 0.090 0.012 - - -
PRNet [23] 0.104 0.014 99.4 99.9 100.0
CostDCNet [20] 0.096 0.013 99.5 99.9 100.0
PointFusion [18] 0.090 0.014 99.6 99.9 100.0
GraphCSPN [26] 0.090 0.012 99.6 99.9 100.0
PointDC [44] 0.089 0.012 99.6 99.9 100.0
DeCoTR (ours) 0.087 0.012 99.6 99.9 100.0
DeCoTR w/ GA (ours) 0.086 0.012 99.6 99.9 100.0
Table 1. Quantitative evaluation of depth completion performance on NYU-Depth-v2. GA denotes global attention. RMSE and REL are
in meters. Methods in the top part of the table focus on feature learning and processing in 2D and those in the bottom block exploit 3D
representation. Best and second best numbers are highlighted in bold and underlined, respectively, for RMSE and Abs Rel.
Figure 3. Qualitative results on NYUD-v2. We compare with SOTA methods such as NLSPN, GraphCSPN, and CompletionFormer. Areas
where DeCoTR provides better depth accuracy are highlighted.
For instance, in the second example, our proposed approach accurately predicts the depth on the faucet despite its small size in the images and the low contrast, while other methods struggle.

On KITTI-DC: We evaluate DeCoTR and compare with existing methods (including the latest SOTA) on the official KITTI test set, as shown in Table 2. DeCoTR achieves SOTA depth completion accuracy and is among the top-ranking methods on the KITTI-DC leaderboard.¹ We see that DeCoTR performs significantly better than existing SOTA methods that leverage 3D representations, e.g., GraphCSPN, PointDC. This indicates that DeCoTR has the right combination of dense 3D representation and transformer-based learning.

Fig. 4 shows visual examples of our completed depth maps on KITTI. DeCoTR is able to generate correct depth predictions where NLSPN produces erroneous depth values; see the highlighted areas in the figure. For instance, in the second example, DeCoTR accurately estimates the depth around the upper edge of the truck, while the depth map by NLSPN is blurry in that region.

¹ Top-5 among published methods at the time of submission, in terms of iRMSE, iMAE, and MAE.

4.3. Zero-Shot Testing on ScanNet and DDAD

Most existing papers only evaluate their models on NYUD-v2 and KITTI, without looking into model generalizability. In this part, we perform cross-dataset evaluation. More specifically, we run zero-shot testing of NYUD-v2-trained models on ScanNet-v2 and KITTI-trained models on DDAD. This allows us to understand how well our DeCoTR as well as existing SOTA models generalize to data not seen in training.

Tables 3 and 4 present evaluation results on ScanNet-v2 and DDAD, respectively. We see that DeCoTR generalizes better to unseen datasets when compared to existing SOTA models.
Method RMSE ↓ MAE ↓ iRMSE ↓ iMAE ↓
CSPN [4] 1019.64 279.46 2.93 1.15
TWISE [19] 840.20 195.58 2.08 0.82
ACMNet [47] 744.91 206.09 2.08 0.90
GuideNet [37] 736.24 218.83 2.25 0.99
NLSPN [29] 741.68 199.59 1.99 0.84
PENet [17] 730.08 210.55 2.17 0.94
GuideFormer [32] 721.48 207.76 2.14 0.97
RigNet [42] 712.66 203.25 2.08 0.90
DySPN [24] 709.12 192.71 1.88 0.82
CompletionFormer [45] 708.87 203.45 2.01 0.88
PRNet [23] 867.12 204.68 2.17 0.85
FuseNet [3] 752.88 221.19 2.34 1.14
PointFusion [18] 741.9 201.10 1.97 0.85
GraphCSPN [26] 738.41 199.31 1.96 0.84
PointDC [44] 736.07 201.87 1.97 0.87
DeCoTR (ours) 717.07 195.30 1.92 0.84
Table 2. Quantitative evaluation of depth completion performance on official KITTI Depth Completion test set. RMSE and MAE are in
millimeters, and iRMSE and iMAE are in 1/km. Similar to Table 1, methods in the top part focus on feature learning in 2D and those in
the bottom block exploit 3D representation. Best and second best numbers are highlighted in bold and underlined, respectively.
Figure 4. Qualitative results on KITTI DC. Areas where DeCoTR provides better depth accuracy are highlighted.
It is noteworthy that on DDAD, DeCoTR has significantly lower depth errors as compared to both NLSPN and CompletionFormer, despite CompletionFormer having slightly lower RMSE on the KITTI-DC test set. Moreover, in this case, CompletionFormer has even worse accuracy than NLSPN, indicating its poor generalizability.

Fig. 5 shows sample visual results of zero-shot depth completion on ScanNet-v2. DeCoTR generates highly accurate depth maps and captures fine details, e.g., the arm rest in the first example and the lamp in the second example. Other methods cannot recover the depths accurately. Fig. 6 provides qualitative results on DDAD for CompletionFormer and our DeCoTR. While this is a challenging test setting given the much larger depth range in DDAD, DeCoTR still predicts reasonable depths. In contrast, it can be seen that CompletionFormer performs very poorly on DDAD. We notice that DeCoTR's predictions are more accurate in the nearer range (e.g., on cars) and less so when the scene is far away (e.g., on trees), since KITTI training only covers depths up to 80 meters whereas DDAD has depths up to 200 meters. This is also confirmed by the lower-than-KITTI RMSE and higher-than-KITTI MAE numbers of DeCoTR on DDAD.

Method RMSE ↓ δ < 1.25 ↑
NLSPN [29] 0.198 97.3
GraphCSPN [26] 0.197 97.3
CompletionFormer [45] 0.194 97.3
DeCoTR (ours) 0.188 97.6
Table 3. Zero-shot testing on ScanNet-v2 using models trained on NYUD-v2. Best numbers are highlighted in bold.
Figure 5. Qualitative results of zero-shot inference on ScanNet-v2. Areas where DeCoTR provides better depth accuracy are highlighted.