DeCoTR: Enhancing Depth Completion With 2D and 3D Attentions
Abstract

In this paper, we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically, we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest, complex transformer-based models. Leveraging the initial depths and features from this network, we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it, allowing the model to explicitly learn and exploit 3D geometric features. In addition, we propose normalization techniques to process the point cloud, which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore, we incorporate global attention on downsampled point cloud features, which enables long-range context while still being computationally feasible. We evaluate our method, DeCoTR, on established depth completion benchmarks, including NYU Depth V2 and KITTI, showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches.

1. Introduction

Depth is crucial for 3D perception in various downstream applications, such as autonomous driving, augmented and virtual reality, and robotics [1, 2, 8, 9, 11, 33–35, 43, 50, 51]. However, sensor-based depth measurement is far from perfect. Such measurements often exhibit sparsity, low resolution, noise interference, and incompleteness. Various factors, including environmental conditions, motion, sensor power constraints, and the presence of specular, transparent, wet, or non-reflective surfaces, contribute to these limitations. Consequently, the task of depth completion, aimed at generating dense and accurate depth maps from sparse measurements alongside aligned camera images, has emerged as a pivotal research area [4, 5, 24, 26, 27, 29, 41, 45].

Thanks to the advances in deep learning, there has been significant progress in depth completion. Earlier papers leverage convolutional neural networks to perform depth completion with image guidance and achieve promising results [3, 27, 37]. In order to improve accuracy, researchers have studied various spatial propagation methods [4, 24, 25, 29], which perform further iterative processing on top of depth maps and features computed by an initial network. Most existing solutions build on this in the last stage of their depth completion pipeline to improve performance [17, 45]. These propagation algorithms, however, focus on 2D feature processing and do not fully exploit the 3D nature of the problem. A few recent papers utilize transformers for depth completion [32, 45]. However, they apply transformer operations mainly to improve feature learning on the 2D image plane and fail to achieve acceptable accuracy without employing spatial propagation.

Several studies have looked into harnessing 3D representation more comprehensively. For instance, [18, 49] construct a point cloud from the input sparse depth, yet coping with extreme sparsity poses challenges in effective feature learning. Another approach, as seen in [26], uplifts 2D features to 3D by using the initial dense depth predicted by a simple convolutional network, but it is impeded by the poor accuracy of the initial network and requires dynamic propagations to attain acceptable accuracy. Very recently, researchers have proposed employing transformers for 3D feature learning in depth completion [44]; however, this work applies transformer layers to extremely sparse points, which is ineffective for learning informative 3D features.

Here, we introduce DeCoTR to perform feature learning in full 3D. It accomplishes this by constructing a dense feature point cloud derived from completed depth values obtained from an initial network and subsequently applying transformer processing to these 3D points. To do this properly, it is essential to have reasonably accurate initial depths.

∗ Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Figure 1. Example depth completion results on NYU Depth v2 dataset [36]. We first upgrade S2D (c) to S2D-TR with efficient attention
on 2D features, which significantly improves the initial depth completion accuracy (d). Based on the more accurate initial depths, DeCoTR
uplifts 2D features to form a 3D point cloud and leverages cross-attention on 3D points, which leads to highly accurate depth completion,
with sharp details and close-to-GT quality (e). We highlight sample regions where we can clearly see progressively improving depths by
using our proposed designs.
As such, we first enhance a commonly used convolution-based initial depth network, S2D [27], by integrating transformer layers on bottleneck and skip connection features. This upgraded model, termed S2D-TR, achieves significantly improved depth accuracy, on par with state-of-the-art models, without requiring any iterative spatial propagation.

Given the initial depth map, we uplift 2D features to 3D to form a point cloud, which is subsequently processed by transformer layers, to which we refer as 3D-TR layers. Prior to feeding the points to the transformer layers, we normalize them, which regularizes the 3D feature learning and leads to better accuracy. In each 3D-TR layer, we follow standard practice [40, 46] to perform neighborhood-based attention, as global attention would be computationally intractable when the number of points is large. To facilitate long-range contextual understanding, we additionally incorporate global attention on lower-scale versions of the point cloud. Finally, the 3D features are projected back to the 2D image plane and consumed by a decoder to produce the final depth prediction. As we shall see in the paper, our proposed transformer-based learning in full 3D provides considerably improved accuracy and generalizability for depth completion; see Fig. 1 for a visual example.

In summary, our main contributions are as follows:
• We present DeCoTR, a novel transformer-based approach to perform full 3D feature learning for depth completion. This enables high-quality depth estimation without requiring iterative processing steps.
• In order to do this properly, we upgrade the commonly used initial network, S2D, by enhancing its bottleneck and skip connection features using transformers. The resulting model, S2D-TR, performs on par with SOTA and provides more accurate depths for subsequent 3D learning.
• We devise useful techniques to normalize the uplifted 3D feature point cloud, which improves the model learning. We additionally apply low-resolution global attention to 3D points, which enhances long-range understanding without making computation infeasible.
• Through extensive evaluations on standard benchmarks, NYU Depth v2 [36] and KITTI [14], we demonstrate the efficacy of DeCoTR and show that it sets the new SOTA, e.g., the new best result on NYU Depth v2. Our zero-shot testing on ScanNet [7] and DDAD [15] further showcases the better generalizability of our model as compared to existing methods.

2. Related Works

Depth completion: Early depth completion approaches [16, 22, 38] rely solely on the sparse depth measurements to estimate the dense depth. Since these methods do not utilize the image, they usually suffer from artifacts like blurriness, especially at object boundaries. Later, image-guided depth completion alleviates these issues by incorporating the image. S2D [27], one of the first papers on this, leverages a convolutional network to consume both the image and the sparse depth map. Subsequent papers design more sophisticated convolutional models for depth completion [3, 19, 31, 37, 48]. In order to enhance depth quality, researchers have studied various spatial propagation algorithms [4, 5, 24, 29]. These solutions utilize depth values and features given by an initial network (usually S2D) and perform iterative steps to mix and aggregate features on the 2D image plane. In many recent papers, it has become common practice to use spatial propagation on top of the proposed depth completion network in order to achieve state-of-the-art accuracy [17, 28, 45]. Some recent works more tightly integrate iterative processing into the network, using architectures like recurrent networks [39] and repetitive hourglass networks [42].

While existing solutions predominantly propose architectures to process features in 2D, several works explore 3D representations. For instance, [18, 44, 49] consider the sparse depth as a point cloud and learn features from it. However, the extremely sparse points present a challenge to feature learning.
Figure 2. Overview of our proposed DeCoTR. The input RGB image and sparse depth map are first processed by our S2D-TR, which
upgrades S2D with efficient 2D attentions. The learned 2D guidance features from S2D-TR are then uplifted to form a 3D feature point
cloud based on the initial completed depth map. We normalize the point cloud and feed it through multiple 3D cross-attention layers (3D-
TR) to enable geometry-aware feature learning and processing. We also introduce efficient global attention to capture long-range scene
context. The attended 3D features from 3D-TR are projected back to 2D and given to a decoder to output the final completed depth map.
One of these works, GraphCSPN [26], employs S2D as an initial network to generate the full depth map, before creating a denser point cloud and performing feature learning on it. However, this is limited by the insufficient accuracy of the initial depths from S2D and still needs iterative processing to achieve good accuracy.

Vision transformer: Ever since its introduction [10], vision transformers have been extensively studied and utilized for various computer vision tasks, including classification, detection, segmentation, depth estimation, tracking, 3D reconstruction, and more. We refer readers to existing surveys for a more comprehensive coverage of these works. More related to our paper are those works that leverage vision transformers for depth completion, such as CompletionFormer [45] and GuideFormer [32]. While they demonstrate the effectiveness of using vision transformers for depth completion, their feature learning is only performed on the 2D image plane. A very recent paper, PointDC [44], proposes to apply transformers to the 3D point cloud in the depth completion pipeline. However, PointDC operates on very sparse points, which makes it challenging to learn 3D features.

3. Method

In this section, we present our proposed approach, DeCoTR, powered by efficient 2D and powerful 3D attention learning. The overall pipeline of DeCoTR is shown in Fig. 2.

3.1. Problem Setup

Given an aligned sparse depth map S ∈ R^{H×W} and an RGB image I ∈ R^{H×W×3}, the goal of image-guided depth completion is to recover a dense depth map D ∈ R^{H×W} based on S and with semantic guidance from I. The underlying reasoning is that visually similar adjacent regions are likely to have similar depth values. Formally, we have

D = H(S, I),   (1)

where H is a depth completion model to be learned.

It is a common approach to employ early fusion between the depth and RGB modalities [26, 27, 29]. This has the advantage of enabling features to contain both RGB and depth information early on, so that the model can learn to rectify incorrect depth values by leveraging neighboring, similar pixels that have correct depths. We follow the same practice, first encoding the RGB image I and the sparse depth S with two separate convolutions to obtain image and depth features f_I and f_S, respectively:

f_{I} = \text{conv}_{rgb}(I), \quad f_{S} = \text{conv}_{dep}(S)   (2)

which are then concatenated channel-wise to generate the initial fused feature f_1 ∈ R^{C_1×H_1×W_1}.
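For illustration, below is a minimal PyTorch sketch of this early-fusion step (Eq. 2); the channel widths and kernel size are assumptions for the example, not the exact configuration used in DeCoTR.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Encode RGB and sparse depth separately, then concatenate channel-wise (Eq. 2)."""

    def __init__(self, rgb_channels=48, depth_channels=16):  # channel widths are assumptions
        super().__init__()
        self.conv_rgb = nn.Conv2d(3, rgb_channels, kernel_size=3, padding=1)
        self.conv_dep = nn.Conv2d(1, depth_channels, kernel_size=3, padding=1)

    def forward(self, image, sparse_depth):
        f_i = self.conv_rgb(image)            # f_I = conv_rgb(I)
        f_s = self.conv_dep(sparse_depth)     # f_S = conv_dep(S)
        return torch.cat([f_i, f_s], dim=1)   # fused feature f_1

# usage (hypothetical sizes):
# f1 = EarlyFusion()(torch.rand(1, 3, 228, 304), torch.rand(1, 1, 228, 304))
```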
3.2. Enhancing Baseline with Efficient 2D Attention

The early-fusion architecture of S2D [27] has been commonly used by researchers as a base network to predict an initial completed depth map (e.g., [26, 29]). Given the initial fused feature f_1, S2D continues to encode f_1 and generates multi-scale features f_m ∈ R^{C_m×H_m×W_m} for m = {2, ..., 5}, where C_m, H_m, W_m are the number of channels, height, and width of the feature map. A conventional decoder with convolutional and upsampling layers is used to consume these features, where the smallest feature map is directly fed to the decoder and larger ones are fed via skip connections. The decoder has two prediction branches, one producing a completed depth map and the other generating a guidance feature g.

This architecture, however, has limited accuracy and may provide erroneous depth values for subsequent operations, such as 2D spatial propagation or 3D representation learning. As such, we propose to leverage self-attention to enhance the S2D features. More specifically, we apply Multi-Headed Self-Attention (MHSA) to each f_m. Since MHSA incurs quadratic complexity in both time and memory w.r.t. the input resolution, in order to avoid the intractable costs of processing high-resolution feature maps, we first employ depth-separable convolutions [6] to summarize large feature maps and reduce their resolutions to the same size (in terms of height, width, and channel) as the smallest feature map. The downsized feature maps are denoted as f̃_m ∈ R^{C_k×H_k×W_k}, ∀m > 1. Three linear layers are used to derive the query q̃_m ∈ R^{N_k×C_k}, key k̃_m ∈ R^{N_k×C_k}, and value ṽ_m ∈ R^{N_k×C_k} for each f̃_m, where N_k = H_k · W_k. Next, we apply self-attention for each f̃_m:

\tilde{f}^{A}_m = \text{softmax}\Big(\frac{\tilde{q}_m \tilde{k}_m^{\top}}{\sqrt{C_k}}\Big)\tilde{v}_m,   (3)

where f̃^A_m denotes the attended features. These features are restored to their original resolutions by using depth-separable de-convolutions, and we denote the restored versions as f^A_m. Finally, we apply a residual addition to obtain the enhanced feature map for each scale:

f^E_m = f^{A}_m + f_m, \quad \forall m = \{2, ..., 5\}.   (4)

The enhanced feature maps are then consumed by the decoder to predict the initial completed depth map and guidance feature map. Our upgraded version of S2D with efficient attention enhancement, denoted as S2D-TR, provides significantly improved accuracy while having better efficiency than the latest transformer-based depth completion models. For instance, S2D-TR achieves a lower RMSE (0.094 vs. 0.099) with ∼50% less computation as compared to CompletionFormer without spatial propagation [45].
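The sketch below illustrates this efficient 2D attention scheme under simplified assumptions: a single depth-separable downsampling step, PyTorch's built-in nn.MultiheadAttention as a stand-in for the MHSA of Eq. (3), and a transposed convolution for the restoration step, followed by the residual addition of Eq. (4). It assumes even spatial dimensions; the module names, head count, and channel sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DepthSeparableDown(nn.Module):
    """Depthwise-separable convolution that halves the spatial resolution."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class Efficient2DAttention(nn.Module):
    """Downsize f_m, apply MHSA on the tokens (Eq. 3), restore the resolution, add residual (Eq. 4)."""

    def __init__(self, c_in, c_k, num_heads=4):  # head count is an assumption
        super().__init__()
        self.down = DepthSeparableDown(c_in, c_k)
        self.attn = nn.MultiheadAttention(c_k, num_heads, batch_first=True)
        self.up = nn.ConvTranspose2d(c_k, c_in, kernel_size=2, stride=2)

    def forward(self, f_m):
        x = self.down(f_m)                        # downsized map, C_k x H_k x W_k
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)     # N_k x C_k tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        x = attended.transpose(1, 2).reshape(b, c, h, w)
        f_a = self.up(x)                          # restore to the original resolution
        return f_a + f_m                          # f^E_m = f^A_m + f_m

# usage (hypothetical sizes; H and W must be even):
# out = Efficient2DAttention(c_in=64, c_k=32)(torch.rand(1, 64, 56, 76))
```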
3.3. Feature Cross-Attention in 3D

Considering the 3D nature of depth completion, it is important for the model to properly exploit 3D geometric information when processing the features. To enable this, we first un-project the 2D guidance feature from S2D-TR, based on the initial completed depth, to form a 3D point cloud. This is done as follows, assuming a pinhole camera model with known intrinsic parameters:

\begin{bmatrix} x\\ y\\ z\\ 1 \end{bmatrix} = d \begin{bmatrix} 1/\gamma_u & 0 & -c_u/\gamma_u & 0\\ 0 & 1/\gamma_v & -c_v/\gamma_v & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u\\ v\\ 1\\ 1/d \end{bmatrix}   (5)

where γ_u, γ_v are focal lengths, (c_u, c_v) is the principal point, and (u, v) and (x, y, z) are the 2D pixel coordinates and 3D coordinates, respectively.
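A minimal sketch of this un-projection (Eq. 5) is given below; the function name and the example intrinsics in the usage line are hypothetical.

```python
import torch

def backproject_to_points(depth, fx, fy, cx, cy):
    """Un-project a dense depth map (H x W) into an (H*W) x 3 point cloud following Eq. (5).

    fx, fy play the role of the focal lengths (gamma_u, gamma_v); (cx, cy) is the principal point.
    """
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h, dtype=depth.dtype),
                          torch.arange(w, dtype=depth.dtype), indexing="ij")
    z = depth
    x = (u - cx) / fx * z   # x = d * (u - c_u) / gamma_u
    y = (v - cy) / fy * z   # y = d * (v - c_v) / gamma_v
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

# usage with hypothetical intrinsics:
# points = backproject_to_points(torch.rand(228, 304) * 10.0, fx=500.0, fy=500.0, cx=152.0, cy=114.0)
```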
Given the large number of 3D points uplifted from the 2D feature map, it is computationally intractable to perform attention on all the points simultaneously. As such, we adopt a neighborhood-based attention, by finding the K-Nearest-Neighboring (KNN) points for each point in the point cloud, p_i ∈ R^3, which we denote as N(i).

To bake 3D geometric relationships into the feature learning process, we perform cross-attention between the feature of each point and the features of its neighboring points. Concretely, we modify the original point transformer [40, 46] to implement this. For each point p_i, linear projections are first applied to transform its guidance feature g_i to query q_i, key k_i, and value v_i. Following [46], we use vector attention, which creates attention weights to modulate individual feature channels. More specifically, the 3D cross-attention is performed as follows:

a_{ij} = w(\phi(q_i, k_j)),   (6)
g_i^{a} = \sum_{j\in\mathcal{N}(i)} \text{softmax}(A_i)_j \odot v_j,   (7)

where ϕ is a relation function to capture the similarity between a pair of input point features (we use subtraction here), w is a learnable encoding function that computes attention scores to re-weight the channels of the value, A is the attention weight matrix whose entries are a_{ij} for points p_i and p_j, g_i^a denotes the output feature after cross-attention for p_i, and ⊙ denotes the Hadamard product. We perform such 3D cross-attention in multiple transformer layers, to which we refer as 3D-TR layers.
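Below is a minimal sketch of this KNN-based vector cross-attention (Eqs. 6–7). It uses subtraction as the relation function ϕ and a small MLP as the learnable encoding w; the positional embedding is omitted for brevity, and the neighborhood size and layer names are assumptions.

```python
import torch
import torch.nn as nn

class NeighborhoodVectorAttention(nn.Module):
    """Vector cross-attention over KNN neighborhoods (Eqs. 6-7); positional encoding omitted for brevity."""

    def __init__(self, dim, k=16):  # neighborhood size is an assumption
        super().__init__()
        self.k = k
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # learnable encoding function w(.) that produces per-channel attention scores
        self.weight_fn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, points, feats):
        # points: (N, 3) coordinates, feats: (N, C) guidance features
        dists = torch.cdist(points, points)                   # pairwise distances
        knn_idx = dists.topk(self.k, largest=False).indices   # indices of N(i), including the point itself
        q = self.to_q(feats)                                  # (N, C)
        k = self.to_k(feats)[knn_idx]                         # (N, k, C)
        v = self.to_v(feats)[knn_idx]                         # (N, k, C)
        rel = q.unsqueeze(1) - k                              # phi(q_i, k_j): subtraction relation
        attn = torch.softmax(self.weight_fn(rel), dim=1)      # softmax over the neighborhood
        return (attn * v).sum(dim=1)                          # g_i^a: channel-wise re-weighted aggregation

# usage: out = NeighborhoodVectorAttention(64)(torch.rand(1024, 3), torch.rand(1024, 64))
```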
While it is possible to directly use existing point transformers off the shelf, we find that this is not optimal for depth completion. Specifically, we incorporate the following technical designs to improve the 3D feature learning process.

Point cloud normalization: We normalize the constructed point cloud from the S2D-TR outputs into a unit ball, before proceeding to the 3D attention layers. We find this technique effectively improves depth completion, as we shall show in the experiments.

Positional embedding: Instead of the positional embedding multiplier proposed in [40], we adopt the conventional one based on relative position difference. We find the more complex positional embedding multiplier does not benefit the learning and incurs additional computational cost.
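A possible implementation of the unit-ball normalization is sketched below; the paper does not spell out the exact centering and scaling, so mean-centering followed by max-radius scaling is an assumption.

```python
import torch

def normalize_to_unit_ball(points):
    """Center the point cloud and scale it into a unit ball before the 3D attention layers."""
    centered = points - points.mean(dim=0, keepdim=True)
    radius = centered.norm(dim=-1).max().clamp(min=1e-8)
    return centered / radius

# usage: normalized = normalize_to_unit_ball(torch.rand(1024, 3) * 10.0)
```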
3.4. Capturing Global Context in 3D

The 3D cross-attention discussed previously updates each point feature only based on the point's estimated 3D neighborhood, in order to maintain computational tractability given the quadratic complexity of attention w.r.t. the number of points. However, global or long-range scene context is also important for the model to develop accurate 3D understanding. To enable global understanding while keeping computation costs under control, we propose to perform global 3D cross-attention only on a downsampled point set, at the last encoding stage of the point transformer. In this case, we use the scalar attention as follows:

g_i^{ga} = \sum_{j\neq i}\text{softmax}\Big(\frac{\langle q_i, k_j\rangle}{\sqrt{C_g}}\Big) v_j,   (8)

where ⟨·⟩ denotes the dot product and C_g is the embedding dimension. We apply the global attention after the local neighborhood-based attentions.
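The following sketch illustrates the scalar global attention of Eq. (8) on an already-downsampled point set; for simplicity it keeps the diagonal (j = i) term, and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class GlobalScalarAttention(nn.Module):
    """Scaled dot-product (scalar) attention over a downsampled point set (Eq. 8)."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (M, C_g) features of the downsampled points
        q, k, v = self.to_q(feats), self.to_k(feats), self.to_v(feats)
        attn = torch.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)  # <q_i, k_j> / sqrt(C_g)
        return attn @ v                                                 # g_i^{ga}

# usage: g = GlobalScalarAttention(64)(torch.rand(256, 64))
```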
3.5. Training

We train DeCoTR with a masked ℓ1 loss between the final completed depth maps and the ground-truth depth maps, following standard practice as in [26, 29]. More formally, the loss is given by

\mathcal{L}(D^{gt}, D^{pred}) = \frac{1}{N}\sum_{i,j}\mathbb{I}_{\{d_{i,j}^{gt}>0\}}\Big|d_{i,j}^{gt} - d_{i,j}^{pred}\Big|,   (9)

where \mathbb{I} is the indicator function, d_{i,j}^{gt} ∈ D^{gt} and d_{i,j}^{pred} ∈ D^{pred} represent pixel-wise depths in the ground-truth and predicted depth maps, respectively, and N is the total number of valid pixels.
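A minimal sketch of the masked ℓ1 loss of Eq. (9); the function name is illustrative.

```python
import torch

def masked_l1_loss(pred, gt):
    """Masked L1 loss (Eq. 9): average absolute error over pixels with valid ground truth (d_gt > 0)."""
    valid = gt > 0
    n_valid = valid.sum().clamp(min=1)
    return (pred[valid] - gt[valid]).abs().sum() / n_valid

# usage: loss = masked_l1_loss(torch.rand(2, 1, 228, 304), torch.rand(2, 1, 228, 304))
```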
4. Experiments

We conduct extensive experiments to evaluate our proposed DeCoTR on standard depth completion benchmarks and compare with the latest state-of-the-art (SOTA) solutions. We further perform zero-shot evaluation to assess model generalizability and carry out ablation studies to analyze different parts of our proposed approach.

4.1. Experimental Setup

Datasets: We perform standard depth completion evaluations on NYU Depth v2 (NYUD-v2) [36] and KITTI Depth Completion (KITTI-DC) [13, 14], and generalization tests on ScanNet-v2 [7] and DDAD [15]. These datasets cover a variety of indoor and outdoor scenes. We follow the sampling settings from existing works to create the input sparse depth [26, 29].

NYUD-v2 provides RGB images and depth maps captured by a Kinect device from 464 different indoor scenes. We use the official split: 249 scenes for training and the remaining 215 for testing. Following the common practice [26, 29, 45], we sample ∼50,000 images from the training set and resize the images from 480 × 640 first to half size and then to 228 × 304 with center cropping. We use the official test set of 654 images for evaluation.

KITTI is a large real-world dataset in the autonomous driving domain, with over 90,000 paired RGB images and LiDAR depth measurements. There are two versions of the KITTI dataset used for depth completion. One is from [27], which consists of 46,000 images from the training sequences for training and a random subset of 3,200 images from the test sequences for evaluation. The other is the KITTI Depth Completion (KITTI-DC) dataset, which provides 86,000 training, 6,900 validation, and 1,000 testing samples with corresponding raw LiDAR scans and reference images. We use KITTI-DC to train and test our model on the official splits.

ScanNet-v2 contains 1,513 room scans reconstructed from RGB-D frames. The dataset is divided into 1,201 scenes for training and 312 for validation, and provides an additional 100 scenes for testing. For sparse input depths, we sample point clouds from the vertices of the reconstructed meshes. We use the 100 test scenes to evaluate depth completion performance, with 20 frames randomly selected per scene. We remove samples where more than 10% of the ground-truth depth values are missing, resulting in 745 test frames across all 100 test scenes.

DDAD is an autonomous driving dataset collected in the U.S. and Japan using a synchronized 6-camera array, featuring long-range (up to 250m) and diverse urban driving scenarios. Following [15], we downsample the images from the original resolution of 1216 × 1936 to 384 × 640. We use the official 3,950 validation samples for evaluation. Since less than 5% of the ground-truth depth values remain valid after downsampling, for our method and all compared methods we sample all available valid depth points so that reasonable results are generated.

Implementation Details: We implement our proposed approach using PyTorch [30]. We use the Adam [21] optimizer with an initial learning rate of 5 × 10^{-4}, β1 = 0.9, β2 = 0.999, and no weight decay. The per-GPU batch size is set to 8 for NYUD-v2 and 4 for KITTI-DC. All experiments are conducted on 8 NVIDIA A100 GPUs.
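As a reference, the reported optimizer configuration can be set up as follows; the model below is only a placeholder, and the learning-rate schedule, data loading, and training loop are omitted.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder for the DeCoTR network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), weight_decay=0.0)
```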
Evaluation: We use standard metrics to evaluate depth completion performance [12], including Root Mean Squared Error (RMSE), Absolute Relative Error (Abs Rel), δ < 1.25, δ < 1.25², and δ < 1.25³. On the KITTI-DC test set, we use the official metrics: RMSE, MAE, iRMSE, and iMAE. We refer readers to the supplementary file for detailed mathematical definitions of these metrics. The depth values are evaluated with maximum distances of 80 meters and 200 meters for KITTI and DDAD, respectively, and 10 meters for NYUD-v2 and ScanNet.
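For reference, a sketch of how these metrics are typically computed over valid pixels is given below; the exact definitions used in the paper are in its supplementary file, so this is only an approximation.

```python
import torch

def depth_metrics(pred, gt, max_depth=10.0):
    """Compute RMSE, Abs Rel, and the delta < 1.25^i accuracies over valid pixels."""
    valid = (gt > 0) & (gt <= max_depth)
    pred, gt = pred[valid], gt[valid]
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    abs_rel = ((pred - gt).abs() / gt).mean()
    ratio = torch.max(pred / gt, gt / pred)
    deltas = [(ratio < 1.25 ** i).float().mean() for i in (1, 2, 3)]
    return rmse, abs_rel, deltas

# usage: rmse, abs_rel, deltas = depth_metrics(torch.rand(1, 228, 304) * 10 + 0.1,
#                                              torch.rand(1, 228, 304) * 10 + 0.1)
```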
4.2. Results on NYUD-v2 and KITTI

On NYUD-v2: Table 1 summarizes the quantitative evaluation results on NYUD-v2. Our proposed DeCoTR approach sets the new SOTA performance, with the lowest RMSE of 0.086, outperforming all existing solutions. Even without 3D global attention, DeCoTR already provides the best accuracy, and global attention further improves it. Specifically, our DeCoTR considerably outperforms the latest SOTA methods that also leverage 3D representation and/or transformers, such as GraphCSPN, PointDC, and CompletionFormer. Note that although PointDC uses both a 3D representation and transformers, it obtains only slightly lower RMSE than methods that use neither (e.g., CompletionFormer, GraphCSPN). This indicates that the PointDC approach is suboptimal, potentially due to the extremely sparse 3D points.

Fig. 3 provides sample qualitative results on NYUD-v2. We see that DeCoTR generates highly accurate dense depth maps that are very close to the ground truth. The depth maps produced by DeCoTR capture much finer details as compared to existing SOTA methods.
Method RMSE ↓ Abs Rel ↓ δ < 1.25 ↑ δ < 1.252 ↑ δ < 1.253 ↑
S2D [27] 0.204 0.043 97.8 99.6 99.9
DeepLiDAR [31] 0.115 0.022 99.3 99.9 100.0
CSPN [4] 0.117 0.016 99.2 99.9 100.0
DepthNormal [41] 0.112 0.018 99.5 99.9 100.0
ACMNet [47] 0.105 0.015 99.4 99.9 100.0
GuideNet [37] 0.101 0.015 99.5 99.9 100.0
TWISE [19] 0.097 0.013 99.6 99.9 100.0
NLSPN [29] 0.092 0.012 99.6 99.9 100.0
RigNet [42] 0.090 0.013 99.6 99.9 100.0
DySPN [24] 0.090 0.012 99.6 99.9 100.0
CompletionFormer [45] 0.090 0.012 - - -
PRNet [23] 0.104 0.014 99.4 99.9 100.0
CostDCNet [20] 0.096 0.013 99.5 99.9 100.0
PointFusion [18] 0.090 0.014 99.6 99.9 100.0
GraphCSPN [26] 0.090 0.012 99.6 99.9 100.0
PointDC [44] 0.089 0.012 99.6 99.9 100.0
DeCoTR (ours) 0.087 0.012 99.6 99.9 100.0
DeCoTR w/ GA (ours) 0.086 0.012 99.6 99.9 100.0
Table 1. Quantitative evaluation of depth completion performance on NYU-Depth-v2. GA denotes global attention. RMSE and REL are
in meters. Methods in the top part of the table focus on feature learning and processing in 2D and those in the bottom block exploit 3D
representation. Best and second best numbers are highlighted in bold and underlined, respectively, for RMSE and Abs Rel.
Figure 3. Qualitative results on NYUD-v2. We compare with SOTA methods such as NLSPN, GraphCSPN, and CompletionFormer. Areas
where DeCoTR provides better depth accuracy are highlighted.
For instance, in the second example, our proposed approach accurately predicts the depth on the faucet despite its small size in the images and the low contrast, while other methods struggle.

On KITTI-DC: We evaluate DeCoTR and compare with existing methods (including the latest SOTA) on the official KITTI test set, as shown in Table 2. DeCoTR achieves SOTA depth completion accuracy and is among the top-ranking methods on the KITTI-DC leaderboard.¹ We see that DeCoTR performs significantly better than existing SOTA methods that leverage 3D representations, e.g., GraphCSPN, PointDC. This indicates that DeCoTR has the right combination of dense 3D representation and transformer-based learning.

Fig. 4 shows visual examples of our completed depth maps on KITTI. DeCoTR is able to generate correct depth predictions where NLSPN produces erroneous depth values; see the highlighted areas in the figure. For instance, in the second example, DeCoTR accurately estimates the depth around the upper edge of the truck, while the depth map by NLSPN is blurry in that region.

¹ Top-5 among published methods at the time of submission, in terms of iRMSE, iMAE, and MAE.

4.3. Zero-Shot Testing on ScanNet and DDAD

Most existing papers only evaluate their models on NYUD-v2 and KITTI, without looking into model generalizability. In this part, we perform cross-dataset evaluation. More specifically, we run zero-shot testing of NYUD-v2-trained models on ScanNet-v2 and KITTI-trained models on DDAD. This allows us to understand how well our DeCoTR as well as existing SOTA models generalize to data not seen in training.

Tables 3 and 4 present evaluation results on ScanNet-v2 and DDAD, respectively. We see that DeCoTR generalizes better to unseen datasets when compared to existing SOTA models.
Method RMSE ↓ MAE ↓ iRMSE ↓ iMAE ↓
CSPN [4] 1019.64 279.46 2.93 1.15
TWISE [19] 840.20 195.58 2.08 0.82
ACMNet [47] 744.91 206.09 2.08 0.90
GuideNet [37] 736.24 218.83 2.25 0.99
NLSPN [29] 741.68 199.59 1.99 0.84
PENet [17] 730.08 210.55 2.17 0.94
GuideFormer [32] 721.48 207.76 2.14 0.97
RigNet [42] 712.66 203.25 2.08 0.90
DySPN [24] 709.12 192.71 1.88 0.82
CompletionFormer [45] 708.87 203.45 2.01 0.88
PRNet [23] 867.12 204.68 2.17 0.85
FuseNet [3] 752.88 221.19 2.34 1.14
PointFusion [18] 741.9 201.10 1.97 0.85
GraphCSPN [26] 738.41 199.31 1.96 0.84
PointDC [44] 736.07 201.87 1.97 0.87
DeCoTR (ours) 717.07 195.30 1.92 0.84
Table 2. Quantitative evaluation of depth completion performance on official KITTI Depth Completion test set. RMSE and MAE are in
millimeters, and iRMSE and iMAE are in 1/km. Similar to Table 1, methods in the top part focus on feature learning in 2D and those in
the bottom block exploit 3D representation. Best and second best numbers are highlighted in bold and underlined, respectively.
Figure 4. Qualitative results on KITTI DC. Areas where DeCoTR provides better depth accuracy are highlighted.
It is noteworthy that on DDAD, DeCoTR has significantly lower depth errors as compared to both NLSPN and CompletionFormer, despite CompletionFormer having slightly lower RMSE on the KITTI-DC test set. Moreover, in this case, CompletionFormer has even worse accuracy than NLSPN, indicating its poor generalizability.

Fig. 5 shows sample visual results of zero-shot depth completion on ScanNet-v2. DeCoTR generates highly accurate depth maps and captures fine details, e.g., the arm rest in the first example and the lamp in the second example. Other methods cannot recover the depths accurately. Fig. 6 provides qualitative results on DDAD for CompletionFormer and our DeCoTR. While this is a challenging test setting given the much larger depth range in DDAD, DeCoTR still predicts reasonable depths. In contrast, it can be seen that CompletionFormer performs very poorly on DDAD. We notice that DeCoTR's predictions are more accurate in the nearer range (e.g., on cars) and less so when the scene is far away (e.g., on trees), since KITTI training only covers depths up to 80 meters whereas DDAD has depths up to 200 meters. This is also confirmed by the lower-than-KITTI RMSE and higher-than-KITTI MAE numbers of DeCoTR on DDAD.

Method RMSE ↓ δ < 1.25 ↑
NLSPN [29] 0.198 97.3
GraphCSPN [26] 0.197 97.3
CompletionFormer [45] 0.194 97.3
DeCoTR (ours) 0.188 97.6
Table 3. Zero-shot testing on ScanNet-v2 using models trained on NYUD-v2. Best numbers are highlighted in bold.
Figure 5. Qualitative results of zero-shot inference on ScanNet-v2. Areas where DeCoTR provides better depth accuracy are highlighted.