Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

4 months ago·arXiv

Abstract

We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complimentary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million of images from thousands of camera models with different-type annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method current ranks the 1st on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant-depth as well as surface-normal prediction, shown in Fig. 1. Notably, we surpassed the ultra-recent MarigoldDepth and DepthAnything on various depth benchmarks including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 3), leading to high-quality metric scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.

1 INTRODUCTION

Monocular metric depth and surface normal estimation is the task of predicting absolute distance and surface direction from a single image. As crucial 3D representations, depth and normals are geometrically related and highly complementary. While metric depth excels in capturing data at scale, surface normals offer superior preservation of local geometry and are devoid of metric ambiguity compared to metric depth. These unique attributes render both depth and surface normals indispensable in various computer vision applications, including 3D reconstruction [1], [2], [3], neural rendering (NeRF) [4], [5], [6], [7], autonomous driving [8], [9], [10], and robotics [11], [12], [13]. Currently, the community still lacks a robust, generalizable geometry foundation model [14], [15], [16] capable of producing high-quality metric depth and surface normal from a single image.

Metric depth estimation and surface normal estimation confront distinct challenges. Existing depth estimation methods are categorized into learning metric depth [17],

Fig. 1 – Comparisons with SoTA methods on 16 depth and normal benchmarks. Radar-map of our Metric3D V2 v.s. SoTA methods from different works, on (1) Metric depth benchmarks, see ‘(Metric-depth)’. (2) Affine-invariant depth benchmarks, see ‘(Affine-invariant-depth)’. (3) Surface normal benchmarks, see ‘(Normal)’. Zero-shot testing is denoted by ‘ ♠’. Here δ1 percentage accuracy is used for depth benchmarks and 30◦ percentage accuracy is for normal. Both higher values are for better performance. We establish new SoTA on a wide range of depth and normal benchmarks.

Fig. 2 – Surface normal (N) and monocular depth (D) comparisons on diverse web images. Our method, directly estimating metric depths and surface normals, shows powerful generalization in a variety of scenarios, including indoor, outdoor, poorvisibility, motion blurred, and fisheye images. Visualized results come from our ViT-large-backbone estimator. Marigold is a strong and robust diffusion-based monocular depth estimation method, but its recovered surface normals from the depth show various artifacts.

[18], [19], [20], relative depth [21], [22], [23], [24], and affine-invariant depth [25], [26], [27], [28], [29]. Although the metric depth methods [17], [18], [19], [20], [30] have achieved impressive accuracy on various benchmarks, they must train and test on the dataset with the same camera intrinsics. Therefore, the training datasets of metric depth methods are often small, as it is hard to collect a large dataset covering diverse scenes using one identical camera. The consequence is that all these models generalize poorly in zero-shot testing, not to mention the camera parameters of test images can vary too. A compromise is to learn the relative depth [21], [23], which only represents one point being further or closer to another one. The application of relative depth is very limited. Learning affine-invariant depth finds a trade-off between the above two categories of methods, i.e. the depth is up to an unknown scale and shift. With large-scale data, they decouple the metric information during training and achieve impressive robustness and generalization ability, such as MiDaS [27], DPT [28], LeReS [25], [26], HDN [29]. The problem is the unknown shift will cause 3D reconstruction distortions [26] and non-metric depth cannot satisfy various downstream applications.

In the meantime, these models cannot generate surface normals. Although lifting depths to 3D point clouds can do so, it places high demands on the accuracy and fine details of predicted depths. Otherwise, various artifacts will remain in such transformed normals. For example, Fig. 2 shows noisy normals from Marigold [31] depths, which excels in producing high-resolution fine depths. Instead of direct transformation, state-of-the-art (SoTA) surface normal estimation methods [32], [33], [34] tend to train estimators on high-quality normal annotations. These annotations, unlike sensor-captured ground-truth (GT), are derived from meticulously and densely reconstructed scenes, which have extremely rigorous requirements for both the capturing equipment and the scene. Consequently, data sources primarily consist of either synthetic creation or 3D indoor reconstruction [35]. Real and diverse outdoor scenes are exceedingly rare. (refer to our data statistics in Tab. 5). Limited by this label deficiency, SoTA surface normal methods [32], [33], [34] typically struggle with strong zero-shot generalization. This work endeavors to tackle these challenges by developing a multi-task foundation model for zero-shot, single view, metric depth, and surface normal estimation.

We propose targeted solutions for the challenges of zero-shot metric depth and surface normal estimation. For metric-scale recovery, we first analyze the metric ambiguity issues in monocular depth estimation and study different camera parameters in depth, including the pixel size, focal length, and sensor size. We observe that the focal length is the critical factor for accurate metric recovery. By design, affine-invariant depth methods do not take the focal length information into account during training. As shown in Sec. 3.1, only from the image appearance, various focal lengths may cause metric ambiguity, thus they decouple the depth scale in training. To solve the problem of varying focal lengths, CamConv [38] encodes the camera model in the network, which enforces the network to implicitly understand camera models from the image appearance and then bridges the imaging size to the real-world size. However, training data contains limited images and types of cameras, which challenges data diversity and network capacity. We

Fig. 3 – Top (metrology for a complex scene): we use two phones (iPhone14 pro and Samsung Galaxy S23) to capture the scene and measure the size of several objects, including a drone which has never occurred in the whole training set. With the photos’ metadata, we perform 3D metric reconstruction and then measure object sizes (marked in red), which are close to the ground truth (marked in yellow). Compared with ZoeDepth [36], our measured sizes are closer to ground truth. Bottom (dense SLAM mapping): existing SoTA mono-SLAM methods usually face scale drift problems (see the red arrows) in large-scale scenes and are unable to achieve the metric scale, while, naively inputting our metric depth, Droid-SLAM [37] can recover much more accurate trajectory and perform the metric dense mapping (see the red measurements). Note that all testing data are unseen to our model.

propose a canonical camera transformation method in training, inspired by the canonical pose space from human body reconstruction methods [39]. We transform all training data to a canonical camera space where the processed images are coarsely regarded as captured by the same camera. To achieve such transformation, we propose two different methods. The first one tries to adjust the image appearance to simulate the canonical camera, while the other one transforms the ground-truth labels for supervision. Camera models are not encoded in the network, making our method easily applicable to existing architectures. During inference, a de-canonical transformation is employed to recover metric information. To further boost the depth accuracy, we propose a random proposal normalization loss. It is inspired by the scale-shift invariant loss [25], [27], [29] decoupling the depth scale to emphasize the single image’s distribution. However, they perform on the whole image, which inevitably squeezes the fine-grained depth difference. We propose to randomly crop several patches from images and enforce the scale-shift invariant loss [25], [27] on them. Our loss emphasizes the local geometry and distribution of the single image.

For surface normal, the biggest challenge is the lack of diverse (outdoor) annotations. Compared to reconstructionbased annotation methods [35], [40], directly producing normal labels from network-predicted depth is more efficient and scalable. The quality of such pseudo-normal labels, however, is bounded by the accuracy of the depth network. Fortunately, we observe that robust metric depth models are scalable geometric learners, containing abundant information for normal estimation. Weak supervision from the pseudo normal annotations transformed by learned metric depth can effectively prevent the normal estimator from collapsing caused by GT absence. Furthermore, this supervision can guide the normal estimator to generalize on large-scale unlabeled data. Based on such observation, we propose a joint depth-normal optimization module to distill knowledge from diverse depth datasets. During optimization, our normal estimator learns from three sources: (1) Groundtruth normal labels, though they are much fewer compared to depth annotations (2) An explicit learning objective to constrain depth-normal consistency. (3) Implicit and thorough knowledge transfer from depth to normal through feature fusion, which is more tolerant to unsatisfactory initial prediction than the explicit counterparts [41], [42]. To achieve this, we implement the optimization module using deep recurrent blocks. While previous researchers have employed similar recurrent modules to optimize depth [42], [43], [44], disparity [45], ego-motion [37], or optical flows [46], it is the first time that normal is iteratively optimized together with depth in a learning-based scheme. Benefiting from the joint optimization module, our models can efficiently learn normal knowledge from large-scale depth datasets even without labels.

With the proposed method, we can stably scale up model training to 16 million images from 18 datasets of diverse scene types (indoor and outdoor, real or synthetic data), camera models (tens of thousands of different cameras), and annotation categories (with or without normal), leading to zero-shot transferability and significantly improved

Fig. 4 – Overall methodology. Our method takes a single image to predict the metric depth and surface normal simultaneously. We apply large-scale data training directly for metric depth estimation rather than affine invariant depth, enabling end-to-end zero-shot metric depth estimation for various applications using a single model. For normals, we enable learning from depth labels only, alleviating the demand for dense reconstruction to generate large-scale normal labels.

accuracy. Fig. 4 illustrates how the large-scale data with depth annotations directly facilitate metric depth and surface normal learning. The metric depth and normal given by our model directly broaden the applications in downstream tasks. We achieve state-of-the-art performance on over 16 depth and normal benchmarks, see Fig. 1. Our model can accurately reconstruct metric 3D from randomly collected Internet images, enabling plausible single-image metrology. As an example (Fig. 3), with the predicted metric depths from our model, we significantly reduce the scale drift of monocular SLAM [37], [47] systems, achieving much better mapping quality with real-world metric recovery. Our model also enables large-scale 3D reconstruction [48]. To summarize, our main contributions are:

• We propose a canonical and de-canonical camera transformation method to solve the metric depth ambiguity problems from various camera settings. It enables the learning of strong zero-shot monocular metric depth models from large-scale datasets.

• We propose a random proposal normalization loss to effectively boost the depth accuracy;

• We propose a joint depth-normal optimization module to learn normal on large-scale datasets without normal annotation, distilling knowledge from the metric depth estimator.

• Our models rank 1st on a wide variety of depth and surface normal benchmarks. It can perform high-quality 3D metric structure recovery in the wild and benefit several downstream tasks, such as monoSLAM [37], [49], 3D scene reconstruction [48], and metrology [50].

2 RELATED WORK

3D reconstruction from a single image. Reconstructing various objects from a single image has been well studied [51], [52], [53]. They can produce high-quality 3D models of cars, planes, tables, and human body [54], [55]. The main challenge is how to best recover objects’ details, how to represent them with limited memory, and how to generalize to more diverse objects. However, all these methods rely on learning priors specific to a certain object class or instance, typically from 3D supervision, and can therefore not work for full scene reconstruction. Apart from these reconstructing objects works, several works focus on scene reconstruction from a single image. Saxena et al. [56] construct the scene based on the assumption that the whole scene can be segmented into several small planes. With planes’ orientation and location, the 3D structure can be represented. Recently, LeReS [25] proposed to use a strong monocular depth estimation model to do scene reconstruction. However, they can only recover the shape up to a scale. Zhang et al. [57] recently proposed a zero-shot geometry-preserving depth estimation model that is capable of making depth predictions up to an unknown scale, without requiring scale-invariant depth annotations for training. In contrast to these works, our method can recover the metric 3D structure.

Supervised monocular depth estimation. After several benchmarks [58], [59] are established, neural network based methods [17], [19], [30] have dominated since then. Several approaches regress the continuous depth from the aggregation of information in an image [60]. As depth distribution corresponding to different RGBs can vary to a large extent, some methods [19], [30] discretize the depth and formulate this problem to a classification [18], which often achieves better performance. The generalization issue of deep models for 3D metric recovery is related to two problems. The first one is to generalize to diverse scenes, while the other one is how to predict accurate metric information under various camera settings. The first problem has been well addressed by recent methods. Some works [18], [21], [22] propose to construct a large-scale relative depth dataset, such as DIW [24] and OASIS [23], and then they target learning the relative relations. However, the relative depth loses geometric structure information. To improve the recovered geometry quality, learning affine-invariant depth methods, such as MiDaS [27], LeReS [25], and HDN [29] are proposed. By mixing large-scale data, state-of-the-art performance and the generalization over scenes are improved continuously. Note that by design, these methods are unable to recover the metric information. How to achieve both strong generalization and accurate metric information over diverse scenes is the key problem that we attempt to tackle.

Surface normal estimation. Compared to metric depth, surface normal suffers no metric ambiguity and preserves local geometry better. These properties attract researchers to apply normal in various vision tasks like localization [11], mapping [61], and 3D scene reconstruction [6], [62]. Currently, learning-based methods [32], [33], [34], [42], [62], [63], [64], [65], [66], [67], [68], [69], [70] have dominated monocular surface normal estimation. Since normal labels required for training cannot be directly captured by sensors, previous works use [41], [58], [63], [65] kernel functions to annotate normal from dense indoor depth maps [58]. These annotations become incomplete on reflective surfaces and inaccurate at object boundaries. To learn from such imperfect annotations, GeoNet [41] proposes to enforce depth-normal consistency with mutual transformation modules, ASN [69], [70] propose a novel adaptive surface normal constraint to facilitate joint depth-normal learning, and Bae et al. [33] propose an uncertainty-based learning objective. Nonetheless, it is challenging for such methods to further increase their generalization, due to the limited dataset size and the diversity of scenes, especially for outdoor scenarios. Omni-data [35] advances to fill this gap by building 1300M frames of normal annotation. Normal-in-the-wild [71] proposes a pipeline for efficient normal labeling. However, further scaling up normal labels remains difficult. This underscores research significance in finding an efficient way to distill prior from other types of annotation. Deep iterative refinement for geometry. Iterative refinement enables multi-step coarse-to-fine prediction and benefits a wide range of geometry estimation tasks, such as optical flow estimation [46], [72], [73], depth completion [43], [74], [75], and stereo matching [45], [76], [77]. Classical iterative refinements [72], [74] optimize directly on high-resolution outputs using high-computing-cost operators, limiting researchers from applying more iterations for better predictions. To address this limitation, RAFT [46] proposes to optimize an intermediate low-resolution prediction using ConvGRU modules. For monocular depth estimation, IEBins [44] employs similar methods to optimize depth-bin distribution. Differently, IronDepth [42] propagates depth on pre-computed local surfaces. Regarding surface normal refinement, Lenssen et al. [78] propose a deep iterative method to optimize normal from point clouds. Zhao et al. [79] design a solver to refine depth and normal jointly, but it requires multi-view prior and per-sample post optimization. Without multi-view prior, such a non-learnable optimization method could fail due to unsatisfactory initial predictions. All the monocular methods [42], [44], [78], however, iterate over either depth or normal independently. In contrast, our joint optimization module tightly couples depth and normal with each other.

Large-scale data training. Recently, various natural language problems and computer vision problems [80], [81], [82] have achieved impressive progress with large-scale data training. CLIP [81] is a promising classification model, which is trained on billions of paired image and language description data. It achieves state-of-the-art performance over several classification benchmarks by zero-shot testing. Dinov2 [83] collects 142M images to conduct vision-only self-supervised learning for vision transformers [84]. Generative models like LDM [85] have also undergone billionlevel data pre-training. For depth prediction, large-scale data training has been widely applied. Ranft et al. [27] mix over 2 million data in training, LeReS [26] collects over 300 thousands data, Eftekhar et al. [35] also merge millions of data to build a strong depth prediction model. For surface normal estimation, Omni-data [35] performs dense reconstruction to generate 14M frames with surface normal annotations, aggregating data from six different datasets. Furthermore, it established multi-task training involving these annotations.

3 METHOD

Preliminaries. We consider the pin-hole camera model with intrinsic parameters formulated as: [[ ˆf/δ, 0, u0], [0, ˆf/δ, v0], [ 0, 0, 1]], where ˆf is the focal length (in micrometers), δ is the pixel size (in micrometers), and (u0, v0) is the principle center. f = ˆf/δ is the pixel-represented focal length.

3.1 Ambiguity Issues in Metric Depth Estimation

Fig. 5 presents an example of photos taken by different cameras and at different distances. Only from the image’s appearance, one may think the last two photos are taken at a similar location by the same camera. In fact, due to different focal lengths, these are captured at different locations. Thus, camera intrinsic parameters are critically important for the metric estimation from a single image, as otherwise, the problem is ill posed. To avoid such metric ambiguity, recent methods, such as MiDaS [27] and LeReS [25], decouple the metric from the supervision and compromise learning the affine-invariant depth.

Fig. 6 (A) shows a simple pin-hole perspective projection. Object A locating at is projected to . Based on the principle of similarity, we have the equation:

where ˆS and ˆS′ are the real and imaging size respectively. denotes variables are in the physical metric (e.g., millimeter). To recover from a single image, focal length, imaging size of the object, and real-world object size must be available. Estimating the focal length from a single image is a challenging and ill-posed problem. Although several methods [25], [86] have explored, the accuracy is still far from being

Fig. 5 – Photos of a chair captured at different distances with different cameras. The first two photos are captured at the same distance but with different cameras, while the last one is taken at a closer distance with the same camera as the first one.

satisfactory. Here, we simplify the problem by assuming the focal length of a training/test image is available. In contrast, understanding the imaging size is much easier for a neural network. To obtain the real-world object size, a neural network needs to understand the semantic scene layout and the object, at which a neural network excels. We define α = ˆf/ ˆS′, so da is proportional to

We make the following observations regarding sensor size, pixel size, and focal length. O1: Sensor size and pixel size do not affect the metric depth estimation. Based on the perspective projection (Fig. 6 (A)), the sensor size only affects the field of view (FOV) and is irrelevant to , thus does not affect the metric depth estimation. For the pixel size, we assume two cameras with different pixel sizes ( δ1 = 2δ2 ) but the same focal length to capture the same object locating at da. Fig. 6 (B) shows their captured photos. According to the preliminaries, the pixel-represented focal length f1 = 12f2. As the second camera has a smaller pixel size, although in the same projected imaging size ˆS′ , the pixel-represented image resolution is S′1 = 12S′2 . According to Eq. (1), ˆfδ1·S′1 = ˆfδ2·S′2 , i.e. α1 = α2, so d1 = d2 . Therefore, different camera sensors would not affect the metric depth estimation. O2: The focal length is vital for metric depth estimation. Fig. 5 shows the metric ambiguity issue caused by the unknown focal length. Fig. 7 illustrates this. If two cameras ( ˆf1 = 2 ˆf2 ) are at distances d1 = 2d2 , the imaging sizes on cameras are the same. Thus, only from the appearance, the network will be confused when supervised with different labels. Based on this observation, we propose a canonical camera transformation method to solve the supervision and image appearance conflicts.

Unlike depth, surface normal does not have any metric ambiguity problem. In Fig. 8, we illustrate this concept with two depth maps at varying scales, denoted as D1 and D2, featuring distinct metrics d1 and d2, respectively, where

Fig. 6 – Pinhole camera model. (A) Object A at the distance is projected to the image plane. (B) Using two cameras to capture the car. The left one has a larger pixel size. Although the projected imaging sizes are the same, the pixel-represented images (resolution) are different.

Fig. 7 – Illustration of two cameras with different focal length at different distance. As f1 = 2f2 and d1 = 2d2, A is projected to two image planes with the same imaging size (i.e. A′1 = A′2).

¡ . After upprojecting the depth to the 3D point cloud, the dolls are in different distances, i.e. d1 and d2 and d1 < d2. However, despite the variation in depth scales, the surface normals n1 and n2 corresponding to a certain pixel A′ ∈ I remain the same.

Fig. 8 – The metric-agnostic property of normal. With differently predicted metrics d1 and d2 , the pixel A′ on the image will be back-projected to 3D points A1 and A2, respectively. The surface normal n 1 at A1 and n2 at A2 remain the same.

3.2 Canonical Camera Transformation

The core idea is to set up a canonical camera space ( (f cx, f cy), f cx = f cy = f c in experiments) and transform all training data to this space. Consequently, all data can roughly be regarded as captured by the canonical camera. We propose two transformation methods, i.e. either transforming the input image ( I ∈ RH×W ×3 ) or the ground-truth (GT) label ( D ∈ RH×W ). The original intrinsics are {f, u0, v0}.

Method1: transforming depth labels (CSTM label). Fig. 5’s ambiguity is for depths. Thus our first method directly transforms the ground-truth depth labels to solve this problem. Specifically, we scale the ground-truth depth () with the ratio ωd = f cf in training, i.e., D∗c = ωdD∗. The original camera model is transformed to {f c, u0, v0}. In inference, the predicted depth () is in the canonical space and needs to perform a de-canonical transformation to recover the metric information, i.e., D = 1ωd Dc. Note the input I does not perform any transformation, i.e., Ic = I.

Method2: transforming input images (CSTM image). From another view, the ambiguity is caused by the similar image appearance. Thus this method is to transform the input image to simulate the canonical camera imaging effect. Specifically, the image I is resized with the ratio ωr = f cf , i.e., Ic = T(I, ωr), where T(·) denotes image resize. The optical center is resized, thus the canonical camera model is {f c, ωru0, ωrv0} . The ground-truth labels are resized without any scaling, i.e., D∗c = T(D∗, ωr) . In inference, the de- canonical transformation is to resize the prediction to the original size without scaling, i.e., D = T(Dc, 1ωr ).

Fig. 9 – Pipeline. Given an input image I, we first transform it to the canonical space using CSTM. The transformed image is fed into a standard depth-normal estimation model to produce the predicted metric depth in the canonical space and metric-agnostic surface normal N. During training, is supervised by a GT depth D∗c which is also transformed into the canonical space. In inference, after producing the metric depth in the canonical space, we perform a de-canonical transformation to convert it back to the space of the original input I. The canonical space transformation and de-canonical transformation are executed using camera intrinsics. The predicted normal N is supervised by depth-normal consistency via the recovered metric depth D as well as GT normal N ∗ , if available.

Fig. 9 shows the pipeline. After performing either transformation, we randomly crop a patch for training. The cropping only adjusts the FOV and the optical center, thus not causing any metric ambiguity issues. In the labels transformation method ωr = 1 and ωd = f cf , while ωd = 1 and ωr = f cf in the images transformation method. During training, the transformed ground-truth depth labels D∗c are then used as supervision. Notably, as surface normals do not suffer from any metric ambiguity, we impose no transformation to normal labels N∗.

Mix-data training is an effective way to boost generalization. We collect 18 datasets for training, see Tab. 5 for details. In the mixed data, over 10K different cameras are included. All collected training data have included paired camera intrinsic parameters, which are used in our canonical transformation module.

3.3 Jointly optimizing depth and normal

We propose to optimize metric depth and surface normal jointly in an end-to-end manner. This optimization is primarily aimed at leveraging the large amount of annotation knowledge available in depth datasets to improve normal estimation, particularly in outdoor scenarios where depth datasets contain significantly more annotations than normal datasets. In our experiments, we collect from the community 9488K images with depth annotations across 14 outdoor datasets while less than 20K outdoor normal-labeled images, presented in Tab. 5.

To facilitate knowledge flow across depth and normal, we implement the learning-based optimization with recurrent refinement blocks, as depicted in Fig 10. Unlike previous monocular methods [42], [44], our method updates both depth and normal iteratively through these blocks. Inspired by RAFT [45], [46], we iteratively optimize the intermediate low-resolution depth ˆDc and unnormalized normal ˆNu, where ˆ denotes low resolution prediction ˆDc ∈ RH4 × W4 , and ˆNu ∈ RH4 × W4 ×3 , and the subscript means the depth ˆDc is in canonical space. As sketched in Fig. 10, ˆDtc and ˆNtu represent the low-resolution depth and normal optimized after step t, where t = 0, 1, 2, . . . , T denotes the step index. Initially, at step t = 0, ˆD0c and ˆN0u are given by the decoder. In addition to updating depth and normal, the optimization module also updates hidden feature maps , which are initialized by the decoder. During each iteration, the learned recurrent block F output updates ∆ ˆDc, ∆ ˆNu and renews the hidden features H:

The updates are then applied for updating the predictions:

To be more specific, the recurrent block F comprises a ConvGRU sub-block and two projection heads. First, the ConvGRU sub-block updates the hidden features Ht taking all the variables as inputs. Subsequently, the two branched projection heads Gd and Gn estimate the updates ∆ ˆDt+1 and ∆ ˆNt+1 respectively. A more comprehensive representation of Eq. 2, therefore, can be written as:

For detailed structures of the refinement module F, we recommend readers refer to supplementary materials.

After T + 1 iterative steps, we obtain the well-optimized low-resolution predictions ˆDT +1c and ˆNT +1u . These predictions are then up-sampled and post-processed to generate the final depth and surface normal N:

where is the ReLU function to guarantee depth is nonnegative, and represents normalization to ensure ∥n∥ = 1 for all pixels.

In a general formulation, the end-to-end network in Fig. 10 can be rewritten as:

where is the network’s ( Nd−n ) parameters.

Fig. 10 – Joint depth and normal optimization. In the canonical space, we deploy recurrent blocks composed of ConvGRU sub-blocks (C-RGU) and projection heads (Proj.) to predict the updates . During optimization, intermediate low-reolsution depth and normal ˆD0c ˆN0u are initially given by the decoder, and then iteratively refined by the predicted updates ∆. After T + 1 iterations, the optimized intermediate predictions ˆDT +1c ˆNT +1u are upsampled and post-processed to obtain the final depth in the canonical space and the final normal N.

3.4 Supervision

The training objective is:

where D∗c and Ic are transformed ground-truth depth labels and images in the canonical space c, N∗ denotes normal labels, L is the supervision loss to be illustrated as following. Random proposal normalization loss. To boost the performance of depth estimation, we propose a random proposal normalization loss (RPNL). The scale-shift invariant loss [25], [27] is widely applied for the affine-invariant depth estimation, which decouples the depth scale to emphasize the single image distribution. However, such normalization based on the whole image inevitably squeezes the fine-grained depth difference, particularly in close regions. Inspired by this, we propose to randomly crop several patches ( pi(i=0,...,M) ∈ Rhi×wi ) from the ground truth D∗c and the predicted depth . Then we employ the median absolute deviation normalization [87] for paired patches. By normalizing the local statistics, we can enhance local contrast. The loss function is as follows:

where d∗ ∈ D∗c and d ∈ Dc are the ground truth and predicted depth respectively. µ(·) and is the median of depth. M is the number of proposal crops, which is set to 32. During training, proposals are randomly cropped from the image by 0.125 to 0.5 of the original size. Furthermore, several other losses are employed, including the scale-invariant logarithmic loss [60] Lsilog , pair-wise normal regression loss [25] LPWN , virtual normal loss [18] LVNL. Note Lsilog is a variant of L1 loss. The overall losses are as follows.

Normal loss. To supervise normal prediction, we employ two distinct loss functions depending on the availability of ground-truth (GT) normals . As presented in Fig. 9, when GT normals are provided, we utilize an aleatoric uncertainty-aware loss [33] ( Ln(·) ) to supervise prediction N. Alternatively, in the absence of GT normals, we propose a consistency loss Ld−n(D, N) to align the predicted depth and normal. This loss is computed based on the similarity between a pseudo-normal map generated from the predicted depth using the least square method [41], and the predicted normal itself. Different from previous methods, [33], [41], this loss operates as a self-supervision mechanism, requiring no depth or normal ground truth labels. Note that here we use the depth D in the real world instead of the one in the canonical space to calculate depth-normal consistency. The overall losses are as follows.

, where wd = 0.5, wn = 1, wd−n = 0.01 serve as weights to balance the loss items.

4 EXPERIMENTS

Dataset details. We have curated a comprehensive dataset comprising 18 public RGB-D datasets, totaling over 16 million data points for training purposes. It spreads over diverse indoor and outdoor scenes. Notably, approximately 10 million frames are annotated with normal, but the majority of these annotations pertain to indoor scenes exclusively. It’s worth mentioning that all datasets have provided camera intrinsic parameters. Apart from the test split of training datasets, we collect 7 unseen datasets for robustness and generalization evaluation. Details of employed training and testing data are reported in Tab. 5. Implementation details. In our experiments, we employ different network architectures and aim to provide diverse choices for the community, including convnets and transformers. For convnets, we employ an UNet architecture with the ConvNext-large [99] backbone. ImageNet-22K pretrained weights are used for initialization. For transformers, we apply DINO v2-reg [83], [100] vision transformers [84] (ViT) as backbones, DPT [28] as decoders.

We use AdamW with a batch size of 192, an initial learning rate 0.0001 for all layers, and the polynomial decaying method with the power of 0.9. We train our models on 48 A100 GPUs for 800k iterations. Following the DiverseDepth [18], we balance all datasets in a minibatch to ensure each dataset accounts for an almost equal ratio. During training, images are processed by the canonical camera transformation module, flipped horizontally with a 50% chance, and then randomly cropped into 512 × 960

pixels for convnets and 616 × 1064 for vision transformers. In the ablation experiments, training settings are different as we sample 5000 images from each dataset for training. We trained on 8 GPUs for 150K iterations. Details of networks architectures, training setups, and efficiency analysis are presented in the supplementary materials. Fine-tuning experiments on KITTI and NYU are conducted on 8 GPUs with 20K further steps.

Evaluation details for monocular depth and normal estimation. a) To show the robustness of our metric depth estimation method, we test on 8 zero-shot benchmarks, including NYUv2 [58], KITTI [59], NuScenes [103], iBIMS-1 [104], DIODE [105] (indoor and outdoor, and full), ETH3D [106]. Following previous works [17], absolute relative error (AbsRel), the accuracy under threshold ( δi < 1.25i, i = 1, 2, 3), root mean squared error (RMS), root mean squared error in log space (RMS log), and log10 error (log10) metrics are employed. For KITTI and NYU benchmarks, we report the result of zero-shot and fine-tuning testing results. b) For normal estimation tasks and ablations, employ several error metrics to assess performance. Specifically, we calculate the mean (mean), median (median), and rooted mean square (RMS normal) of the angular error as well as the accuracy under threshold of {11.25◦, 22.5◦, 30.0◦} consistent with methodologies established in previous studies [33]. We conduct in-domain evaluation using the Scannet dataset, while the NYU and iBIMS-1 datasets are reserved for zero-shot generalization testing. c) Furthermore, we also follow

TABLE 2 – Quantitative comparison of surface normals on NYUv2, ibims-1, and ScanNet normal benchmarks. ‘ZS’ means zero-shot testing and ‘FT’ performs post fine-tuneing on the target dataset. Methods trained only on NYU are highlighted with grey. Best results are in bold and second bests are underlined. Our method ranks first over all benchmarks.

current affine-invariant depth benchmarks [25], [29] (Tab. 4) to evaluate the generalization ability on 5 zero-shot datasets, i.e., NYUv2, DIODE, ETH3D, ScanNet [107], and KITTI. We mainly compare with large-scale data trained models. Note that in this benchmark we follow existing methods to apply the scale shift alignment before evaluation.

We report results with different canonical transformation methods (CSTM lable and CSTM image) on the ConvNextLarge model (Conv-L in Tab. 1 and Tab. 2). As CSTM label is slightly better, more results using this method from multisize ViT-models (ViT-S for Small, ViT-L for Large, ViT-g for giant2) are reported. Note that all models for zero-shot testing use the same checkpoints except for fine-tuning experiments. Evaluation details for reconstruction and SLAM. a) To evaluate our metric 3D reconstruction quality, we randomly sample 9 unseen scenes from NYUv2 and use colmap [108] to obtain the camera poses for multi-frame reconstruction. Chamfer distance and the F-score [109] are used to evaluate the reconstruction accuracy. b) In dense-SLAM experiments, following Li et al. [110], we test on the KITTI odometry benchmark [59] and evaluate the average translational RMS drift ( %, trel ) and rotational RMS drift ( ◦/100m, rrel) errors [59]. Evaluation on metric depth benchmarks. To evaluate the accuracy of predicted metric depth, firstly, we compare with state-of-the-art (SoTA) metric depth prediction methods on NYUv2 [58], KITTI [111]. We use the same model to do all evaluations. Results are reported in Tab. 1. Firstly, comparing with existing overfitting methods, which are trained on benchmarks for hundreds of epochs, our zero-shot testing (‘ZS’ in the table) without any fine-tuning or

TABLE 3 – Quantitative comparison with SoTA metric depth methods on 5 unseen benchmarks. For SoTA methods, we use their NYUv2 and KITTI models for indoor and outdoor scene evaluation respectively, while we use the same model for all zero-shot testing.

TABLE 4 – Comparison with SoTA affine-invariant depth methods on 5 zero-shot transfer benchmarks. Our model significantly outperforms previous methods and sets new state-of-the-art. Following the benchmark setting, all methods have manually aligned the scale and shift.

metric adjustment already achieves comparable or even better performance on some metrics. Then comparing with robust monocular depth estimation methods, such as Zerodepth [90] and ZoeDepth [36], our zero-shot testing is also better than them. Further post finetuning (‘FT in the table’) lifts our method to the 1st rank.

Furthermore, We collect 5 unseen datasets to do more metric accuracy evaluation. These datasets contain a wide range of indoor and outdoor scenes, including rooms, buildings, and driving scenes. The camera models are also varied. We mainly compare with the SoTA metric depth estimation methods and take their NYUv2 and KITTI models for indoor and outdoor scene evaluation respectively. From Tab. 3, we observe that although NuScenes is similar to KITTI, existing methods face a noticeable performance decrease. In contrast, our model is more robust. Generalization over diverse scenes. Affine-invariant depth benchmarks decouple the scale’s effect, which aims to evaluate the model’s generalization ability to diverse scenes. Recent impact works, such as MiDaS, LeReS, and DPT, achieved promising performance on them. Following them, we test on 5 datasets and manually align the scale and shift to the ground-truth depth before evaluation. Results are reported in Tab. 4. Although our method enforces the network to recover more challenging metric information, our method outperforms them by a large margin on all datasets. Evaluation on surface normal benchmarks. We evaluate our methods on ScanNet, NYU, and iBims-1 surface normal benchmarks. Results are reported in Tab. 2. Firstly, we organize a zero-shot testing benchmark on NYU dataset, see methods denoted with ‘ZS’ in the table. We compare

with existing methods which are trained on ScanNet or Taskonomy and have achieved promising performance on them, such as Polymax [34] and Bae et al. [33]. Our method surpasses them over most metrics. Then, comparing with methods that have been overfitted the NYU data domain for hundreds of epochs (see methods marked with blue), our zero-shot testing outperforms them on all metrics. Our postfinetuned models (see ‘FT’ marks) further boost the performance. Similarly, we also achieve SoTA performance on iBims-1 and Scannet benchmarks. For the iBims-1 dataset, we follow IronDepth [42] to generate the ground-truth normal annotations.

TABLE 5 – Training and testing datasets used for experiments.

Fig. 11 – Qualitative comparisons of metric depth and surface normals for iBims, DIODE, NYU, Eth3d, Nuscenes, and self-collected drone datasets. We present visualization results of our predictions (‘Ours Depth’ / ‘Ours Normal’), groundtruth labels (‘GT Depth’ / ‘GT Normal’) and results from other metric depth (‘ZoeDepth’ [36]) and surface normal methods (‘Bae et al.’ [33] and ‘OmniData’ [35]).

Fig. 12 – Qualitative comparisons of metric depth and surface normals in the wild. We present visualization results of our predictions (‘Ours Depth’ / ‘Ours Normal’) and results from other metric depth (‘ZoeDepth’ [36]) and surface normal methods (‘Bae et al.’ [33] and ‘OmniData’ [35]).

4.1 Zero-shot Generalization

Qualitative comparisons of surface normals and depths. We visualize our predictions in Fig. 11. A comparison with another widely used generalized metric depth method, ZoeDepth [36], demonstrates that our approach produces depth maps with superior details on fine-grained structures (objects in row1, suspension lamp in row4, beam in row8), and better foreground/background distinction (row 11, 12). In terms of surface normal prediction, our normal maps exhibits significantly finer details compared to Bae. et al. [33] and can handle some cases where their method fail (row7, 8, 9). Our method not only generalizes well across diverse scenarios but can also be directly applied to unseen camera models like the fisheye camera shown in row 12. More visualization results for in-the-wild images are presented in Fig. 12, including comic-style (Row 2) and CG(computer graphics)-generated objects (Row5)

4.2 Applications Based on Our Method

In these experiments, we apply the CSTM image model to various tasks.

3D scene reconstruction . To demonstrate our work can recover the 3D metric shape in the wild, we first do the quantitative comparison on 9 NYUv2 scenes, which are unseen during training. We predict the per-frame metric depth and then fuse them together with provided camera poses. Results are reported in Tab. 6. We compare with the video consistent depth prediction method (RCVD [128]), the unsupervised video depth estimation method (SC-DepthV2 [129]), the 3D scene shape recovery method (LeReS [25]), affine-invariant depth estimation method (DPT [28]), and the multi-view stereo reconstruction method (DPSNet [48]). Apart from DPSNet and our method, other methods have to align the scale with the ground truth

depth for each frame. Although our method does not aim for the video or multi-view reconstruction problem, our method can achieve promising consistency between frames and reconstruct much more accurate 3D scenes than others on these zero-shot scenes. From the qualitative comparison in Fig. 13. our reconstructions have much less noise and outliers. Dense-SLAM mapping. Monocular SLAM is an important robotics application. It only relies on a monocular video input to create the trajectory and dense 3D mapping. Owing to limited photometric and geometric constraints, existing methods face serious scale drift problems in large scenes and cannot recover the metric information. Our robust metric depth estimation method is a strong depth prior to the SLAM system. To demonstrate this benefit, we naively input our metric depth to the SoTA SLAM system, DroidSLAM [37], and evaluate the trajectory on KITTI. We do not do any tuning on the original system. Trajectory comparisons are reported in Tab. 7. As Droid-SLAM can access accurate per-frame metric depth, like an RGB-D SLAM, the translation drift ( trel ) decreases significantly. Furthermore, with our depths, Droid-SLAM can perform denser and more accurate 3D mapping. An example is shown in Fig. 3 and more cases are shown in the supplementary materials.

We also test on the ETH3D SLAM benchmarks. Results are reported in Tab. 8. Droid with our depths has much better SLAM performance. As the ETH3D scenes are all small-scale indoor scenes, the performance improvement is less than that on KITTI. Metrology in the wild. To show the robustness and accuracy of our recovered metric 3D, we download Flickr photos captured by various cameras and collect coarse camera intrinsic parameters from their metadata. We use our CSTM image model to reconstruct their metric shape and measure structures’ sizes (marked in red in Fig. 14),

TABLE 6 – Quantitative comparison of 3D scene reconstruction with LeReS [25], DPT [28], RCVD [128], SC-DepthV2 [129], and a learning-based MVS method (DPSNet [48]) on 9 unseen NYUv2 scenes. Apart from DPSNet and ours, other methods have to align the scale with ground truth depth for each frame. As a result, our reconstructed 3D scenes achieve the best performance.

Fig. 13 – Reconstruction of zero-shot scenes with multiple views. We sample several NYUv2 scenes for 3D reconstruction comparison. As our method can predict accurate metric depth, thus all frame’s predictions are fused together for scene reconstruction. By contrast, LeReS [25]’s depth is up to an unknown scale and shift, which causes noticeable distortions. DPSNet [48] is a multi-view stereo method, which cannot work well on low-texture regions.

TABLE 7 – Comparison with SoTA SLAM methods on KITTI. We input predicted metric depth to the DroidSLAM [37] (‘Droid+Ours’), which outperforms others by a large margin on trajectory accuracy.

TABLE 8 – Comparison of VO error on ETH3D benchmark. Droid SLAM system is input with our depth (‘Droid + Ours’), and ground-truth depth (‘Droid + GT’). The average trajectory error is reported.

while the ground-truth sizes are in blue. It shows that our measured sizes are very close to the ground-truth sizes. Monocular reconstruction in the wild. To further visualize

the reconstruction quality of our recovered metric depth, we randomly collect images from the internet and recover their metric 3D and normals. As there is no focal length provided, we select proper focal lengths according to the reconstructed shape and normal maps. The reconstructed pointclouds are colorized by their corresponding normals (Different views are marked by red and orange arrays in Fig. 15).

4.3 Ablation Study

Ablation on canonical transformation. We study the effect of our proposed canonical transformation for the input images (‘CSTM input’) and the canonical transformation for the ground-truth labels (‘CSTM output’). Results are reported in Tab. 9. We train the model on sampled mixed data (90K images) and test it on 6 datasets. A naive baseline (‘Ours w/o CSTM’) is to remove CSTM modules and enforce the same supervision as ours. Without CSTM, the model is unable to converge when training on mixed metric datasets and cannot achieve metric prediction ability on zero-shot datasets. This is why recent mixed-data training methods compromise learning the affine-invariant depth to

Fig. 14 – Metrology of in-the-wild scenes. We collect several Flickr photos, which are captured by various cameras. With photos’ metadata, we reconstruct the 3D metric shape and measure structures’ sizes. Red and blue marks are ours and ground-truth sizes respectively.

Fig. 15 – Reconstruction from in-the-wild single images. We collect web images and select proper focal lengths. The reconstructed pointclouds are colorized by normals.

avoid metric issues. In contrast, our two CSTM methods both can enable the model to achieve the metric prediction ability, and they can achieve comparable performance. Tab. 1 also shows comparable performance. Therefore, both adjusting the supervision and the input image appearance during training can solve the metric ambiguity issues. Furthermore, we compare with CamConvs [38], which encodes the camera model in the decoder with a 4-channel feature. ‘CamConvs’ employ the same training schedule, model, and training data as ours. This method enforces the network to implicitly understand various camera models from the image appearance and then bridges the imaging size to the real-world size. We believe that this method challenges the data diversity and network capacity, thus their performance is worse than ours.

Ablation on canonical space. We study the effect of the

TABLE 9 – Effectiveness of our CSTM. CamConvs [38] directly encodes various camera models in the network, while we perform a simple yet effective transformation to solve the metric ambiguity. Without CSTM, the model achieve transferable metric prediction ability.

canonical camera, i.e., the canonical focal length. We train the model on the small sampled dataset and test it on the validation set of training data and testing data. The average AbsRel error is calculated. We experiment on 3 different focal lengths, i.e., 500, 1000, 1500. Experiments show that focal = 1000 has slightly better performance than others, see supplementary for more details.

Effectiveness of the random proposal normalization loss. To show the effectiveness of our proposed random proposal normalization loss (RPNL), we experiment on the sampled small dataset. Results are shown in Tab. 10. We test on the DDAD, Lyft, DrivingStereo (DS), NuScenes (NS), KITTI, and NYUv2. The ‘baseline’ employs all losses except our RPNL. We compare it with ‘baseline + RPNL’ and ‘baseline + SSIL [27]’. We can observe that our proposed random proposal normalization loss can further improve the performance. In contrast, the scale-shift invariant loss [27], which does the normalization on the whole image, can only slightly improve the performance.

Effectiveness of joint optimization. We assess the impact of joint optimization on both depth and normal estimation using small datasets sampled with ViT-small models over a 4-step iteration. The evaluation is conducted on the NYU indoor dataset and the DIODE outdoor dataset, both of which include normal labels for the convenience of evaluation. In Tab. 11, we start by training the same-architecture networks ‘without depth’ or ‘without normal’ prediction. Compared to our joint optimization approach, both singlemodality models exhibit slightly worse performance.To further demonstrate the benefit of joint optimization and the incorporation of large-scale outdoor data prior to normal estimation, we train a model using only the Taskonomy dataset (i.e., ’W.o. mixed datasets’), which shows inferior results on DIODE(outdoor). We also verify the effectiveness of the recurrent blocks and the consistency loss. Removing either of them (‘W.o. consistency’ / ‘W.o. recurrent block’) could lead to drastic performance degradation for normal estimation, particularly for outdoor scenarios like DIODE(Outdoor). Furthermore, we present some visualization comparisons in Fig 16. Training surface normal and depth together without the consistency loss (’W.o. consistency’) results in notably poorer predicted normals compared to our full method (’Ours normal’). Additionally, if the model learns the normal individually (’W.o. depth’), the performance also degrades.

Fig. 16 – Effect of joint depth-normal optimization. We compare normal maps learned by different strategies on several outdoor examples. Learning normal only ‘without depth’ leads to flattened surfaces, since most of the normal labels lie on planes. In addition, ‘without consistency’ imposed between depth and normal, the predictions become much coarser.

TABLE 11 – Effectiveness of joint optimization. Joint optimization surpasses independent estimation. For outdoor normal estimation, this module introduces geometry clues from large-scale depth data. The proposed recurrent block and depth-normal consistency constraint are essential for the optimization

The efficiency analysis of the joint optimization module is presented in the supplementary materials.

Selection of intermediate normal representation. During optimization, unnormalized normal vectors are utilized as the intermediate representation. Here we explore three additional representations (1) A vector defined in so3 representing 3D rotation upon a reference direction. We implement this vector by lietorch [37]. (2) An azimuthal angle and a polar angle. (3) A 2D homogeneous vector [79]. All the representations investigated are additive and can be surjectively transferred into surface normal. In this experiment, we only change the representations and compare the performances. Surprisingly, according to Table 12, the naive unnormalized normal performs the best. We hypothesize that this simplest representation reduces the learning difficulty.

TABLE 12 – More selection of intermediate normal representation.

Best optimizing steps To determine the optimal number of optimization steps for various ViT models, we vary different steps to refine depth and normal. Table 13 illustrates that increasing the number of iteration steps does not consistently improve results. Moreover, the ideal number of steps may differ based on the model size, with larger models generally benefiting from more extensive optimization.

TABLE 13 – Select the best joint optimizing steps for different ViT models. We find the best step varying with model size and choose the best-fitted steps according to the following experiment results. All models are trained following the settings in Tab. 11

5 CONCLUSION

In this paper, we introduce a family of geometric foundation models for zero-shot monocular metric depth and surface normal estimation. We propose solutions to address challenges in both metric depth estimation and surface normal estimation. To resolve depth ambiguity caused by varying focal lengths, we present a novel canonical camera space transformation method. Additionally, to overcome the scarcity of outdoor normal data labels, we introduce a joint depth-normal optimization framework that leverages knowledge from large-scale depth annotations.

Our approach enables the integration of millions of data samples captured by over 10,000 cameras to train a unified metric-depth and surface-normal model. To enhance the model’s robustness, we curate a dataset comprising over 16 million samples for training. Zero-shot evaluations demonstrate the effectiveness and robustness of our method. For downstream applications, our models are capable of reconstructing metric 3D from a single view, enabling metrology on randomly collected internet images and dense mapping of large-scale scenes. With their precision, generalization, and versatility, Metric3D v2 models serve as geometric foundational models for monocular perception.

Authors’ photographs and biographies not available at the time of submission.

REFERENCES

[1] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen, “Dense 3d object reconstruction from a single depth view,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 12, pp. 2820–2834, 2018. 1

[2] J. Ju, C. W. Tseng, O. Bailo, G. Dikov, and M. Ghafoorian, “Dgrecon: Depth-guided neural 3d scene reconstruction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18184–18194, 2023. 1

[3] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4460–4470, 2019. 1

[4] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan, “Depth-supervised nerf: Fewer views and faster training for free,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12882–12891, 2022. 1

[5] B. Roessle, J. T. Barron, B. Mildenhall, P. P. Srinivasan, and M. Nießner, “Dense depth priors for neural radiance fields from sparse input views,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12892–12901, 2022. 1

[6] Z. Yu, S. Peng, M. Niemeyer, T. Sattler, and A. Geiger, “Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction,” Advances in neural information processing systems, vol. 35, pp. 25018–25032, 2022. 1, 5

[7] C. Jiang, H. Zhang, P. Liu, Z. Yu, H. Cheng, B. Zhou, and S. Shen, “H2-mapping: Real-time dense mapping using hierarchical hybrid representation,” arXiv preprint arXiv:2306.03207, 2023. 1

[8] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,” arXiv: Comp. Res. Repository, p. 2206.10092, 2022. 1

[9] Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, and J. M. Alvarez, “Fb-occ: 3d occupancy prediction based on forwardbackward view transformation,” arXiv preprint arXiv:2307.01492, 2023. 1

[10] R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision, pp. 340–356, Springer, 2020. 1

[11] J. Behley and C. Stachniss, “Efficient surfel-based slam using 3d laser range data in urban environments.,” in Robotics: Science and Systems, vol. 2018, p. 59, 2018. 1, 5

[12] R. Mur-Artal and J. D. Tard´os, “ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017. 1, 13

[13] T. Schops, T. Sattler, and M. Pollefeys, “Bad slam: Bundle adjusted direct rgb-d slam,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 134–144, 2019. 1

[14] H. Zhu, H. Yang, X. Wu, D. Huang, S. Zhang, X. He, T. He, H. Zhao, C. Shen, Y. Qiao, et al., “Ponderv2: Pave the way for 3d foundataion model with a universal pre-training paradigm,” arXiv preprint arXiv:2310.08586, 2023. 1

[15] J. Zhou, J. Wang, B. Ma, Y.-S. Liu, T. Huang, and X. Wang, “Uni3d: Exploring unified 3d representation at scale,” arXiv preprint arXiv:2310.06773, 2023. 1

[16] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and A. Geiger, “Unifying flow, stereo and depth estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1

[17] W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “New CRFs: Neural window fully-connected CRFs for monocular depth estimation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2022. 1, 2, 4, 9, 10

[18] W. Yin, Y. Liu, and C. Shen, “Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction,” IEEE Trans. Pattern Anal. Mach. Intell., 2021. 1, 2, 4, 5, 8, 10

[19] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4009–4018, 2021. 1, 2, 4, 9, 10

[20] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformerbased attention networks for continuous pixel-wise prediction,” in Proc. IEEE Int. Conf. Comp. Vis., 2021. 1, 2, 9

[21] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo, “Monocular relative depth perception with web stereo data supervision,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 311– 320, 2018. 2, 5

[22] K. Xian, J. Zhang, O. Wang, L. Mai, Z. Lin, and Z. Cao, “Structureguided ranking loss for single image depth prediction,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 611–620, 2020. 2, 5

[23] W. Chen, S. Qian, D. Fan, N. Kojima, M. Hamilton, and J. Deng, “Oasis: A large-scale dataset for single image 3d in the wild,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 679–688, 2020. 2, 5

[24] W. Chen, Z. Fu, D. Yang, and J. Deng, “Single-image depth perception in the wild,” in Proc. Advances in Neural Inf. Process. Syst., pp. 730–738, 2016. 2, 5

[25] W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, “Learning to recover 3d scene shape from a single image,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021. 2, 3, 4, 5, 8, 9, 10, 12, 13

[26] W. Yin, J. Zhang, O. Wang, S. Niklaus, S. Chen, Y. Liu, and C. Shen, “Towards accurate reconstruction of 3d scene shape from a single monocular image,” IEEE Trans. Pattern Anal. Mach. Intell., 2022. 2, 5

[27] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Trans. Pattern Anal. Mach. Intell., 2020. 2, 3, 5, 8, 10, 14

[28] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 12179– 12188, 2021. 2, 8, 10, 12, 13

[29] C. Zhang, W. Yin, Z. Wang, G. Yu, B. Fu, and C. Shen, “Hierarchical normalization for robust monocular depth estimation,” Proc. Advances in Neural Inf. Process. Syst., 2022. 2, 3, 5, 9, 10

[30] W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in Proc. IEEE Int. Conf. Comp. Vis., 2019. 2, 4, 9

[31] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” arXiv preprint arXiv:2312.02145, 2023. 2

[32] T. Do, K. Vuong, S. I. Roumeliotis, and H. S. Park, “Surface normal estimation of tilted images via spatial rectifier,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp. 265–280, Springer, 2020. 2, 5, 9

[33] G. Bae, I. Budvytis, and R. Cipolla, “Estimating and exploiting the aleatoric uncertainty in surface normal estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13137–13146, 2021. 2, 5, 8, 9, 10, 11, 12

[34] X. Yang, L. Yuan, K. Wilber, A. Sharma, X. Gu, S. Qiao, S. Debats, H. Wang, H. Adam, M. Sirotenko, et al., “Polymax: General dense prediction with mask transformer,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1050– 1061, 2024. 2, 5, 9, 10

[35] A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 10786– 10796, 2021. 2, 3, 5, 10, 11, 12

[36] S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,” arXiv preprint arXiv:2302.12288, 2023. 3, 9, 10, 11, 12

[37] Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,” vol. 34, pp. 16558–16569, 2021. 3, 4, 12, 13, 15

[38] J. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “CAM-Convs: camera-aware multi-scale convolutions for single-view depth,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 11826–11835, 2019. 2, 14

[39] S. Peng, S. Zhang, Z. Xu, C. Geng, B. Jiang, H. Bao, and X. Zhou, “Animatable neural implicit surfaces for creating avatars from videos,” arXiv: Comp. Res. Repository, p. 2203.08133, 2022. 3

[40] J. Huang, Y. Zhou, T. Funkhouser, and L. J. Guibas, “Framenet: Learning local canonical frames of 3d surfaces from a single rgb image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8638–8647, 2019. 3, 9, 10

[41] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 283–291, 2018. 3, 5, 8, 9

[42] G. Bae, I. Budvytis, and R. Cipolla, “Irondepth: Iterative refinement of single-view depth using surface normal and its uncertainty,” arXiv preprint arXiv:2210.03676, 2022. 3, 5, 7, 9, 10

[43] J. Park, K. Joo, Z. Hu, C.-K. Liu, and I. So Kweon, “Non-local spatial propagation network for depth completion,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16, pp. 120–136, Springer, 2020. 3, 5

[44] S. Shao, Z. Pei, X. Wu, Z. Liu, W. Chen, and Z. Li, “Iebins: Iterative elastic bins for monocular depth estimation,” arXiv preprint arXiv:2309.14137, 2023. 3, 5, 7, 9

[45] L. Lipson, Z. Teed, and J. Deng, “Raft-stereo: Multilevel recurrent field transforms for stereo matching,” in Int. Conf. 3D. Vis., 2021. 3, 5, 7

[46] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 402–419, Springer, 2020. 3, 5, 7

[47] L. Sun, W. Yin, E. Xie, Z. Li, C. Sun, and C. Shen, “Improving monocular visual odometry using learned depth,” IEEE Transactions on Robotics, vol. 38, no. 5, pp. 3173–3186, 2022. 4

[48] S. Im, H.-G. Jeon, S. Lin, and I.-S. Kweon, “Dpsnet: End-to-end deep plane sweep stereo,” in Proc. Int. Conf. Learn. Representations, 2019. 4, 12, 13

[49] R. Mur-Artal and J. D. Tard´os, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017. 4

[50] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, “Single view metrology in the wild,” in Proc. Eur. Conf. Comp. Vis., pp. 316–333, Springer, 2020. 4

[51] J. T. Barron and J. Malik, “Shape, illumination, and reflectance from shading,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 8, pp. 1670–1687, 2014. 4

[52] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, “Pixel2mesh: Generating 3d mesh models from single RGB images,” in Proc. Eur. Conf. Comp. Vis., pp. 52–67, 2018. 4

[53] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. Freeman, and J. Tenenbaum, “Learning shape priors for single-view 3d completion and reconstruction,” in Proc. Eur. Conf. Comp. Vis., pp. 646–662, 2018. 4

[54] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li, “Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2304–2314, 2019. 4

[55] S. Saito, T. Simon, J. Saragih, and H. Joo, “Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 84– 93, 2020. 4

[56] A. Saxena, M. Sun, and A. Y. Ng, “Make3d: Learning 3d scene structure from a single still image,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, 2008. 4

[57] C. Zhang, W. Yin, G. Yu, Z. Wang, T. Chen, B. Fu, J. T. Zhou, and C. Shen, “Robust geometry-preserving depth estimation using differentiable rendering,” in Proc. IEEE Int. Conf. Comp. Vis., 2023. 4

[58] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Proc. Eur. Conf. Comp. Vis., pp. 746–760, Springer, 2012. 4, 5, 9, 10

[59] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” Int. J. Robot. Res., 2013. 4, 9, 10

[60] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proc. Advances in Neural Inf. Process. Syst., pp. 2366–2374, 2014. 4, 8

[61] K. Wang, F. Gao, and S. Shen, “Real-time scalable dense surfel mapping,” in 2019 International conference on robotics and automation (ICRA), pp. 6919–6925, IEEE, 2019. 5

[62] J. Wang, P. Wang, X. Long, C. Theobalt, T. Komura, L. Liu, and W. Wang, “Neuris: Neural reconstruction of indoor scenes using normal priors,” in European Conference on Computer Vision, pp. 139–155, Springer, 2022. 5

[63] X. Qi, Z. Liu, R. Liao, P. H. Torr, R. Urtasun, and J. Jia, “Geonet++: Iterative geometric neural network with edge-aware refinement for joint depth and surface normal estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 2, pp. 969– 984, 2020. 5, 9

[64] X. Wang, D. Fouhey, and A. Gupta, “Designing deep networks for surface normal estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 539–547, 2015. 5, 9

[65] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE international conference on computer vision, pp. 2650–2658, 2015. 5, 9

[66] S. Liao, E. Gavves, and C. G. Snoek, “Spherical regression: Learning viewpoints, surface normals and 3d rotations on n-spheres,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9759–9767, 2019. 5

[67] L. Ladick`y, B. Zeisl, and M. Pollefeys, “Discriminatively trained dense surface normal estimation,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 468–484, Springer, 2014. 5, 9

[68] B. Li, C. Shen, Y. Dai, A. Van Den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1119–1127, 2015. 5

[69] X. Long, Y. Zheng, Y. Zheng, B. Tian, C. Lin, L. Liu, H. Zhao, G. Zhou, and W. Wang, “Adaptive surface normal constraint for geometric estimation from monocular images,” arXiv preprint arXiv:2402.05869, 2024. 5

[70] X. Long, C. Lin, L. Liu, W. Li, C. Theobalt, R. Yang, and W. Wang, “Adaptive surface normal constraint for depth estimation,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 12849–12858, 2021. 5

[71] W. Chen, D. Xiang, and J. Deng, “Surface normals in the wild,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1557–1566, 2017. 5

[72] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8934–8943, 2018. 5

[73] H. Liu, T. Lu, Y. Xu, J. Liu, and L. Wang, “Learning optical flow and scene flow with bidirectional camera-lidar fusion,” arXiv preprint arXiv:2303.12017, 2023. 5

[74] X. Cheng, P. Wang, and R. Yang, “Depth estimation via affinity learned with convolutional spatial propagation network,” in Proceedings of the European conference on computer vision (ECCV), pp. 103–119, 2018. 5

[75] M. Hu, S. Wang, B. Li, S. Ning, L. Fan, and X. Gong, “Penet: Towards precise and efficient image guided depth completion,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13656–13662, IEEE, 2021. 5

[76] Z. Ma, Z. Teed, and J. Deng, “Multiview stereo with cascaded epipolar raft,” in European Conference on Computer Vision, pp. 734– 750, Springer, 2022. 5

[77] G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21919– 21928, 2023. 5

[78] J. E. Lenssen, C. Osendorfer, and J. Masci, “Deep iterative surface normal estimation,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp. 11247–11256, 2020. 5

[79] W. Zhao, S. Liu, Y. Wei, H. Guo, and Y.-J. Liu, “A confidencebased iterative solver of depths and surface normals for deep multi-view stereo,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6168–6177, 2021. 5, 15

[80] W. Yin, Y. Liu, C. Shen, A. v. d. Hengel, and B. Sun, “The devil is in the labels: Semantic segmentation from sentences,” arXiv: Comp. Res. Repository, p. 2202.02002, 2022. 5

[81] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., pp. 8748–8763, PMLR, 2021. 5

[82] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “Mseg: A composite dataset for multi-domain semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2879–2888, 2020. 5

[83] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023. 5, 8

[84] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” Proc. Int. Conf. Learn. Representations, 2021. 5, 8, 10

[85] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022. 5, 10

[86] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for deep single image camera calibration,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2354–2363, 2018. 5

[87] D. Singh and B. Singh, “Investigating the impact of data normalization on classification performance,” Applied Soft Computing, 2019. 8

[88] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single rgb images,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3372–3380, 2017. 9

[89] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV), pp. 239–248, IEEE, 2016. 9

[90] V. Guizilini, I. Vasiljevic, D. Chen, R. Ambrus,, and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9233–9243, 2023. 9, 10

[91] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” arXiv:2401.10891, 2024. 9

[92] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang, “Learning monocular depth by distilling cross-domain stereo networks,” in Proc. Eur. Conf. Comp. Vis., pp. 484–500, 2018. 9

[93] D. F. Fouhey, A. Gupta, and M. Hebert, “Unfolding an indoor origami world,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 687–702, Springer, 2014. 9

[94] A. Bansal, B. Russell, and A. Gupta, “Marr revisited: 2d-3d alignment via surface normal prediction,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5965– 5974, 2016. 9

[95] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille, “Surge: Surface regularized geometry estimation from a single image,” Advances in Neural Information Processing Systems, vol. 29, 2016. 9

[96] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Patternaffinitive propagation across depth, surface normal and semantic segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4106–4115, 2019. 9

[97] R. Wang, D. Geraghty, K. Matzen, R. Szeliski, and J.-M. Frahm, “Vplnet: Deep single view normal estimation with vanishing points and lines,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 689–698, 2020. 9

[98] J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” arXiv: Comp. Res. Repository, p. 1907.10326, 2019. 9

[99] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 11976–11986, 2022. 8, 10

[100] T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski, “Vision transformers need registers,” arXiv preprint arXiv:2309.16588, 2023. 8

[101] S. Xie, R. Girshick, P. Doll´ar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1492–1500, 2017. 10

[102] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113, 2022. 10

[103] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 11621–11631, 2020. 9, 10

[104] T. Koch, L. Liebel, F. Fraundorfer, and M. Korner, “Evaluation of cnn-based single-image depth estimation methods,” in Eur. Conf. Comput. Vis. Worksh., pp. 0–0, 2018. 9, 10

[105] I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, et al., “Diode: A dense indoor and outdoor depth dataset,” arXiv: Comp. Res. Repository, p. 1908.00463, 2019. 9, 10

[106] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3260–3269, 2017. 9, 10

[107] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5828–5839, 2017. 9, 10

[108] J. L. Sch¨onberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in Proc. Eur. Conf. Comp. Vis., 2016. 9

[109] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Trans. Graph., vol. 36, no. 4, pp. 1–13, 2017. 9

[110] S. Li, X. Wu, Y. Cao, and H. Zha, “Generalizing to the open world: Deep visual odometry with online adaptation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 13184–13193, 2021. 9

[111] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3354–3361, IEEE, 2012. 9

[112] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020. 10

[113] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet, “Level 5 perception dataset 2020.” https://level-5. global/level5/data/, 2019. 10

[114] G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in au-

tonomous driving scenarios,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019. 10

[115] J. Cho, D. Min, Y. Kim, and K. Sohn, “DIML/CVL RGB-D dataset: 2m RGB-D images of natural indoor and outdoor scenes,” arXiv: Comp. Res. Repository, 2021. 10

[116] B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays, “Argoverse 2: Next generation datasets for selfdriving perception and forecasting,” in Proc. Advances in Neural Inf. Process. Syst., 2021. 10

[117] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016. 10

[118] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, “Dsec: A stereo event camera dataset for driving scenarios,” IEEE Robotics and Automation Letters, 2021. 10

[119] M. Lopez-Antequera, P. Gargallo, M. Hofinger, S. R. Bul`o, Y. Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” in Proc. Eur. Conf. Comp. Vis., vol. 12347, pp. 589–604, 2020. 10

[120] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang, Y. Wang, and D. Yang, “Pandaset: Advanced sensor suite dataset for autonomous driving,” in IEEE Int. Intelligent Transportation Systems Conf., 2021. 10

[121] Z. Bauer, F. Gomez-Donoso, E. Cruz, S. Orts-Escolano, and M. Cazorla, “Uasol, a large-scale high-resolution outdoor stereo dataset,” Scientific data, vol. 6, no. 1, pp. 1–14, 2019. 10

[122] Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020. 10

[123] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2446–2454, 2020. 10

[124] A. Zamir, A. Sax, , W. Shen, L. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., IEEE, 2018. 10

[125] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, et al., “The replica dataset: A digital replica of indoor spaces,” arXiv preprint arXiv:1906.05797, 2019. 10

[126] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, et al., “Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai,” arXiv preprint arXiv:2109.08238, 2021. 10

[127] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10912–10922, 2021. 10

[128] J. Kopf, X. Rong, and J.-B. Huang, “Robust consistent video depth estimation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2021. 12, 13

[129] J.-W. Bian, H. Zhan, N. Wang, T.-J. Chin, C. Shen, and I. Reid, “Auto-rectify network for unsupervised indoor depth estimation,” IEEE Trans. Pattern Anal. Mach. Intell., 2021. 12, 13

[130] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1983–1992, 2018. 13

[131] S. Song, M. Chandraker, and C. C. Guest, “High accuracy monocular sfm and scale correction for autonomous driving,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 730–743, 2015. 13

Supplementary Materials for Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation

Details for ConvNet models. In our work, our encoder employs the ConvNext [?] networks, whose pretrained weight is from the official released ImageNet-22k pretraining. The decoder follows the adabins [?]. We set the depth bins number to 256, and the depth range is [0.3m, 150m]. We establish 4 flip connections from different levels of encoder blocks to the decoder to merge more low-level features. An hourglass subnetwork is attached to the head of the decoders to enhance background predictions.

Details for ViT models. We use dino-v2 transformers [?] with registers [?] as our encoders, which are pretrained on a curated dataset with 142M images. DPT [?] is used as the decoders. For the ViT-S and ViT-L variants, the DPT decoders take only the last-layer normalized encoder features as the input for stabilized training. The giant ViT-g model instead takes varying-layer features, the same as the original DPT settings. Different from the convnets models above, we use depth bins ranging from [0.1m, 200m] for ViT models.

Details for recurrent blocks. As illustrated in Fig 1. Each recurrent block updates hierarchical features maps {Ht1/14, Ht1/7, Ht1/4} at { 114, 17, 14} scales and the intermediate predictions s ˆ Dtc , ˆat each it- eration step t. This block compromises three ConvGRU sub-blocks to refine feature maps at different scales, and two projection heads and to predict updates for depth and normal respectively. The feature maps are gradually refined from the coarsest

( Ht1/14 ) to the finest ( Ht1/4 ). For instance, the refined feature map at the 114 scale Ht+11/14 is fed into the sec- ond ConvGRU sub-block to refine the -scale feature map Ht1/7 . Finally, the projection heads Gd, Gn em-

ploys a concatenation of original predictions ˆ Dtc , ˆand the to finest feature map Ht1/4 to predict the up- date items ∆ ˆ Dt+1c , ∆ ˆ Nt+1 . Both projection heads are composed of two linear layers with a sandwiched ReLU activation layer.

Resource comparison of different models. We compare the resource and performance among our model families in Tab. 1. All inference-time and GPU memory results are computed on an Nvida-A100 40G GPU with the original pytorch implemented models (No engineering optimization like TensorRT or

Figure 1: Detailed structure of the update block. Inspired by RAFTStereo [?], we build slowfast ConvGRU sub-blocks (denoted as ‘C-GRU’) to refine hierarchical hidden feature maps. Projection heads Gd, Gn are attached to the end of the final ConvGRU to prediction update items for the predictions.

ONNX). Generally, the enormous ViT-Large/giantbackbone models enjoy better performance, while the others are more deployment-friendly. In addition, our models built in classical en-decoder schemes run much faster than the recent diffusion counterpart [?].

We collect over 16M data from 18 public datasets for training. Datasets are listed in Tab. 2. When training the ConvNeXt-backbone models, we use a smaller collection containing the following 11 datasets with 8M images: DDAD [?], Lyft [?], DrivingStereo [?], DIML [?], Argoverse2 [?], Cityscapes [?], DSEC [?], Maplillary PSD [?], Pandaset [?], UASOL[?], and Taskonomy [?]. In the autonomous driving datasets, including DDAD [?], Lyft [?], DrivingStereo [?], Argoverse2 [?], DSEC [?], and Pandaset [?], have provided LiDar and camera intrinsic and extrinsic parameters. We project the LiDar to image planes to obtain ground-truth depths. In contrast, Cityscapes [?], DIML [?], and UASOL [?] only provide calibrated stereo images. We use raftstereo [?] to achieve pseudo ground-truth depths. Mapillary PSD [?] dataset provides paired RGB-D, but the depth maps are achieved from a structure-from-motion method. The camera intrinsic parameters are estimated from the SfM. We believe that such achieved metric information is noisy. Thus we do not enforce learning-metric-depth loss on this data, i.e., Lsilog , to reduce the effect of noises. For the Taskonomy [?] dataset, we follow LeReS [?] to obtain the instance planes, which are employed in the pair-wise normal regression loss. During training, we employ the training strategy from [?] to balance all datasets in each training batch.

The testing data is listed in Tab. 2. All of them are captured by high-quality sensors. In testing, we employ their provided camera intrinsic parameters to perform our proposed canonical space transformation.

Finetuning protocols. To finetune the large-scale-data trained models on some specific datasets, we use

the ADAM optimizer with the initial learning rate beginning at 10and linear decayed to 10within 6K steps. Notably, such finetune does not require a large batch size like large-scale training. We use in practice a batch-size of 16 for the ViT-g model and 32 for ViT-L. The models will converge quickly in approximately 2K steps. To stabilize finetuning, we also leverage the predictions ˜˜N of the prefinetuned model as alternative pseudo labels. These labels can sufficiently impose supervision upon the annotation-absent regions. The complete loss can be formulated as:

where and N are the predicted depth in the canonical space and surface normal, D∗c and are the groundtruth labels, Ld, Ln , and Ld−n are the losses for depth, normal, and depth-normal consistency introduced in the main text.

Evaluation of zero-shot 3D scene reconstruction. In this experiment, we use all methods’ released models to predict each frame’s depth and use the ground-truth poses and camera intrinsic parameters to reconstruct point clouds. When evaluating the reconstructed point cloud, we employ the iterative closest point (ICP) [?] algorithm to match the predicted point clouds with ground truth by a pose transformation matrix. Finally, we evaluate the Chamfer distance and F-score on the point cloud. Reconstruction of in-the-wild scenes. We collect several photos from Flickr. From their associated camera metadata, we can obtain the focal length ˆf and the pixel size . According to ˆf/δ , we can obtain the pixel-represented focal length for 3D reconstruction and achieve the metric information. We use meshlab software to measure some structures’ size on point clouds. More visual results are shown in Fig. 6.

Generalization of metric depth estimation. To evaluate our method’s robustness of metric recovery, we test on 7 zero-shot datasets, i.e. NYU, KITTI, DIODE (indoor and outdoor parts), ETH3D, iBims-1, and NuScenes. Details are reported in Tab. 2. We

Table 1: Comparative analysis of resource and performance across our model families includes evaluation of resource utilization metrics such as inference speed, memory usage, and the proportion of optimization modules. Additionally, we assess metric depth performance on KITTI/NYU datasets and normal performance on NYUv2 dataset, with results derived from checkpoints without fine-tuning. For ViT models, inference speed is measured using 16-bit precision (Bfloat16), which is the same precision as the training setup.

use the officially provided focal length to predict the metric depths. All benchmarks use the same depth model for evaluation. We don’t perform any scale alignment.

Evaluation on affine-invariant depth benchmarks. We follow existing affine-invariant depth estimation methods to evaluate 5 zero-shot datasets. Before evaluation, we employ the least square fitting to align the scale and shift with ground truth [?]. Previous methods’ performance is cited from their papers.

Dense-SLAM Mapping. This experiment is conducted on the KITTI odometry benchmark. We use our model to predict metric depths, and then naively input them to the Droid-SLAM system as an initial depth. We do not perform any finetuning but Table 2: Training and testing datasets used for experiments.

directly run their released codes on KITTI. With Droid-SLAM predicted poses, we unproject depths to the 3D point clouds and fuse them together to achieve dense metric mapping. More qualitative results are shown in Fig. 5.

1.3 More Visual Results

Qualitative comparison of depth and normal estimation. In Figs 2, 3, we compare visualized depth and normal maps from the Vit-g CSTM label model with ZoeDepth [?], Bae etal [?], and Omnidata [?]. In Figs. 4, 8, 9, and 10, We show the qualitative comparison of our depth maps from the ConvNeXt-L CSTM label model with Adabins [?], NewCRFs [?], and Omnidata [?]. Our results have much fine-grained details and less artifacts.

Reconstructing 360◦ NuScenes scenes. Current autonomous driving cars are equipped with several pin-hole cameras to capture 360views. Capturing the surround-view depth is important for autonomous driving. We sampled some scenes from the testing data of NuScenes. With our depth model, we can obtain the metric depths for 6-ring cameras. With the provided camera intrinsic and extrinsic parameters, we unproject the depths to the 3D point cloud and merge all views together. See Fig. 7 for details. Note that 6-ring cameras have different camera intrinsic parameters. We can observe that all views’ point clouds can be fused together consistently.

Figure 2: Depth and normal estimation. The visual comparison of predicted depth and normal on indoor/outdoor scenes from NYUv2, iBims, Eth3d, and ScanNet. Our depth and normal maps come from the ViT-g CSTM label model.

Figure 3: Depth and normal estimation. The visual comparison of predicted depth and normal on driving scenes from KITTI, Nuscenes, DIML, DDAD, and Waymo. Our depth and normal maps come from the ViT-g CSTM label model.

Figure 4: The visual comparison of predicted depth on iBims, ETH3D, and DIODE. Our depth maps come from the ConvNeXt-L CSTM label model.

Figure 5: Dense-SLAM Mapping. Existing SOTA mono-SLAM methods usually face scale drift problems in large-scale scenes and are unable to achieve the metric scale. We show the ground-truth trajectory and Droid-SLAM [?] predicted trajectory and their dense mapping. Then, we naively input our metric depth to Droid-SLAM, which can recover a much more accurate trajectory and perform the metric dense mapping.

Figure 6: 3D metric reconstruction of in-the-wild images. We collect several Flickr images and use our model to reconstruct the scene. The focal length information is collected from the photo’s metadata. From the reconstructed point cloud, we can measure some structures’ sizes. We can observe that sizes are in a reasonable range.

Figure 7: 3D reconstruction of 360◦ views. Current autonomous driving cars are equipped with several pin-hole cameras to capture 360views. With our model, we can reconstruct each view and smoothly fuse them together. We can see that all views can be well merged together without scale inconsistency problems. Testing data are from NuScenes. Note that the front view camera has a different focal length from other views.

Figure 8: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE. Our depth maps come from the ConvNeXt-L CSTM label model.

Figure 9: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE. Our depth maps come from the ConvNeXt-L CSTM label model.

Figure 10: Depth estimation. The visual comparison of predicted depth on iBims, ETH3D, and DIODE. Our depth maps come from the ConvNeXt-L CSTM label model.

Designed for Accessibility and to further Open Science