arXiv:2403.05056v1 [cs.CV] 08 Mar 2024
Harbin Institute of Technology

Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation

Yifan Mao    Jian Liu*    Xianming Liu
Equal contribution. Corresponding author.
Abstract

Monocular depth estimation is a crucial task in computer vision. While existing methods have shown impressive results under standard conditions, they often struggle to perform reliably in scenarios such as low-light or rainy conditions due to the absence of diverse training data. This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation. The approach addresses this limitation by utilizing stable diffusion to generate synthetic images that mimic challenging conditions. Additionally, a self-training mechanism is introduced to enhance the model’s depth estimation capability in such challenging environments. To further exploit the stable diffusion prior, the DINOv2 encoder is integrated into the depth model architecture, enabling the model to leverage rich semantic priors and improve its scene understanding. Furthermore, a teacher loss is introduced to guide the student models in acquiring meaningful knowledge independently, thus reducing their dependency on the teacher models. The effectiveness of the approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, where it achieves state-of-the-art results. Source code is available at: https://github.com/hitcslj/SSD.

Keywords:
Robust monocular depth estimation · Self-training · Stable Diffusion

1 Introduction

Monocular depth estimation (MDE) is crucial in computer vision, providing important cues for various downstream applications such as autonomous driving and robotics. However, obtaining accurate 3D depth information from a single image presents a geometrically ill-posed challenge. Traditional methods like stereo matching and structure from motion have demonstrated restricted performance in this regard [30, 14]. Since 2014, the emergence of deep learning has markedly enhanced depth estimation performance. Deep learning models can acquire rich prior knowledge from data, facilitating scene understanding and presenting promising solutions for depth estimation. Consequently, numerous methods for monocular depth estimation have emerged, encompassing supervised approaches [6, 5, 21, 1, 19, 42] and self-supervised methods [8, 46, 11, 12].

Although MDE methods perform well under standard conditions, such as sunny weather, they become less effective in challenging conditions like darkness and adverse weather. These limitations arise because crucial assumptions, such as photometric consistency and reliable ground truth, are invalidated in challenging scenarios. Additionally, current datasets lack sufficient samples capturing challenging scenarios, with a notable scarcity of dedicated datasets tailored to these conditions. Recent research has explored Robust Monocular Depth Estimation (RMDE) under challenging conditions, categorizing existing methods into two groups: model-based and data-based approaches. Model-based approaches [37, 34, 17, 45, 32] aim to enhance the model’s capability of handling challenging conditions by modifying its architecture. Conversely, data-based approaches achieve RMDE by enhancing the image signal [33, 44, 10, 29, 22] through techniques such as domain adaptation, or by utilizing data from other modalities [20, 9, 31].

Although previous model-based methods have achieved satisfactory results, they rely on complex pipelines tailored for specific challenging conditions. This approach constrains their ability to reason and adapt to diverse challenging conditions. On the other hand, methods based on data from other modalities face challenges in obtaining high-quality data and often require post-processing. Augmenting the image signal in data-based methods can mitigate some of these limitations by employing simpler model architectures. These methods utilize GANs as translation models to generate training samples; however, GANs often lack generalizability and demonstrate limited diversity in the generated samples. Furthermore, adapting the model to multiple domains necessitates training multiple GANs, leading to additional training costs.

Figure 1: Shortcomings of GAN. Compared to GAN, which suffers from issues like noise, fake rainy effects, and blurriness, our GDT can generate more diverse and realistic images.

The main objective of this study is to introduce a comprehensive paradigm for Robust Monocular Depth Estimation (RMDE) aimed at overcoming the earlier-mentioned limitations. We aim to leverage valuable prior knowledge from stable diffusion [28] as the cornerstone of our approach. Modern generative diffusion models [28] have undergone extensive training on large-scale datasets, empowering them to produce high-quality images. Despite their potential, the use of generative diffusion models for RMDE remains largely unexplored in current literature. Figure 1 visually illustrates the drawbacks of GAN-based translation models compared to our Generative Diffusion Model-based Translation (GDT) model. To address this gap, we employ a self-training approach that integrates the stable diffusion prior effectively. Our experiments confirm the feasibility and promising performance of this approach.

In our experiments, we found that the conventional ResNet architecture [15] was not fully leveraging the potential of the stable diffusion prior. To overcome this limitation, we aimed to introduce a more potent feature encoder. Inspired by the Depth Anything approach [39], we integrated DINOv2 into our encoder architecture to extract more effective and generic features from the samples, thereby enhancing the overall model performance. Furthermore, we noted the utility of semantic feature alignment in our work, prompting us to introduce semantic loss. Moreover, we introduced the teacher loss to improve the distillation process for depth estimation. This novel loss function facilitated the student model in acquiring the correct knowledge from the teacher model while preventing any erroneous knowledge transfer. Our contributions can be summarized as follows:

  • We are the first to introduce stable diffusion into RMDE and propose a general paradigm that leverages the diffusion prior for robust depth estimation.

  • We present a plug-and-play translation model based on generative diffusion models that can be readily applied in various scenarios.

  • Our method outperforms existing approaches on the nuScenes and RobotCar datasets, achieving SOTA performance.

2 Related Works

2.1 Monocular Depth Estimation

MDE methods can be categorized into two types: supervised and self-supervised. Eigen et al. [6] introduced a CNN-based architecture for depth estimation, laying the foundation for supervised MDE. Over time, supervised MDE methods have evolved into regression methods [5, 16] and classification methods [7, 1, 19]. In contrast, self-supervised MDE does not rely on costly depth ground truth. It generates the supervisory signal using stereo image pairs [8, 11] or adjacent frames in a video [46, 12]. These methods reconstruct the image based on the positional relationship between the cameras and the estimated depth, improving depth estimation accuracy by reducing the difference between the reconstructed image and the target image. However, both supervised and self-supervised methods are susceptible to the effects of darkness and adverse weather conditions, as depicted in Figure 2. Therefore, there is a need for RMDE methods capable of overcoming the poor performance of current approaches in challenging lighting and weather conditions.

Figure 2: Darkness and weather effects on sensors. The images above are from the nuScenes dataset [3]. In night-time photos, RGB images often exhibit noise, textureless regions, and blurriness, which are not conducive to self-supervised learning. Additionally, rainy weather can introduce blur and reflections, leading to sparse and unreliable LiDAR signals, which are not suitable for supervised learning.

2.2 Robust Monocular Depth Estimation

In recent years, significant progress has been made in Robust Monocular Depth Estimation (RMDE), with several methods demonstrating promising results. These methods can be broadly categorized into two groups: model-based and data-based approaches.

Model-based methods aim to enhance the model architecture to handle challenging conditions. DeFeat-Net [32] proposes a unified framework for learning robust monocular depth estimation and dense feature representation, specifically targeting improved performance under darkness. RNW [37] employs image enhancement techniques and an adversarial approach to enhance model performance in dark environments. WSGD [34] addresses darkness by estimating flow maps and modeling light changes between adjacent frames. MonoViT [45] utilizes Vision Transformer (ViT) [4] to extract image features, enabling the model to perform well under various weather conditions.

Data-based methods focus on leveraging additional modalities or augmenting the image signal through techniques such as domain adaptation. DEISR [20] utilizes sparse Radar data to enable depth estimation under adverse conditions. R4Dyn [9] demonstrates the benefits of weak supervision using sparse Radar data during training and the advantages of using Radar data as an additional input during inference for RMDE. DET [31] leverages thermal images to achieve depth estimation in darkness and adverse weather conditions. ADFA [33] employs adversarial training to enable the model to adapt to darkness. ADDS [22] utilizes different feature extractors to extract invariant and private features in different domains, enabling the extraction of universal image features and improving depth estimation. ITDFA [44] employs CycleGAN to translate images from daytime to other domains and extracts features from different domains to handle darkness and adverse weather. Robust-Depth [29] introduces bi-directional pseudo-supervision loss and pseudo-supervised pose loss to compensate for the performance degradation caused by the use of translated images. WeatherDepth [36] introduces contrastive learning based on domain adaptation methods. Md4all [10] achieves robust depth estimation by not distinguishing images in standard and challenging conditions, allowing the model to learn from the trained teacher network.

It is worth noting that the robustness of a model encompasses not only its ability to handle challenging conditions but also its generalization to other corruption types such as noise, blur, and digital artifacts. Several works [27, 41, 26, 2, 38] have explored the generalization of MDE models and strategies to improve their performance in various corruption types. In contrast, our work specifically focuses on enhancing the performance of MDE models in real-world weather corruptions and darkness.

2.3 Generative Diffusion Models

DDPM is a generative model, also known as a diffusion model, that achieves image generation by performing a diffusion process in the image space. The impressive generative power of diffusion models has led to the desire to incorporate control conditions into the generated images. In the field of text-based image generation, Rombach et al. proposed the latent diffusion model (LDM) [28], which performs the diffusion process in the latent space. They also utilize the cross-attention mechanism [35] to introduce conditions into the LDM. Their text-to-image model is now known as Stable Diffusion. To control the spatial structure of an image, Zhang et al. proposed ControlNet [43], which provides an additional control image, such as a depth map, semantic map, or Canny edge map, to govern the spatial structure of the generated image. To control the style of generated images, Ye et al. proposed IP-Adapter [40], an effective and lightweight neural network architecture aimed at achieving image prompt capability for pre-trained text-to-image diffusion models. A text-to-image diffusion model equipped with the IP-Adapter can generate images in the style of the images fed to the adapter. IP-Adapter achieves this through a decoupled cross-attention mechanism in which the cross-attention layers for text features and image features are separate. Our GDT model supports depth maps, text, and image prompts of night or rainy-day images as conditions, enabling the generation of images that satisfy all of these conditions.

3 Methodology

In this paper, we propose SSD, a novel approach that aims to steal the stable diffusion prior for RMDE. SSD incorporates a new translation model called GDT, which is based on generative diffusion models. To adapt GDT for RMDE, we integrate DINOv2 into our depth model’s architecture, which helps extract universal image features. In addition, we optimize the distillation loss used for knowledge distillation. Our approach is general and can be adapted to a variety of challenging conditions.

3.1 Preliminaries

In supervised monocular depth estimation, a DepthNet is trained using sensor data as ground truth. The prediction process can be represented as $D = \mathcal{D}(I)$, where $I$ is the input image, $D$ is the dense depth map of $I$, and $\mathcal{D}$ is the DepthNet. For self-supervised MDE, adjacent frames in videos are used to train the DepthNet. In addition to the DepthNet, a PoseNet is required to estimate the ego-motion between consecutive frames. For a frame $I_t$ in a video, the DepthNet is used to predict the corresponding depth map $D_t$, and the PoseNet is used to predict the camera ego-motion $P_{t\rightarrow t'}$. The process of predicting the camera ego-motion can be expressed as $P_{t\rightarrow t'} = \mathcal{P}(I_t, I_{t'})$, where $I_{t'}$ represents the adjacent frame of $I_t$ sampled from $\{I_{t-1}, I_{t+1}\}$, and $\mathcal{P}$ represents the PoseNet. The depth map and camera ego-motion are then used to synthesize the target image $I_t$. This process is described in Equation 1, where $K$ represents the camera intrinsics, $Proj(\cdot)$ is the function that outputs the 2D coordinates, and $\langle\cdot\rangle$ is the sampling operator.

$I_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t,\, P_{t\rightarrow t'},\, K)\right\rangle$   (1)
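For concreteness, the warping of Equation 1 can be sketched in PyTorch as follows, assuming a pinhole camera model and a 4×4 pose matrix; this is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_tgt, pose_tgt2src, K, K_inv):
    """Warp the adjacent frame I_t' into the target view using D_t and P_{t->t'} (Eq. 1).

    img_src:      (B, 3, H, W) adjacent frame I_t'
    depth_tgt:    (B, 1, H, W) predicted depth D_t of the target frame I_t
    pose_tgt2src: (B, 4, 4)    predicted ego-motion P_{t->t'}
    K, K_inv:     (B, 3, 3)    camera intrinsics and their inverse
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D, apply the relative pose, re-project (Proj(.) in Eq. 1).
    cam_pts = depth_tgt.reshape(B, 1, -1) * (K_inv @ pix)
    cam_pts = pose_tgt2src[:, :3, :3] @ cam_pts + pose_tgt2src[:, :3, 3:]
    proj = K @ cam_pts
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalise coordinates to [-1, 1] and sample I_t' (the <.> operator in Eq. 1).
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
```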

Finally, the photometric loss ($pe$) is calculated as defined in Equation 2, where $\alpha$ is set to 0.85. As in Monodepth2 [12], the per-pixel photometric loss $\mathcal{L}_p$ is defined as $\min_{t'} pe(I_t, I_{t'\rightarrow t})$.

$pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - SSIM(I_a, I_b)\bigr) + (1-\alpha)\,\lVert I_a - I_b \rVert$   (2)
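A matching sketch of the per-pixel photometric error $pe$ of Equation 2 is given below, with $\alpha = 0.85$; the SSIM variant based on 3×3 average pooling is an implementation assumption in the style of Monodepth2.

```python
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    """pe(I_a, I_b) of Eq. 2: weighted SSIM + L1 term, averaged over colour channels."""
    ssim_term = (1 - ssim(i_a, i_b)).mean(1, keepdim=True)
    l1_term = (i_a - i_b).abs().mean(1, keepdim=True)
    return alpha / 2 * ssim_term + (1 - alpha) * l1_term
```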

3.2 Generative Diffusion Model-based Translation

As discussed in Section 1, GAN-based translation models have exhibited certain limitations. In this paper, we present a novel translation model called GDT (Generative Diffusion Model-based Translation). The pipeline of GDT is depicted in Figure 3. The objective of GDT is to generate a training sample $I_g$ that closely resembles the day-clear image $I_d$ in terms of depth. To achieve this, we leverage the Stable Diffusion prior and introduce several control mechanisms to transform the day-clear image into challenging conditions while preserving specific characteristics.

First, we employ the BLIP-2 model [18] to obtain a scene description $T_{cap}$, which aids in preserving the content of the image. Second, we utilize the depth-to-image (d2i) model of ControlNet [43] to maintain approximate depth consistency. We have observed that the clearer the depth information, the more realistic the generated RGB image becomes; thus, we employ PatchFusion to enhance the depth estimation obtained from the MiDaS model. Third, we address the challenge of introducing diverse styles into the generated images. We combine $T_{cap}$ with challenging-condition descriptors. However, we encountered an issue where the generated challenging-condition images lacked realism. To overcome this, we introduce an image prompt for challenging conditions. In our implementation, we randomly select night or rain images from the training dataset as image prompts. We utilize the IP-Adapter model [40] to incorporate both text prompts and image prompts as inputs for image generation. The text prompt assists in preserving the content, while the image prompt facilitates style transfer.

Figure 3: The GDT pipeline incorporates multiple large models and PatchFusion to generate high-quality training samples.

In comparison to previous GAN-based methods, our GDT approach generates images with greater diversity and does not necessitate training for each specific scene. It can be considered a plug-and-play module. As more powerful generative diffusion models continue to emerge, our GDT method will become even more robust. The generation of the image $I_g$ can be expressed as follows:

$I_g = SD\bigl(IP(T_p, I_p),\, CN(D_h),\, z\bigr)$   (3)

Here, $SD$ represents the Stable Diffusion prior, $IP$ denotes the IP-Adapter model that combines the text prompt $T_p$ and image prompt $I_p$, $CN$ corresponds to the ControlNet model for depth consistency (utilizing $D_h$ as input), and $z$ represents the latent noise variable.
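The following sketch shows how such a generation pipeline could be assembled from off-the-shelf components (BLIP-2 for $T_{cap}$, Stable Diffusion v1.5 with a depth ControlNet for $CN(D_h)$, and an IP-Adapter for $IP(T_p, I_p)$) using the Hugging Face transformers and diffusers libraries. The model identifiers, prompt wording, and adapter scale below are illustrative assumptions, and the MiDaS + PatchFusion step that produces the depth condition $D_h$ is omitted.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

device = "cuda"

# 1) Caption the day-clear image I_d with BLIP-2 to obtain the content prompt T_cap.
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device, torch.float16)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.batch_decode(out, skip_special_tokens=True)[0].strip()

# 2) Stable Diffusion v1.5 + depth ControlNet (d2i) + IP-Adapter for the style prompt I_p.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to(device)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # illustrative style strength

def gdt_translate(day_img, depth_cond, style_img, condition="driving at night, rainy street"):
    """I_g = SD(IP(T_p, I_p), CN(D_h), z): the depth condition keeps the geometry,
    the text prompt keeps the content, and the image prompt transfers the style."""
    text_prompt = f"{caption(day_img)}, {condition}"
    return pipe(prompt=text_prompt,
                image=depth_cond,            # D_h rendered as a depth image for ControlNet
                ip_adapter_image=style_img,  # I_p: a night or rain image from the training set
                num_inference_steps=30).images[0]
```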

Figure 4: SSD framework for robust depth estimation. The Student Net receives guidance from the Teacher Net, leveraging a stable diffusion prior. The semantic loss ensures semantic consistency, while the teacher loss enables the Student Net to learn beyond the capabilities of the Teacher Net.

3.3 Stealing Stable Diffusion Prior

In order to steal a stable diffusion prior for robust depth estimation, we have made modifications to the architecture of Monodepth2 [12]. Additionally, we have employed a self-training strategy and introduced additional loss functions. Figure 4 provides a brief illustration of our SSD pipeline.

3.3.1 Network Architecture

The fundamental difference between generative diffusion model-based translation and GAN-based translation lies in their capacity to generate data. Both approaches must adhere to the assumption that depth remains constant before and after image translation. As a result, the translation process of a GAN is more akin to style transfer, where the images before and after translation essentially belong to the same dataset.

In contrast, generative diffusion model-based translation, leveraging its powerful generative capabilities, guarantees that the images before and after translation have the same depth while allowing significant changes in content. This process can be regarded as a form of data scale-up. To benefit from such data scale-up, a simple encoder alone cannot extract image features rich enough to further enhance model performance. Therefore, we modify the depth model based on Monodepth2 [12]. We move away from the original ResNet-based architecture [15] and introduce DINOv2 [25] as the encoder in our depth model. This choice of encoder facilitates the extraction of robust image features. Additionally, we utilize pre-trained weights, enabling the depth model to inherit rich semantic information and scene understanding capabilities. Furthermore, we employ the DPT depth head [26] as the decoder to regress depth. As for the PoseNet, we retain the original ResNet-based network architecture without any changes.
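As an illustration, a depth network built around a pre-trained DINOv2 encoder could be sketched as follows; the lightweight convolutional decoder here only stands in for the DPT head [26] used in our actual model, and the input resolution must be a multiple of the ViT patch size (14).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DinoDepthNet(nn.Module):
    """Sketch of a DepthNet with a DINOv2 ViT-B/14 encoder and a toy decoder."""

    def __init__(self, freeze_encoder: bool = False):
        super().__init__()
        # Pre-trained DINOv2 backbone provides rich semantic features.
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad_(False)
        self.decoder = nn.Sequential(
            nn.Conv2d(768, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())  # normalised disparity

    def forward(self, x):
        # Patch tokens reshaped to a (B, 768, H/14, W/14) feature map.
        feats = self.encoder.get_intermediate_layers(x, n=1, reshape=True)[0]
        disp = self.decoder(feats)
        # Upsample back to the input resolution (e.g. 784x518 in our setting).
        return F.interpolate(disp, size=x.shape[-2:], mode="bilinear", align_corners=False)
```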

3.3.2 Self-Training Strategy

Our self-training strategy follows the approach introduced in [10], which has proven to be simple and effective in improving the performance of depth models. We adopt this strategy, which involves training a teacher network on day-clear samples and subsequently training a student network.

The first step is to train the teacher network using day-clear samples. This ensures that the teacher network produces reliable results specifically for day-clear scenarios. For instance, in the case of self-supervised monocular depth estimation, training the depth model using night samples can be challenging due to the lack of texture and the inability to utilize correspondences between adjacent frames. Photometric loss may fail in such scenarios.

Next, we train the student network to perceive depth information across a variety of scenarios. To achieve this, we train the student network using a diverse range of training samples. A day-clear training sample $I_d$ can potentially be translated into a challenging-condition training sample $I_g$. Here, we make the basic assumption that the depth between $I_d$ and $I_g$ remains essentially consistent. A sample is kept unchanged with probability $P$ and translated into $I_g$ with probability $1 - P$, where $P = |C|/(|C|+1)$ and $|C|$ is the number of challenging conditions of interest $C$. We leverage the teacher model to generate the pseudo ground truth $D_t$ for $I_d$. This pseudo ground truth $D_t$ also serves as guidance for the student model when processing the GDT-translated $I_g$.

We then input a mixture of $I_d$ and $I_g$ into the student model and generate the predicted depth map $D_s$. By computing the distillation loss using $D_t$ and $D_s$, our model becomes capable of estimating depth not only under day-clear conditions but also under challenging conditions. This self-training strategy enables the student network to learn from both the teacher network and the GDT-translated samples, improving its ability to estimate depth across a diverse range of scenarios.
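The mixing and pseudo-labelling step described above can be sketched as follows; the helper names and data layout are hypothetical, and the translated images are assumed to have been pre-generated by GDT.

```python
import random
import torch

def build_student_batch(day_imgs, translated, teacher_depthnet,
                        conditions=("night", "day-rain")):
    """One self-training step: mix I_d / I_g inputs and produce pseudo depth D_t.

    day_imgs:   list of day-clear tensors I_d, each (3, H, W)
    translated: dict mapping condition -> list of GDT outputs I_g aligned with day_imgs
    """
    keep_p = len(conditions) / (len(conditions) + 1)  # P = |C| / (|C| + 1)
    inputs = []
    for i, img in enumerate(day_imgs):
        if random.random() < keep_p:
            inputs.append(img)                    # keep the day-clear sample
        else:
            cond = random.choice(conditions)      # swap in a challenging-condition sample
            inputs.append(translated[cond][i])
    inputs = torch.stack(inputs)

    # The pseudo ground truth is always predicted from the day-clear image by the teacher,
    # and guides the student regardless of whether its input was translated.
    with torch.no_grad():
        pseudo_depth = teacher_depthnet(torch.stack(day_imgs))
    return inputs, pseudo_depth
```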

3.3.3 Loss computation

We ultimately employ two loss functions: the teacher loss $\mathcal{L}_t$ and the semantic alignment loss $\mathcal{L}_s$. The teacher loss addresses the limitations of the naive distillation loss, while the semantic alignment loss draws inspiration from [39] and aims to better utilize the semantic similarity between $I_d$ and $I_g$. $\mathcal{L}_s$ is suitable for both supervised and self-supervised learning, whereas $\mathcal{L}_t$ is only applicable to self-supervised learning and is replaced by the supervised loss in the supervised setting.

In self-supervised MDE, the teacher model struggles to accurately estimate the depth of each pixel, while the student model may outperform the teacher model in specific pixels. Although the teacher model initially provides guidance to the student model during training, as training progresses, the student model becomes more capable of independently estimating depth. In some cases, the student model’s estimates may be more accurate than those of the teacher model. However, due to the limitations of the distillation loss used, the student model relies solely on the teacher model’s estimates, hindering further improvement in its performance. On this basis, we propose the teacher loss. The core idea of teacher loss is to mask pixels corresponding to unreasonable depths estimated by the teacher model when calculating distillation loss.

For a day-clear frame $I_t$, we use the DepthNets of the teacher model and the student model to generate the corresponding dense depth maps $D_t^t$ and $D_t^s$. Although the input of the student model may be translated into $I_t^g$, we use the adjacent frames $I_{t'}$ to reconstruct the image $I_t$. Then, we use the PoseNet of the teacher model to estimate the camera ego-motion $P^t_{t\rightarrow t'}$. We obtain $I^s_{t'\rightarrow t}$ and $I^t_{t'\rightarrow t}$ from $D_t^s$, $D_t^t$ and $P^t_{t\rightarrow t'}$ as shown in Equation 4.

$I^s_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t^s,\, P^t_{t\rightarrow t'},\, K)\right\rangle$, \quad $I^t_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t^t,\, P^t_{t\rightarrow t'},\, K)\right\rangle$   (4)

For self-supervised MDE, the lower the photometric loss, the more reasonable the estimated depth. If $D_t^t$ is more reasonable than $D_t^s$, then the photometric loss computed with $D_t^t$ (T-$pe$) is lower than that computed with $D_t^s$ (S-$pe$). Thus, we design the mask $M$ to select the pixels whose S-$pe$ is higher than their T-$pe$, i.e., the pixels where the teacher's depth is still the more reliable one, as shown in Equation 5, where $[\,\cdot\,]$ denotes the Iverson bracket.

$M = \left[\, \min_{t'} pe\bigl(I_t, I^s_{t'\rightarrow t}\bigr) > \min_{t'} pe\bigl(I_t, I^t_{t'\rightarrow t}\bigr) \,\right]$   (5)

We use the mask $M$ to filter out the pixels for which the teacher's depth is unreasonable. Our teacher loss is given in Equation 6, where $\mathcal{L}_d$ represents the day distillation loss defined in md4all-DD and $\odot$ represents the pixel-wise product.

$\mathcal{L}_t = M \odot \mathcal{L}_d$   (6)
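The masked distillation of Equations 5 and 6 could be computed roughly as in the sketch below, reusing the `inverse_warp` and `photometric_error` sketches from Section 3.1; `distill_fn` stands in for the per-pixel day distillation loss $\mathcal{L}_d$ of md4all-DD.

```python
import torch

def teacher_loss(i_t, i_prev, i_next, d_student, d_teacher,
                 pose_prev, pose_next, K, K_inv, distill_fn):
    """L_t = M ⊙ L_d (Eq. 6), with M the Iverson-bracket mask of Eq. 5."""

    def min_pe(depth):
        # min over adjacent frames t' of pe(I_t, I_{t'->t}) for the given depth map.
        pes = [photometric_error(i_t, inverse_warp(i_src, depth, pose, K, K_inv))
               for i_src, pose in ((i_prev, pose_prev), (i_next, pose_next))]
        return torch.min(torch.stack(pes, dim=0), dim=0).values

    with torch.no_grad():
        # Keep only pixels where the teacher's depth is still the more
        # photometrically consistent one (S-pe > T-pe).
        mask = (min_pe(d_student) > min_pe(d_teacher)).float()
    return (mask * distill_fn(d_student, d_teacher)).mean()
```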

Besides the teacher loss, we also use the semantic loss proposed in [39]. As the images translated by our GDT retain semantic information similar to the original day-clear image $I_d$, we use the pre-trained DINOv2 to produce auxiliary image features for semantic alignment, as given in Equation 7, where $f_i$ and $f_{i'}$ denote the depth model's encoder feature and the frozen pre-trained DINOv2 feature at pixel $i$, respectively.

$\mathcal{L}_s = 1 - \frac{1}{HW}\sum_{i=1}^{HW} \cos(f_i, f_{i'})$   (7)
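A direct transcription of Equation 7 might look as follows, assuming the student encoder features and the frozen DINOv2 features have already been resampled to a common spatial resolution.

```python
import torch.nn.functional as F

def semantic_loss(student_feats, dino_feats):
    """L_s of Eq. 7: one minus the mean cosine similarity between feature maps.

    student_feats: (B, C, H, W) features from the depth model's encoder
    dino_feats:    (B, C, H, W) features from the frozen pre-trained DINOv2
    """
    cos = F.cosine_similarity(student_feats, dino_feats.detach(), dim=1)  # (B, H, W)
    return (1.0 - cos).mean()
```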

4 Experiments and Results

4.1 Experimental Setup

4.1.1 Datasets and Metrics

In accordance with md4all [10], we choose nuScenes [3] and Oxford RobotCar [24] as the datasets for our study. NuScenes is a comprehensive autonomous driving dataset offering abundant sensor data and precise annotations. It encompasses around 1,000 urban driving scenes covering diverse weather conditions, traffic scenarios, and road types. For the RMDE task, we partition the dataset based on visibility. In total, the dataset contains over 34,000 samples, with 28,130 allocated for training and 6,019 for validation. However, we use only 15,120 day-clear samples for training, and the 6,019 validation samples are further divided based on visibility (i.e., day-clear, night, day-rain). The depth range for testing is 0.1-80 meters. RobotCar is a dataset collected in Oxford, UK. In our study, we exclusively use day and night scenes, excluding rainy scenarios. The dataset comprises 16,563 day training samples and 1,411 test samples (702 day and 709 night). The depth range for testing is 0.1-50 meters.

In our experiments, we employ commonly adopted metrics for depth estimation, namely absRel, RMSE, $\delta_1$, and sqRel. Specifically, we use the first three metrics for evaluating the model’s performance on nuScenes, while all four metrics are used for assessing the model’s performance on RobotCar.
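For reference, these metrics can be computed as in the sketch below; the valid-pixel selection and clamping range are assumptions, and test-time median scaling via LiDAR (used for the methods marked with * in the tables) is omitted.

```python
import torch

def depth_metrics(pred, gt, min_depth=0.1, max_depth=80.0):
    """absRel, sqRel, RMSE and delta_1 over valid ground-truth pixels."""
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid].clamp(min_depth, max_depth), gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    delta_1 = (torch.max(pred / gt, gt / pred) < 1.25).float().mean()
    return {"absRel": abs_rel.item(), "sqRel": sq_rel.item(),
            "RMSE": rmse.item(), "delta1": 100.0 * delta_1.item()}
```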

4.1.2 Implementation Details

Both our teacher and student depth models utilize the architecture described in Section 3.3. In our experiments, we employ the DINOv2 encoder based on ViT-Base as the encoder for our depth model, and we use ResNet-18 as the pose-net encoder. Both the student and teacher networks were trained for 20 epochs. The image input size for the pose encoder is 576×320, while the DINOv2 encoder has an input size of 784×518 (the output is resized to 576×320 for alignment). During both stages of training, the learning rate for our depth model, which includes the DINOv2 encoder and the DPT head, was set to 5e-6. The learning rate for the teacher’s pose net was set to 5e-5 (the student’s pose net is dropped). We used the AdamW optimizer with a linear schedule to decay the learning rate, and a batch size of 4. The parameters of the DINOv2 encoder were initialized with pre-trained weights. For self-training, the DINOv2 encoder was guided by the semantic loss using the pre-trained weights, not the teacher’s DINOv2 encoder. For GDT, we used pre-trained models, including MiDaS, PatchFusion, BLIP-2, Stable Diffusion v1.5, ControlNet v1.1, and IP-Adapter, without further training. Additionally, we chose the p49 mode for PatchFusion. Since the night images generated by GDT are brighter than typical night images, we adjusted their brightness to simulate more realistic outdoor night scenes. All experiments were conducted on a single 24GB RTX 4090 GPU.
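The reported optimizer settings could be wired up roughly as follows; `depth_net` and `pose_net` are placeholders for the modules described above.

```python
import torch

def build_optimizer(depth_net: torch.nn.Module, pose_net: torch.nn.Module,
                    num_epochs: int = 20):
    """AdamW with per-module learning rates (5e-6 for the DINOv2 encoder + DPT head,
    5e-5 for the teacher's PoseNet) and a linear decay over the training epochs."""
    optimizer = torch.optim.AdamW([
        {"params": depth_net.parameters(), "lr": 5e-6},
        {"params": pose_net.parameters(), "lr": 5e-5},
    ])
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: 1.0 - epoch / num_epochs)
    return optimizer, scheduler
```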

4.2 Quantitative Results

Method | sup. | tr.data | day-clear (absRel↓ / RMSE↓ / δ1↑) | night (absRel↓ / RMSE↓ / δ1↑) | day-rain (absRel↓ / RMSE↓ / δ1↑)
(a) Monodepth2 [12] | M* | a | 0.1477 / 6.771 / 85.25 | 2.3332 / 32.940 / 10.54 | 0.4114 / 9.442 / 60.58
(b) PackNet-SfM [13] | Mv | d | 0.1567 / 7.230 / 82.64 | 0.2617 / 11.063 / 56.64 | 0.1645 / 8.288 / 77.07
(c) RNW [37] | M* | dn | 0.2872 / 9.185 / 56.21 | 0.3333 / 10.098 / 43.72 | 0.2952 / 9.341 / 57.21
(d) md4all-baseline | Mv | d | 0.1333 / 6.459 / 85.88 | 0.2419 / 10.922 / 58.17 | 0.1572 / 7.453 / 79.49
(e) md4all-AD [10] | Mv | dT(nr) | 0.1523 / 6.853 / 83.11 | 0.2187 / 9.003 / 68.84 | 0.1601 / 7.832 / 78.97
(f) md4all-DD [10] | Mv | dT(nr) | 0.1366 / 6.452 / 84.61 | 0.1921 / 8.507 / 71.07 | 0.1414 / 7.228 / 80.98
(g) SSD-T | M*v | d | 0.1223 / 6.132 / 87.39 | 0.2103 / 8.024 / 67.70 | 0.1718 / 6.988 / 78.87
(h) SSD-S | M*vst | dT(nr) | 0.1217 / 5.982 / 87.13 | 0.1939 / 8.038 / 72.31 | 0.1410 / 6.474 / 82.87
Table 1: Evaluation of self-supervised methods on the nuScenes [3] validation set. SSD-T: teacher model, SSD-S: student model. Supervision (sup.): M: via monocular videos, *: test-time median-scaling via LiDAR, v: weak velocity loss, s: semantic loss, t: teacher loss. Training data (tr.data): d: day-clear, T(): translated into, n: night, r: day-rain, a: all. In the original table, the best result per column is shown in bold and the second best is underlined.

The nuScenes dataset [3] is divided into three scenarios: day-clear, night, and day-rain. Following the setup proposed in [10], we evaluated the performance of our model in each of these scenarios. The quantitative results are presented in Table 1.

Due to limitations imposed by the model architecture (which requires 3D convolutions) and the number of samples in the dataset, (b) PackNet-SfM's performance shows a slight improvement for night and rain scenes but is greatly reduced for day-clear scenes. The weak velocity loss in (b) PackNet-SfM contributes to obtaining metric depth. Although (c) RNW employs a unique image enhancement method and adversarial training, it performs worse than our baseline. On the other hand, (e) md4all-AD and (f) md4all-DD are simple and efficient methods that enhance the model's robustness. Building upon the md4all approach, we improve the model architecture and the translation model. Comparing (d) and (g), our model architecture (DINOv2 encoder with DPT head) demonstrates significant superiority over Monodepth2. Additionally, thanks to the pre-trained weights of the DINOv2 encoder, our depth model can extract more robust features. Comparing (g) and (h), our student model outperforms the teacher model: in the night scenario, absRel is reduced by 7.8%, and in the day-rain scenario, absRel is reduced by 17.9%. This demonstrates that the student model indeed benefits from the GDT-translated images and effectively incorporates the stable diffusion prior.

Method | sup. | tr.data | day – RobotCar (absRel↓ / sqRel↓ / RMSE↓ / δ1↑) | night – RobotCar (absRel↓ / sqRel↓ / RMSE↓ / δ1↑)
(a) Monodepth2 [12] | M* | d | 0.1196 / 0.670 / 3.164 / 86.38 | 0.3029 / 1.724 / 5.038 / 45.88
(b) DeFeat-Net [32] | M* | a | 0.2470 / 2.980 / 7.884 / 65.00 | 0.3340 / 4.589 / 8.606 / 58.60
(c) ADIDS [23] | M* | a | 0.2390 / 2.089 / 6.743 / 61.40 | 0.2870 / 2.569 / 7.985 / 49.00
(d) RNW [37] | M* | a | 0.2970 / 2.608 / 7.996 / 43.10 | 0.1850 / 1.710 / 6.549 / 73.30
(e) WSGD [34] | M* | a | 0.1760 / 1.603 / 6.036 / 75.00 | 0.1740 / 1.637 / 6.302 / 75.40
(f) md4all-DD [10] | Mv | dT(n) | 0.1128 / 0.648 / 3.206 / 87.13 | 0.1219 / 0.784 / 3.604 / 84.86
(g) SSD-T | Mv | dT(n) | 0.1069 / 0.762 / 3.241 / 89.38 | 0.1137 / 0.685 / 3.460 / 86.14
(h) SSD-S | Mv | dT(n) | 0.1058 / 0.718 / 3.203 / 89.09 | 0.1143 / 0.823 / 3.646 / 87.36
Table 2: Evaluation of self-supervised works on the RobotCar [24] test set. Notation as in Table 1.

The RobotCar dataset [24] is divided into day and night scenarios. (a), (f), and (g) demonstrate the strong generalization capability of our model architecture. Despite not being trained on night data, our model performs well and even outperforms md4all-DD. Furthermore, (g) and (h) illustrate that, despite already achieving high performance, our GDT provides a prior that enables the student model to surpass the teacher model. This leads to further improvements in depth estimation accuracy for both day and night scenarios.

4.3 Qualitative Results

Figure 5: Comparison of samples from the nuScenes dataset [3] among monodepth2 [12], md4all-DD [10], and our self-supervised teacher model SSD-T, as well as the student model SSD-S.
Figure 6: Comparison of samples from the RobotCar dataset [24] among monodepth2 [12], md4all-DD [10], and our self-supervised teacher model SSD-T, as well as the student model SSD-S.

The qualitative results depicted in Figure 5 and Figure 6 showcase the depth estimation performance of our SSD model on the nuScenes [3] and RobotCar [24] datasets. On nuScenes, our SSD model demonstrates enhanced performance in extracting depth information from regions with low illumination, such as night scenes. Moreover, our approach effectively recovers depth information even in rainy scenes. Figure 5 highlights the advantages of our SSD-T (teacher) and SSD-S (student) models over monodepth2 and md4all-DD in night scenes: our models accurately estimate the depth of cars and pedestrians, whereas the other methods fail to do so. However, in rainy scenes, SSD-T's performance is compromised precisely because of the strong perceptive ability of the DINOv2 architecture: SSD-T mistakenly perceives reflections in water as objects, leading to incorrect depth estimations. Nonetheless, SSD-S, which incorporates the semantic loss and the teacher loss, achieves excellent results. Figure 6 displays the results obtained in both day-clear and night scenes. In the day-clear scene, both SSD-T and SSD-S capture the edge of the wall in their depth maps, whereas monodepth2 and md4all-DD fail to do so. In the night scene, monodepth2 produces unreliable depth maps, whereas our SSD-T and SSD-S accurately estimate the depth of the bicycle on the right side.

4.4 Ablation Study

We conducted ablation experiments on the nuScenes validation set to validate the effectiveness of our proposed approach. The results are presented in Table 3. In the ablation study, our metrics represent the average performance of the model across all samples in the validation set. In Method 1, we changed the original translation model of md4all to our GDT, without incorporating PatchFusion. The performance of the model improved as GDT can produce detailed images that preserve fine depth information. In Method 2, we introduced PatchFusion, which enhanced the model’s performance by generating high-resolution depth maps. These maps contribute to the preservation of fine depth details and aid GDT in producing more accurate images. During knowledge distillation, the teacher model may produce unreasonable depth estimations. In Method 3, we replaced the distillation loss with the teacher loss, resolving this issue and further improving the model’s performance. Considering the powerful generative capability of SSD, it is necessary to have a more robust backbone to produce reliable image features. DINOv2, known for its ability to generate robust image features, is suitable for our task. In Method 4, after adopting DINOv2 as our encoder, the performance of the depth model improved significantly. In Method 5, we employed semantic loss to align the image features, resulting in a further improvement in model performance.

method | backbone | translation model | loss | absRel↓ | RMSE↓ | δ1↑
0 | ResNet-18 | GAN | distillation | 0.1513 | 6.761 | 81.51
1 | ResNet-18 | GDT (MiDaS) | distillation | 0.1529 | 6.887 | 81.03
2 | ResNet-18 | GDT (MiDaS + PatchFusion) | distillation | 0.1497 | 6.806 | 81.16
3 | ResNet-18 | GDT (MiDaS + PatchFusion) | teacher | 0.1485 | 6.801 | 81.30
4 | DINOv2 | GDT (MiDaS + PatchFusion) | teacher | 0.1360 | 6.333 | 84.07
5 | DINOv2 | GDT (MiDaS + PatchFusion) | teacher + semantic | 0.1320 | 6.266 | 84.96
Table 3: Ablation study on the nuScenes dataset. We compare the effects of different training settings (backbone, translation model, and loss) on the depth model; metrics are averaged over all validation samples.

5 Conclusion

This paper introduces SSD, a practical solution for robust monocular depth estimation. We underscore the significance of the generative diffusion model prior in generating challenging samples. We have successfully integrated the stable diffusion prior into our depth estimation model using a self-training approach. With the emergence of stronger generative diffusion models and more accurate control models, along with advancements in depth model capabilities, the potential for robust depth estimation can be further enhanced. Furthermore, the SSD paradigm can extend to other dense prediction tasks, such as robust semantic segmentation and 3D occupancy prediction, offering promising avenues for future research and applications.

References

  • [1] Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4009–4018 (June 2021)
  • [2] Birkl, R., Wofk, D., Müller, M.: MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)
  • [3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020)
  • [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [5] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)
  • [6] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 27. Curran Associates, Inc. (2014), https://proceedings.neurips.cc/paper_files/paper/2014/file/7bccfde7714a1ebadf06c5f4cea752c1-Paper.pdf
  • [7] Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2002–2011 (2018)
  • [8] Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 740–756. Springer International Publishing, Cham (2016)
  • [9] Gasperini, S., Koch, P., Dallabetta, V., Navab, N., Busam, B., Tombari, F.: R4dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes. In: 2021 International Conference on 3D Vision (3DV). pp. 751–760 (2021). https://doi.org/10.1109/3DV53792.2021.00084
  • [10] Gasperini, S., Morbitzer, N., Jung, H., Navab, N., Tombari, F.: Robust monocular depth estimation under challenging conditions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8177–8186 (October 2023)
  • [11] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [12] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
  • [13] Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2485–2494 (2020)
  • [14] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)
  • [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [16] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth international conference on 3D vision (3DV). pp. 239–248. IEEE (2016)
  • [17] Li, J., Wang, Y., Huang, Z., Zheng, J., Xian, K., Cao, Z., Zhang, J.: Diffusion-augmented depth prediction with sparse annotations. In: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23). pp. 2865–2876. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3581783.3611807
  • [18] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [19] Li, Z., Wang, X., Liu, X., Jiang, J.: Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987 (2022)
  • [20] Lin, J.T., Dai, D., Gool, L.V.: Depth estimation from monocular images and sparse radar data. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 10233–10240 (2020). https://doi.org/10.1109/IROS45743.2020.9340998
  • [21] Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 2024–2039 (2016). https://doi.org/10.1109/TPAMI.2015.2505283
  • [22] Liu, L., Song, X., Wang, M., Liu, Y., Zhang, L.: Self-supervised monocular depth estimation for all day images using domain separation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12737–12746 (2021)
  • [23] Liu, L., Song, X., Wang, M., Liu, Y., Zhang, L.: Self-supervised monocular depth estimation for all day images using domain separation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12737–12746 (October 2021)
  • [24] Maddern, W., Pascoe, G., Linegar, C., Newman, P.: 1 Year, 1000km: The Oxford RobotCar Dataset. The International Journal of Robotics Research (IJRR) 36(1), 3–15 (2017). https://doi.org/10.1177/0278364916679498, http://dx.doi.org/10.1177/0278364916679498
  • [25] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without supervision (2023)
  • [26] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)
  • [27] Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence 44(3), 1623–1637 (2020)
  • [28] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)
  • [29] Saunders, K., Vogiatzis, G., Manso, L.J.: Self-supervised monocular depth estimation: Let’s talk about the weather. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8907–8917 (October 2023)
  • [30] Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47, 7–42 (2002)
  • [31] Shin, U., Park, J., Kweon, I.S.: Deep depth estimation from thermal image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1043–1053 (June 2023)
  • [32] Spencer, J., Bowden, R., Hadfield, S.: Defeat-net: General monocular depth via simultaneous unsupervised representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14402–14413 (2020)
  • [33] Vankadari, M., Garg, S., Majumder, A., Kumar, S., Behera, A.: Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 443–459. Springer International Publishing, Cham (2020)
  • [34] Vankadari, M., Golodetz, S., Garg, S., Shin, S., Markham, A., Trigoni, N.: When the sun goes down: Repairing photometric losses for all-day depth estimation. In: Liu, K., Kulic, D., Ichnowski, J. (eds.) Proceedings of The 6th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 205, pp. 1992–2003. PMLR (14–18 Dec 2023), https://proceedings.mlr.press/v205/vankadari23a.html
  • [35] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [36] Wang, J., Lin, C., Nie, L., Huang, S., Zhao, Y., Pan, X., Ai, R.: Weatherdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. arXiv preprint arXiv:2310.05556 (2023)
  • [37] Wang, K., Zhang, Z., Yan, Z., Li, X., Xu, B., Li, J., Yang, J.: Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16055–16064 (October 2021)
  • [38] Xian, K., Cao, Z., Shen, C., Lin, G.: Towards robust monocular depth estimation: A new baseline and benchmark. International Journal of Computer Vision pp. 1–19 (2024)
  • [39] Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. arXiv:2401.10891 (2024)
  • [40] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models (2023)
  • [41] Yin, W., Liu, Y., Shen, C.: Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10), 7282–7295 (2021)
  • [42] Yuan, W., Gu, X., Dai, Z., Zhu, S., Tan, P.: Newcrfs: Neural window fully-connected crfs for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)
  • [43] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3836–3847 (October 2023)
  • [44] Zhao, C., Tang, Y., Sun, Q.: Unsupervised monocular depth estimation in highly complex environments. IEEE Transactions on Emerging Topics in Computational Intelligence 6(5), 1237–1246 (2022). https://doi.org/10.1109/TETCI.2022.3182360
  • [45] Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S.: Monovit: Self-supervised monocular depth estimation with a vision transformer. In: 2022 International Conference on 3D Vision (3DV). pp. 668–678. IEEE (2022)
  • [46] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)