arXiv:2403.05056v1 [cs.CV] 08 Mar 2024
Harbin Institute of Technology

Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation

Yifan Mao    Jian Liu*    Xianming Liu
Equal contribution. Corresponding author.
Abstract

Monocular depth estimation is a crucial task in computer vision. While existing methods have shown impressive results under standard conditions, they often struggle to perform reliably in scenarios such as low-light or rainy conditions due to the absence of diverse training data. This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation. The approach addresses this limitation by utilizing stable diffusion to generate synthetic images that mimic challenging conditions. Additionally, a self-training mechanism is introduced to enhance the model’s depth estimation capability in such challenging environments. To further exploit the stable diffusion prior, the DINOv2 encoder is integrated into the depth model architecture, enabling the model to leverage rich semantic priors and improve its scene understanding. Furthermore, a teacher loss is introduced to guide the student models in acquiring meaningful knowledge independently, thus reducing their dependency on the teacher models. The effectiveness of the approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, where it achieves state-of-the-art results. Source code is available at: https://github.com/hitcslj/SSD.

Keywords:
Robust monocular depth estimation · Self-training · Stable Diffusion

1 Introduction

Monocular depth estimation (MDE) is crucial in computer vision, providing important cues for various downstream applications such as autonomous driving and robotics. However, obtaining accurate 3D depth information from a single image presents a geometrically ill-posed challenge. Traditional methods like stereo matching and structure from motion have demonstrated restricted performance in this regard [30, 14]. Since 2014, the emergence of deep learning has markedly enhanced depth estimation performance. Deep learning models can acquire rich prior knowledge from data, facilitating scene understanding and presenting promising solutions for depth estimation. Consequently, numerous methods for monocular depth estimation have emerged, encompassing supervised approaches [6, 5, 21, 1, 19, 42] and self-supervised methods [8, 46, 11, 12].

Although MDE methods perform well under standard conditions, such as sunny weather, they become less effective in challenging conditions like darkness and adverse weather. These limitations arise because crucial assumptions, such as photometric consistency and reliable ground truth, are invalidated in challenging scenarios. Additionally, current datasets lack sufficient samples capturing challenging scenarios, with a notable scarcity of dedicated datasets tailored to these conditions. Recent research has explored Robust Monocular Depth Estimation (RMDE) under challenging conditions, categorizing existing methods into two groups: model-based and data-based approaches. Model-based approaches [37, 34, 17, 45, 32] aim to enhance the model’s capability of handling challenging conditions by modifying its architecture. Conversely, data-based approaches achieve RMDE by enhancing the image signal [33, 44, 10, 29, 22] through techniques such as domain adaptation, or by utilizing data from other modalities [20, 9, 31].

Although previous model-based methods have achieved satisfactory results, they rely on complex pipelines tailored for specific challenging conditions. This approach constrains their ability to reason and adapt to diverse challenging conditions. On the other hand, methods based on data from other modalities face challenges in obtaining high-quality data and often require post-processing. Augmenting the image signal in data-based methods can mitigate some of these limitations by employing simpler model architectures. These methods utilize GANs as translation models to generate training samples; however, GANs often lack generalizability and demonstrate limited diversity in the generated samples. Furthermore, adapting the model to multiple domains necessitates training multiple GANs, leading to additional training costs.

Figure 1: Shortcomings of GAN. Compared to GAN, which suffers from issues like noise, fake rainy effects, and blurriness, our GDT can generate more diverse and realistic images.

The main objective of this study is to introduce a comprehensive paradigm for Robust Monocular Depth Estimation (RMDE) aimed at overcoming the earlier-mentioned limitations. We aim to leverage valuable prior knowledge from stable diffusion [28] as the cornerstone of our approach. Modern generative diffusion models [28] have undergone extensive training on large-scale datasets, empowering them to produce high-quality images. Despite their potential, the use of generative diffusion models for RMDE remains largely unexplored in current literature. Figure 1 visually illustrates the drawbacks of GAN-based translation models compared to our Generative Diffusion Model-based Translation (GDT) model. To address this gap, we employ a self-training approach that integrates the stable diffusion prior effectively. Our experiments confirm the feasibility and promising performance of this approach.

In our experiments, we found that the conventional ResNet architecture [15] was not fully leveraging the potential of the stable diffusion prior. To overcome this limitation, we aimed to introduce a more potent feature encoder. Inspired by the Depth Anything approach [39], we integrated DINOv2 into our encoder architecture to extract more effective and generic features from the samples, thereby enhancing the overall model performance. Furthermore, we noted the utility of semantic feature alignment in our work, prompting us to introduce semantic loss. Moreover, we introduced the teacher loss to improve the distillation process for depth estimation. This novel loss function facilitated the student model in acquiring the correct knowledge from the teacher model while preventing any erroneous knowledge transfer. Our contributions can be summarized as follows:

  • We are the first to introduce stable diffusion into RMDE and propose a general paradigm that leverages the diffusion prior for robust depth estimation.

  • We present a plug-and-play translation model based on generative diffusion models that can be readily applied in various scenarios.

  • Our method outperforms existing approaches on the nuScenes and RobotCar datasets, achieving SOTA performance.

2 Related Works

2.1 Monocular Depth Estimation

MDE methods can be categorized into two types: supervised and self-supervised. Eigen et al. [6] introduced a CNN-based architecture for depth estimation, laying the foundation for supervised MDE. Over time, supervised MDE methods have evolved into regression methods [5, 16] and classification methods [7, 1, 19]. In contrast, self-supervised MDE does not rely on costly depth ground truth. It generates the supervisory signal using stereo image pairs [8, 11] or adjacent frames in a video [46, 12]. These methods reconstruct the image based on the positional relationship between the cameras and the estimated depth, improving depth estimation accuracy by reducing the difference between the reconstructed image and the target image. However, both supervised and self-supervised methods are susceptible to the effects of darkness and adverse weather conditions, as depicted in Figure 2. Therefore, there is a need for RMDE methods capable of overcoming the poor performance of current approaches in challenging lighting and weather conditions.

Figure 2: Darkness and weather effects on sensors. The images above are from the nuScenes dataset [3]. In night-time photos, RGB images often exhibit noise, textureless regions, and blurriness, which are not conducive to self-supervised learning. Additionally, rainy weather can introduce blur and reflections, leading to sparse and unreliable LiDAR signals, which are not suitable for supervised learning.

2.2 Robust Monocular Depth Estimation

In recent years, significant progress has been made in Robust Monocular Depth Estimation (RMDE), with several methods demonstrating promising results. These methods can be broadly categorized into two groups: model-based and data-based approaches.

Model-based methods aim to enhance the model architecture to handle challenging conditions. DeFeat-Net [32] proposes a unified framework for learning robust monocular depth estimation and dense feature representation, specifically targeting improved performance under darkness. RNW [37] employs image enhancement techniques and an adversarial approach to enhance model performance in dark environments. WSGD [34] addresses darkness by estimating flow maps and modeling light changes between adjacent frames. MonoViT [45] utilizes Vision Transformer (ViT) [4] to extract image features, enabling the model to perform well under various weather conditions.

Data-based methods focus on leveraging additional modalities or augmenting the image signal through techniques such as domain adaptation. DEISR [20] utilizes sparse Radar data to enable depth estimation under adverse conditions. R4Dyn [9] demonstrates the benefits of weak supervision using sparse Radar data during training and the advantages of using Radar data as an additional input during inference for RMDE. DET [31] leverages thermal images to achieve depth estimation in darkness and adverse weather conditions. ADFA [33] employs adversarial training to enable the model to adapt to darkness. ADDS [22] utilizes different feature extractors to extract invariant and private features in different domains, enabling the extraction of universal image features and improving depth estimation. ITDFA [44] employs CycleGAN to translate images from daytime to other domains and extracts features from different domains to handle darkness and adverse weather. Robust-Depth [29] introduces bi-directional pseudo-supervision loss and pseudo-supervised pose loss to compensate for the performance degradation caused by the use of translated images. WeatherDepth [36] introduces contrastive learning based on domain adaptation methods. Md4all [10] achieves robust depth estimation by not distinguishing images in standard and challenging conditions, allowing the model to learn from the trained teacher network.

It is worth noting that the robustness of a model encompasses not only its ability to handle challenging conditions but also its generalization to other corruption types such as noise, blur, and digital artifacts. Several works [27, 41, 26, 2, 38] have explored the generalization of MDE models and strategies to improve their performance in various corruption types. In contrast, our work specifically focuses on enhancing the performance of MDE models in real-world weather corruptions and darkness.

2.3 Generative Diffusion Models

DDPM is a generative model, also known as a diffusion model, that achieves image generation by performing a diffusion process in the image space. The impressive generative power of diffusion models has led to the desire to incorporate control conditions into the generated images. In the field of text-based image generation, Rombach et al. proposed the latent diffusion model (LDM) [28], which performs the diffusion process in the latent space. They also utilize the cross-attention mechanism [35] to introduce conditions into the LDM. Their text-to-image model is now known as Stable Diffusion. To control the spatial structure of an image, Zhang et al. proposed ControlNet [43], which provides an additional control image, such as a depth map, semantic map, or Canny edge map, to govern the spatial structure of the generated image. To control the style of generated images, Ye et al. proposed IP-Adapter [40], an effective and lightweight neural network architecture aimed at achieving image prompt capability for pre-trained text-to-image diffusion models. A text-to-image diffusion model equipped with the IP-Adapter can generate images in the style of the images fed to the adapter. IP-Adapter achieves this through a decoupled cross-attention mechanism in which the cross-attention layers for text features and image features are separate. Our GDT model supports depth maps, text, and image prompts of night or rainy-day images as conditions, enabling the generation of images that satisfy all of these conditions.

3 Methodology

In this paper, we propose SSD, a novel approach that aims to steal the stable diffusion prior for RMDE. SSD incorporates a new translation model called GDT, which is based on generative diffusion models. To adapt GDT for RMDE, we integrate DINOv2 into our depth model’s architecture, which helps extract universal image features. In addition, we optimize the distillation loss used for knowledge distillation. Our approach is general and can be adapted to a variety of challenging conditions.

3.1 Preliminaries

In supervised monocular depth estimation, a DepthNet is trained using sensor data as ground truth. The prediction process can be represented as $D = \mathcal{D}(I)$, where $I$ is the input image, $D$ is the dense depth map of $I$, and $\mathcal{D}$ is the DepthNet. For self-supervised MDE, adjacent frames in videos are used to train the DepthNet. In addition to the DepthNet, a PoseNet is required to estimate the ego-motion between consecutive frames. For a frame $I_t$ in a video, the DepthNet is used to predict the corresponding depth map $D_t$, and the PoseNet is used to predict the camera ego-motion $P_{t\rightarrow t'}$. The process of predicting the camera ego-motion can be expressed as $P_{t\rightarrow t'} = \mathcal{P}(I_t, I_{t'})$, where $I_{t'}$ represents the adjacent frame of $I_t$ sampled from $\{I_{t-1}, I_{t+1}\}$, and $\mathcal{P}$ represents the PoseNet. The depth map and camera ego-motion are then used to synthesize the target image $I_t$. This process is described in Equation 1, where $K$ represents the camera intrinsics, $Proj(\cdot)$ is the function that outputs the 2D coordinates, and $\langle\cdot\rangle$ is the sampling operator.

$I_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t,\, P_{t\rightarrow t'},\, K)\right\rangle$   (1)
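For concreteness, the warping of Equation 1 can be sketched in PyTorch as follows, assuming a pinhole camera model and a 4×4 pose matrix; this is an illustrative sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_tgt, pose_tgt2src, K, K_inv):
    """Warp the adjacent frame I_t' into the target view using D_t and P_{t->t'} (Eq. 1).

    img_src:      (B, 3, H, W) adjacent frame I_t'
    depth_tgt:    (B, 1, H, W) predicted depth D_t of the target frame I_t
    pose_tgt2src: (B, 4, 4)    predicted ego-motion P_{t->t'}
    K, K_inv:     (B, 3, 3)    camera intrinsics and their inverse
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D, apply the relative pose, re-project (Proj(.) in Eq. 1).
    cam_pts = depth_tgt.reshape(B, 1, -1) * (K_inv @ pix)
    cam_pts = pose_tgt2src[:, :3, :3] @ cam_pts + pose_tgt2src[:, :3, 3:]
    proj = K @ cam_pts
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalise coordinates to [-1, 1] and sample I_t' (the <.> operator in Eq. 1).
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
```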

Finally, the photometric loss ($pe$) is calculated as defined in Equation 2, where $\alpha$ is set to 0.85. As in Monodepth2 [12], the per-pixel photometric loss $\mathcal{L}_p$ is defined as $\min_{t'} pe(I_t, I_{t'\rightarrow t})$.

$pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - SSIM(I_a, I_b)\bigr) + (1-\alpha)\,\lVert I_a - I_b \rVert$   (2)
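A matching sketch of the per-pixel photometric error $pe$ of Equation 2 is given below, with $\alpha = 0.85$; the SSIM variant based on 3×3 average pooling is an implementation assumption in the style of Monodepth2.

```python
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    """pe(I_a, I_b) of Eq. 2: weighted SSIM + L1 term, averaged over colour channels."""
    ssim_term = (1 - ssim(i_a, i_b)).mean(1, keepdim=True)
    l1_term = (i_a - i_b).abs().mean(1, keepdim=True)
    return alpha / 2 * ssim_term + (1 - alpha) * l1_term
```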

3.2 Generative Diffusion Model-based Translation

As discussed in Section 1, GAN-based translation models have exhibited certain limitations. In this paper, we present a novel translation model called GDT (Generative Diffusion Model-based Translation). The pipeline of GDT is depicted in Figure 3. The objective of GDT is to generate a training sample $I_g$ that closely resembles the day-clear image $I_d$ in terms of depth. To achieve this, we leverage the Stable Diffusion prior and introduce several control mechanisms to transform the day-clear image into challenging conditions while preserving specific characteristics.

First, we employ the BLIP-2 model [18] to obtain a scene description $T_{cap}$, which aids in preserving the content of the image. Second, we utilize the depth-to-image (d2i) model of ControlNet [43] to maintain approximate depth consistency. We have observed that the clearer the depth information, the more realistic the generated RGB image becomes; thus, we employ PatchFusion to enhance the depth estimation obtained from the MiDaS model. Third, we address the challenge of introducing diverse styles into the generated images. We combine $T_{cap}$ with challenging-condition descriptors. However, we encountered an issue where the generated challenging-condition images lacked realism. To overcome this, we introduce an image prompt for challenging conditions. In our implementation, we randomly select night or rain images from the training dataset as image prompts. We utilize the IP-Adapter model [40] to incorporate both text prompts and image prompts as inputs for image generation. The text prompt assists in preserving the content, while the image prompt facilitates style transfer.

Figure 3: The GDT pipeline incorporates multiple large models and PatchFusion to generate high-quality training samples.

In comparison to previous GAN-based methods, our GDT approach generates images with greater diversity and does not necessitate training for each specific scene. It can be considered a plug-and-play module. As more powerful generative diffusion models continue to emerge, our GDT method will become even more robust. The generation of the image $I_g$ can be expressed as follows:

$I_g = SD\bigl(IP(T_p, I_p),\, CN(D_h),\, z\bigr)$   (3)

Here, $SD$ represents the Stable Diffusion prior, $IP$ denotes the IP-Adapter model that combines the text prompt $T_p$ and image prompt $I_p$, $CN$ corresponds to the ControlNet model for depth consistency (utilizing $D_h$ as input), and $z$ represents the latent noise variable.
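The following sketch shows how such a generation pipeline could be assembled from off-the-shelf components (BLIP-2 for $T_{cap}$, Stable Diffusion v1.5 with a depth ControlNet for $CN(D_h)$, and an IP-Adapter for $IP(T_p, I_p)$) using the Hugging Face transformers and diffusers libraries. The model identifiers, prompt wording, and adapter scale below are illustrative assumptions, and the MiDaS + PatchFusion step that produces the depth condition $D_h$ is omitted.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

device = "cuda"

# 1) Caption the day-clear image I_d with BLIP-2 to obtain the content prompt T_cap.
blip_proc = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device, torch.float16)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.batch_decode(out, skip_special_tokens=True)[0].strip()

# 2) Stable Diffusion v1.5 + depth ControlNet (d2i) + IP-Adapter for the style prompt I_p.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to(device)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)  # illustrative style strength

def gdt_translate(day_img, depth_cond, style_img, condition="driving at night, rainy street"):
    """I_g = SD(IP(T_p, I_p), CN(D_h), z): the depth condition keeps the geometry,
    the text prompt keeps the content, and the image prompt transfers the style."""
    text_prompt = f"{caption(day_img)}, {condition}"
    return pipe(prompt=text_prompt,
                image=depth_cond,            # D_h rendered as a depth image for ControlNet
                ip_adapter_image=style_img,  # I_p: a night or rain image from the training set
                num_inference_steps=30).images[0]
```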

Figure 4: SSD framework for robust depth estimation. The Student Net receives guidance from the Teacher Net, leveraging a stable diffusion prior. The semantic loss ensures semantic consistency, while the teacher loss enables the Student Net to learn beyond the capabilities of the Teacher Net.

3.3 Stealing Stable Diffusion Prior

In order to steal a stable diffusion prior for robust depth estimation, we have made modifications to the architecture of Monodepth2 [12]. Additionally, we have employed a self-training strategy and introduced additional loss functions. Figure 4 provides a brief illustration of our SSD pipeline.

3.3.1 Network Architecture

The fundamental difference between generative diffusion model-based translation and GAN-based translation lies in their capacity to generate data. Both approaches must adhere to the assumption that depth remains constant before and after image translation. As a result, the translation process of a GAN is more akin to style transfer, where the images before and after translation essentially belong to the same dataset.

In contrast, generative diffusion model-based translation, leveraging its powerful generative capabilities, guarantees that the images before and after translation have the same depth while allowing significant changes in content. This process can be regarded as a form of data scale-up. To benefit from such data scale-up, a simple encoder alone cannot extract image features rich enough to further enhance model performance. Therefore, we modify the depth model based on Monodepth2 [12]. We move away from the original ResNet-based architecture [15] and introduce DINOv2 [25] as the encoder in our depth model. This choice of encoder facilitates the extraction of robust image features. Additionally, we utilize pre-trained weights, enabling the depth model to inherit rich semantic information and scene understanding capabilities. Furthermore, we employ the DPT depth head [26] as the decoder to regress depth. As for the PoseNet, we retain the original ResNet-based network architecture without any changes.
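As an illustration, a depth network built around a pre-trained DINOv2 encoder could be sketched as follows; the lightweight convolutional decoder here only stands in for the DPT head [26] used in our actual model, and the input resolution must be a multiple of the ViT patch size (14).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DinoDepthNet(nn.Module):
    """Sketch of a DepthNet with a DINOv2 ViT-B/14 encoder and a toy decoder."""

    def __init__(self, freeze_encoder: bool = False):
        super().__init__()
        # Pre-trained DINOv2 backbone provides rich semantic features.
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad_(False)
        self.decoder = nn.Sequential(
            nn.Conv2d(768, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())  # normalised disparity

    def forward(self, x):
        # Patch tokens reshaped to a (B, 768, H/14, W/14) feature map.
        feats = self.encoder.get_intermediate_layers(x, n=1, reshape=True)[0]
        disp = self.decoder(feats)
        # Upsample back to the input resolution (e.g. 784x518 in our setting).
        return F.interpolate(disp, size=x.shape[-2:], mode="bilinear", align_corners=False)
```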

3.3.2 Self-Training Strategy

Our self-training strategy follows the approach introduced in [10], which has proven to be simple and effective in improving the performance of depth models. We adopt this strategy, which involves training a teacher network on day-clear samples and subsequently training a student network.

The first step is to train the teacher network using day-clear samples. This ensures that the teacher network produces reliable results specifically for day-clear scenarios. For instance, in the case of self-supervised monocular depth estimation, training the depth model using night samples can be challenging due to the lack of texture and the inability to utilize correspondences between adjacent frames. Photometric loss may fail in such scenarios.

Next, we train the student network to perceive depth information across a variety of scenarios. To achieve this, we train the student network using a diverse range of training samples. A day-clear training sample $I_d$ can potentially be translated into a challenging-condition training sample $I_g$. Here, we make the basic assumption that the depth between $I_d$ and $I_g$ remains essentially consistent. A sample is kept unchanged with probability $P$ and translated into $I_g$ with probability $1 - P$, where $P = |C|/(|C|+1)$ and $|C|$ is the number of challenging conditions of interest $C$. We leverage the teacher model to generate the pseudo ground truth $D_t$ for $I_d$. This pseudo ground truth $D_t$ also serves as guidance for the student model when processing the GDT-translated $I_g$.

We then input a mixture of $I_d$ and $I_g$ into the student model and generate the predicted depth map $D_s$. By computing the distillation loss using $D_t$ and $D_s$, our model becomes capable of estimating depth not only under day-clear conditions but also under challenging conditions. This self-training strategy enables the student network to learn from both the teacher network and the GDT-translated samples, improving its ability to estimate depth across a diverse range of scenarios.
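The mixing and pseudo-labelling step described above can be sketched as follows; the helper names and data layout are hypothetical, and the translated images are assumed to have been pre-generated by GDT.

```python
import random
import torch

def build_student_batch(day_imgs, translated, teacher_depthnet,
                        conditions=("night", "day-rain")):
    """One self-training step: mix I_d / I_g inputs and produce pseudo depth D_t.

    day_imgs:   list of day-clear tensors I_d, each (3, H, W)
    translated: dict mapping condition -> list of GDT outputs I_g aligned with day_imgs
    """
    keep_p = len(conditions) / (len(conditions) + 1)  # P = |C| / (|C| + 1)
    inputs = []
    for i, img in enumerate(day_imgs):
        if random.random() < keep_p:
            inputs.append(img)                    # keep the day-clear sample
        else:
            cond = random.choice(conditions)      # swap in a challenging-condition sample
            inputs.append(translated[cond][i])
    inputs = torch.stack(inputs)

    # The pseudo ground truth is always predicted from the day-clear image by the teacher,
    # and guides the student regardless of whether its input was translated.
    with torch.no_grad():
        pseudo_depth = teacher_depthnet(torch.stack(day_imgs))
    return inputs, pseudo_depth
```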

3.3.3 Loss computation

We ultimately employ two loss functions: the teacher loss $\mathcal{L}_t$ and the semantic alignment loss $\mathcal{L}_s$. The teacher loss addresses the limitations of the naive distillation loss, while the semantic alignment loss draws inspiration from [39] and aims to better utilize the semantic similarity between $I_d$ and $I_g$. $\mathcal{L}_s$ is suitable for both supervised and self-supervised learning, whereas $\mathcal{L}_t$ is only applicable to self-supervised learning and is replaced by the supervised loss in the supervised setting.

In self-supervised MDE, the teacher model struggles to accurately estimate the depth of each pixel, while the student model may outperform the teacher model in specific pixels. Although the teacher model initially provides guidance to the student model during training, as training progresses, the student model becomes more capable of independently estimating depth. In some cases, the student model’s estimates may be more accurate than those of the teacher model. However, due to the limitations of the distillation loss used, the student model relies solely on the teacher model’s estimates, hindering further improvement in its performance. On this basis, we propose the teacher loss. The core idea of teacher loss is to mask pixels corresponding to unreasonable depths estimated by the teacher model when calculating distillation loss.

For a day-clear frame $I_t$, we use the DepthNets of the teacher model and the student model to generate the corresponding dense depth maps $D_t^t$ and $D_t^s$. Although the input of the student model may be translated into $I_t^g$, we use the adjacent frames $I_{t'}$ to reconstruct the image $I_t$. Then, we use the PoseNet of the teacher model to estimate the camera ego-motion $P^t_{t\rightarrow t'}$. We obtain $I^s_{t'\rightarrow t}$ and $I^t_{t'\rightarrow t}$ from $D_t^s$, $D_t^t$ and $P^t_{t\rightarrow t'}$ as shown in Equation 4.

$I^s_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t^s,\, P^t_{t\rightarrow t'},\, K)\right\rangle$, \quad $I^t_{t'\rightarrow t} = I_{t'}\left\langle Proj(D_t^t,\, P^t_{t\rightarrow t'},\, K)\right\rangle$   (4)

For self-supervised MDE, the lower the photometric loss, the more reasonable the estimated depth. If $D_t^t$ is more reasonable than $D_t^s$, then the photometric loss computed with $D_t^t$ (T-$pe$) is lower than that computed with $D_t^s$ (S-$pe$). Thus, we design the mask $M$ to select the pixels whose S-$pe$ is higher than their T-$pe$, i.e., the pixels where the teacher's depth is still the more reliable one, as shown in Equation 5, where $[\,\cdot\,]$ denotes the Iverson bracket.

$M = \left[\, \min_{t'} pe\bigl(I_t, I^s_{t'\rightarrow t}\bigr) > \min_{t'} pe\bigl(I_t, I^t_{t'\rightarrow t}\bigr) \,\right]$   (5)

We use the mask $M$ to filter out the pixels for which the teacher's depth is unreasonable. Our teacher loss is given in Equation 6, where $\mathcal{L}_d$ represents the day distillation loss defined in md4all-DD and $\odot$ represents the pixel-wise product.

$\mathcal{L}_t = M \odot \mathcal{L}_d$   (6)
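The masked distillation of Equations 5 and 6 could be computed roughly as in the sketch below, reusing the `inverse_warp` and `photometric_error` sketches from Section 3.1; `distill_fn` stands in for the per-pixel day distillation loss $\mathcal{L}_d$ of md4all-DD.

```python
import torch

def teacher_loss(i_t, i_prev, i_next, d_student, d_teacher,
                 pose_prev, pose_next, K, K_inv, distill_fn):
    """L_t = M ⊙ L_d (Eq. 6), with M the Iverson-bracket mask of Eq. 5."""

    def min_pe(depth):
        # min over adjacent frames t' of pe(I_t, I_{t'->t}) for the given depth map.
        pes = [photometric_error(i_t, inverse_warp(i_src, depth, pose, K, K_inv))
               for i_src, pose in ((i_prev, pose_prev), (i_next, pose_next))]
        return torch.min(torch.stack(pes, dim=0), dim=0).values

    with torch.no_grad():
        # Keep only pixels where the teacher's depth is still the more
        # photometrically consistent one (S-pe > T-pe).
        mask = (min_pe(d_student) > min_pe(d_teacher)).float()
    return (mask * distill_fn(d_student, d_teacher)).mean()
```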

Besides the teacher loss, we also use the semantic loss proposed in [39]. As the images translated by our GDT retain semantic information similar to the original day-clear image $I_d$, we use the pre-trained DINOv2 to produce auxiliary image features for semantic alignment, as given in Equation 7, where $f_i$ and $f_{i'}$ denote the depth model's encoder feature and the frozen pre-trained DINOv2 feature at pixel $i$, respectively.

$\mathcal{L}_s = 1 - \frac{1}{HW}\sum_{i=1}^{HW} \cos(f_i, f_{i'})$   (7)
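A direct transcription of Equation 7 might look as follows, assuming the student encoder features and the frozen DINOv2 features have already been resampled to a common spatial resolution.

```python
import torch.nn.functional as F

def semantic_loss(student_feats, dino_feats):
    """L_s of Eq. 7: one minus the mean cosine similarity between feature maps.

    student_feats: (B, C, H, W) features from the depth model's encoder
    dino_feats:    (B, C, H, W) features from the frozen pre-trained DINOv2
    """
    cos = F.cosine_similarity(student_feats, dino_feats.detach(), dim=1)  # (B, H, W)
    return (1.0 - cos).mean()
```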

4 Experiments and Results

4.1 Experimental Setup

4.1.1 Datasets and Metrics

In accordance with md4all [10], we choose nuScenes [3] and Oxford RobotCar [24] as the datasets for our study. NuScenes is a comprehensive autonomous driving dataset offering abundant sensor data and precise annotations. It encompasses around 1,000 urban driving scenes covering diverse weather conditions, traffic scenarios, and road types. For the RMDE task, we partition the dataset based on visibility. In total, the dataset contains over 34,000 samples, with 28,130 allocated for training and 6,019 for validation. However, we use only 15,120 day-clear samples for training, and the 6,019 validation samples are further divided based on visibility (i.e., day-clear, night, day-rain). The depth range for testing is 0.1-80 meters. RobotCar is a dataset collected in Oxford, UK. In our study, we exclusively use day and night scenes, excluding rainy scenarios. The dataset comprises 16,563 day training samples and 1,411 test samples (702 day and 709 night). The depth range for testing is 0.1-50 meters.

In our experiments, we employ commonly adopted metrics for depth estimation, namely absRel, RMSE, $\delta_1$, and sqRel. Specifically, we use the first three metrics for evaluating the model’s performance on nuScenes, while all four metrics are used for assessing the model’s performance on RobotCar.
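For reference, these metrics can be computed as in the sketch below; the valid-pixel selection and clamping range are assumptions, and test-time median scaling via LiDAR (used for the methods marked with * in the tables) is omitted.

```python
import torch

def depth_metrics(pred, gt, min_depth=0.1, max_depth=80.0):
    """absRel, sqRel, RMSE and delta_1 over valid ground-truth pixels."""
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid].clamp(min_depth, max_depth), gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    delta_1 = (torch.max(pred / gt, gt / pred) < 1.25).float().mean()
    return {"absRel": abs_rel.item(), "sqRel": sq_rel.item(),
            "RMSE": rmse.item(), "delta1": 100.0 * delta_1.item()}
```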

4.1.2 Implementation Details

Both our teacher and student depth models utilize the architecture described in Section 3.3. In our experiments, we employ the DINOv2 encoder based on ViT-Base as the encoder for our depth model, and we use ResNet-18 as the pose-net encoder. Both the student and teacher networks were trained for 20 epochs. The image input size for the pose encoder is 576×320, while the DINOv2 encoder has an input size of 784×518 (the output is resized to 576×320 for alignment). During both stages of training, the learning rate for our depth model, which includes the DINOv2 encoder and the DPT head, was set to 5e-6. The learning rate for the teacher’s pose net was set to 5e-5 (the student’s pose net is dropped). We used the AdamW optimizer with a linear schedule to decay the learning rate, and a batch size of 4. The parameters of the DINOv2 encoder were initialized with pre-trained weights. For self-training, the DINOv2 encoder was guided by the semantic loss using the pre-trained weights, not the teacher’s DINOv2 encoder. For GDT, we used pre-trained models, including MiDaS, PatchFusion, BLIP-2, Stable Diffusion v1.5, ControlNet v1.1, and IP-Adapter, without further training. Additionally, we chose the p49 mode for PatchFusion. Since the night images generated by GDT are brighter than typical night images, we adjusted their brightness to simulate more realistic outdoor night scenes. All experiments were conducted on a single 24GB RTX 4090 GPU.
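The reported optimizer settings could be wired up roughly as follows; `depth_net` and `pose_net` are placeholders for the modules described above.

```python
import torch

def build_optimizer(depth_net: torch.nn.Module, pose_net: torch.nn.Module,
                    num_epochs: int = 20):
    """AdamW with per-module learning rates (5e-6 for the DINOv2 encoder + DPT head,
    5e-5 for the teacher's PoseNet) and a linear decay over the training epochs."""
    optimizer = torch.optim.AdamW([
        {"params": depth_net.parameters(), "lr": 5e-6},
        {"params": pose_net.parameters(), "lr": 5e-5},
    ])
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: 1.0 - epoch / num_epochs)
    return optimizer, scheduler
```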

4.2 Quantitative Results

Method | sup. | tr.data | day-clear (absRel↓ / RMSE↓ / δ1↑) | night (absRel↓ / RMSE↓ / δ1↑) | day-rain (absRel↓ / RMSE↓ / δ1↑)
(a) Monodepth2 [12] | M* | a | 0.1477 / 6.771 / 85.25 | 2.3332 / 32.940 / 10.54 | 0.4114 / 9.442 / 60.58
(b) PackNet-SfM [13] | Mv | d | 0.1567 / 7.230 / 82.64 | 0.2617 / 11.063 / 56.64 | 0.1645 / 8.288 / 77.07
(c) RNW [37] | M* | dn | 0.2872 / 9.185 / 56.21 | 0.3333 / 10.098 / 43.72 | 0.2952 / 9.341 / 57.21
(d) md4all-baseline | Mv | d | 0.1333 / 6.459 / 85.88 | 0.2419 / 10.922 / 58.17 | 0.1572 / 7.453 / 79.49
(e) md4all-AD [10] | Mv | dT(nr) | 0.1523 / 6.853 / 83.11 | 0.2187 / 9.003 / 68.84 | 0.1601 / 7.832 / 78.97
(f) md4all-DD [10] | Mv | dT(nr) | 0.1366 / 6.452 / 84.61 | 0.1921 / 8.507 / 71.07 | 0.1414 / 7.228 / 80.98
(g) SSD-T | M*v | d | 0.1223 / 6.132 / 87.39 | 0.2103 / 8.024 / 67.70 | 0.1718 / 6.988 / 78.87
(h) SSD-S | M*vst | dT(nr) | 0.1217 / 5.982 / 87.13 | 0.1939 / 8.038 / 72.31 | 0.1410 / 6.474 / 82.87
Table 1: Evaluation of self-supervised methods on the nuScenes [3] validation set. SSD-T: teacher model, SSD-S: student model. Supervision (sup.): M: via monocular videos, *: test-time median-scaling via LiDAR, v: weak velocity loss, s: semantic loss, t: teacher loss. Training data (tr.data): d: day-clear, T(): translated into, n: night, r: day-rain, a: all. In the original table, the best result per column is shown in bold and the second best is underlined.

The nuScenes dataset [3] is divided into three scenarios: day-clear, night, and day-rain. Following the setup proposed in [10], we evaluated the performance of our model in each of these scenarios. The quantitative results are presented in Table 1.

Due to limitations imposed by the model architecture (which requires 3D convolutions) and the number of samples in the dataset, (b) PackNet-SfM's performance shows a slight improvement for night and rain scenes but is greatly reduced for day-clear scenes. The weak velocity loss in (b) PackNet-SfM contributes to obtaining metric depth. Although (c) RNW employs a unique image enhancement method and adversarial training, it performs worse than our baseline. On the other hand, (e) md4all-AD and (f) md4all-DD are simple and efficient methods that enhance the model's robustness. Building upon the md4all approach, we improve the model architecture and the translation model. Comparing (d) and (g), our model architecture (DINOv2 encoder with DPT head) demonstrates significant superiority over Monodepth2. Additionally, thanks to the pre-trained weights of the DINOv2 encoder, our depth model can extract more robust features. Comparing (g) and (h), our student model outperforms the teacher model: in the night scenario, absRel is reduced by 7.8%, and in the day-rain scenario, absRel is reduced by 17.9%. This demonstrates that the student model indeed benefits from the GDT-translated images and effectively incorporates the stable diffusion prior.

Method | sup. | tr.data | day – RobotCar (absRel↓ / sqRel↓ / RMSE↓ / δ1↑) | night – RobotCar (absRel↓ / sqRel↓ / RMSE↓ / δ1↑)
(a) Monodepth2 [12] | M* | d | 0.1196 / 0.670 / 3.164 / 86.38 | 0.3029 / 1.724 / 5.038 / 45.88
(b) DeFeat-Net [32] | M* | a | 0.2470 / 2.980 / 7.884 / 65.00 | 0.3340 / 4.589 / 8.606 / 58.60
(c) ADIDS [23] | M* | a | 0.2390 / 2.089 / 6.743 / 61.40 | 0.2870 / 2.569 / 7.985 / 49.00
(d) RNW [37] | M* | a | 0.2970 / 2.608 / 7.996 / 43.10 | 0.1850 / 1.710 / 6.549 / 73.30
(e) WSGD [34] | M* | a | 0.1760 / 1.603 / 6.036 / 75.00 | 0.1740 / 1.637 / 6.302 / 75.40
(f) md4all-DD [10] | Mv | dT(n) | 0.1128 / 0.648 / 3.206 / 87.13 | 0.1219 / 0.784 / 3.604 / 84.86
(g) SSD-T | Mv | dT(n) | 0.1069 / 0.762 / 3.241 / 89.38 | 0.1137 / 0.685 / 3.460 / 86.14
(h) SSD-S | Mv | dT(n) | 0.1058 / 0.718 / 3.203 / 89.09 | 0.1143 / 0.823 / 3.646 / 87.36
Table 2: Evaluation of self-supervised works on the RobotCar [24] test set. Notation as in Table 1.

The RobotCar dataset [24] is divided into day and night scenarios. (a), (f), and (g) demonstrate the strong generalization capability of our model architecture. Despite not being trained on night data, our model performs well and even outperforms md4all-DD. Furthermore, (g) and (h) illustrate that, despite already achieving high performance, our GDT provides a prior that enables the student model to surpass the teacher model. This leads to further improvements in depth estimation accuracy for both day and night scenarios.

4.3 Qualitative Results

Figure 5: Comparison of samples from the nuScenes dataset [3] among monodepth2 [12], md4all-DD [10], and our self-supervised teacher model SSD-T, as well as the student model SSD-S.
Figure 6: Comparison of samples from the RobotCar dataset [24] among monodepth2 [12], md4all-DD [10], and our self-supervised teacher model SSD-T, as well as the student model SSD-S.

The qualitative results depicted in Figure 5 and Figure 6 showcase the depth estimation performance of our SSD model on the nuScenes [3] and RobotCar [24] datasets. On nuScenes, our SSD model demonstrates enhanced performance in extracting depth information from regions with low illumination, such as night scenes. Moreover, our approach effectively recovers depth information even in rainy scenes. Figure 5 highlights the advantages of our SSD-T (teacher) and SSD-S (student) models over monodepth2 and md4all-DD in night scenes: our models accurately estimate the depth of cars and pedestrians, whereas the other methods fail to do so. However, in rainy scenes, SSD-T's performance is compromised precisely because of the strong perceptive ability of the DINOv2 architecture: SSD-T mistakenly perceives reflections in water as objects, leading to incorrect depth estimations. Nonetheless, SSD-S, which incorporates the semantic loss and the teacher loss, achieves excellent results. Figure 6 displays the results obtained in both day-clear and night scenes. In the day-clear scene, both SSD-T and SSD-S capture the edge of the wall in their depth maps, whereas monodepth2 and md4all-DD fail to do so. In the night scene, monodepth2 produces unreliable depth maps, whereas our SSD-T and SSD-S accurately estimate the depth of the bicycle on the right side.

4.4 Ablation Study

We conducted ablation experiments on the nuScenes validation set to validate the effectiveness of our proposed approach. The results are presented in Table 3. In the ablation study, our metrics represent the average performance of the model across all samples in the validation set. In Method 1, we changed the original translation model of md4all to our GDT, without incorporating PatchFusion. The performance of the model improved as GDT can produce detailed images that preserve fine depth information. In Method 2, we introduced PatchFusion, which enhanced the model’s performance by generating high-resolution depth maps. These maps contribute to the preservation of fine depth details and aid GDT in producing more accurate images. During knowledge distillation, the teacher model may produce unreasonable depth estimations. In Method 3, we replaced the distillation loss with the teacher loss, resolving this issue and further improving the model’s performance. Considering the powerful generative capability of SSD, it is necessary to have a more robust backbone to produce reliable image features. DINOv2, known for its ability to generate robust image features, is suitable for our task. In Method 4, after adopting DINOv2 as our encoder, the performance of the depth model improved significantly. In Method 5, we employed semantic loss to align the image features, resulting in a further improvement in model performance.

method | backbone | translation model | loss | absRel↓ | RMSE↓ | δ1↑
0 | ResNet-18 | GAN | distillation | 0.1513 | 6.761 | 81.51
1 | ResNet-18 | GDT (MiDaS) | distillation | 0.1529 | 6.887 | 81.03
2 | ResNet-18 | GDT (MiDaS + PatchFusion) | distillation | 0.1497 | 6.806 | 81.16
3 | ResNet-18 | GDT (MiDaS + PatchFusion) | teacher | 0.1485 | 6.801 | 81.30
4 | DINOv2 | GDT (MiDaS + PatchFusion) | teacher | 0.1360 | 6.333 | 84.07
5 | DINOv2 | GDT (MiDaS + PatchFusion) | teacher + semantic | 0.1320 | 6.266 | 84.96
Table 3: Ablation study on the nuScenes dataset. We compare the effects of different training settings (backbone, translation model, and loss) on the depth model; metrics are averaged over all validation samples.

5 Conclusion

This paper introduces SSD, a practical solution for robust monocular depth estimation. We underscore the significance of the generative diffusion model prior in generating challenging samples. We have successfully integrated the stable diffusion prior into our depth estimation model using a self-training approach. With the emergence of stronger generative diffusion models and more accurate control models, along with advancements in depth model capabilities, the potential for robust depth estimation can be further enhanced. Furthermore, the SSD paradigm can extend to other dense prediction tasks, such as robust semantic segmentation and 3D occupancy prediction, offering promising avenues for future research and applications.

References

  • [1] Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4009–4018 (June 2021)
  • [2] Birkl, R., Wofk, D., Müller, M.: MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)
  • [3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020)
  • [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [5] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2015)
  • [6] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. vol. 27. Curran Associates, Inc. (2014), https://proceedings.neurips.cc/paper_files/paper/2014/file/7bccfde7714a1ebadf06c5f4cea752c1-Paper.pdf
  • [7] Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2002–2011 (2018)
  • [8] Garg, R., B.G., V.K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: Geometry to the rescue. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. pp. 740–756. Springer International Publishing, Cham (2016)
  • [9] Gasperini, S., Koch, P., Dallabetta, V., Navab, N., Busam, B., Tombari, F.: R4dyn: Exploring radar for self-supervised monocular depth estimation of dynamic scenes. In: 2021 International Conference on 3D Vision (3DV). pp. 751–760 (2021). https://doi.org/10.1109/3DV53792.2021.00084
  • [10] Gasperini, S., Morbitzer, N., Jung, H., Navab, N., Tombari, F.: Robust monocular depth estimation under challenging conditions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8177–8186 (October 2023)
  • [11] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [12] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
  • [13] Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2485–2494 (2020)
  • [14] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge university press (2003)
  • [15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [16] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 2016 Fourth international conference on 3D vision (3DV). pp. 239–248. IEEE (2016)
  • [17] Li, J., Wang, Y., Huang, Z., Zheng, J., Xian, K., Cao, Z., Zhang, J.: Diffusion-augmented depth prediction with sparse annotations. In: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23). pp. 2865–2876. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3581783.3611807
  • [18] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [19] Li, Z., Wang, X., Liu, X., Jiang, J.: Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987 (2022)
  • [20] Lin, J.T., Dai, D., Gool, L.V.: Depth estimation from monocular images and sparse radar data. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 10233–10240 (2020). https://doi.org/10.1109/IROS45743.2020.9340998
  • [21] Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 2024–2039 (2016). https://doi.org/10.1109/TPAMI.2015.2505283
  • [22] Liu, L., Song, X., Wang, M., Liu, Y., Zhang, L.: Self-supervised monocular depth estimation for all day images using domain separation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12737–12746 (2021)
  • [23] Liu, L., Song, X., Wang, M., Liu, Y., Zhang, L.: Self-supervised monocular depth estimation for all day images using domain separation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 12737–12746 (October 2021)
  • [24] Maddern, W., Pascoe, G., Linegar, C., Newman, P.: 1 Year, 1000km: The Oxford RobotCar Dataset. The International Journal of Robotics Research (IJRR) 36(1), 3–15 (2017). https://doi.org/10.1177/0278364916679498, http://dx.doi.org/10.1177/0278364916679498
  • [25] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without supervision (2023)
  • [26] Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021)
  • [27] Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence 44(3), 1623–1637 (2020)
  • [28] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022)
  • [29] Saunders, K., Vogiatzis, G., Manso, L.J.: Self-supervised monocular depth estimation: Let’s talk about the weather. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8907–8917 (October 2023)
  • [30] Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47, 7–42 (2002)
  • [31] Shin, U., Park, J., Kweon, I.S.: Deep depth estimation from thermal image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1043–1053 (June 2023)
  • [32] Spencer, J., Bowden, R., Hadfield, S.: Defeat-net: General monocular depth via simultaneous unsupervised representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14402–14413 (2020)
  • [33] Vankadari, M., Garg, S., Majumder, A., Kumar, S., Behera, A.: Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. pp. 443–459. Springer International Publishing, Cham (2020)
  • [34] Vankadari, M., Golodetz, S., Garg, S., Shin, S., Markham, A., Trigoni, N.: When the sun goes down: Repairing photometric losses for all-day depth estimation. In: Liu, K., Kulic, D., Ichnowski, J. (eds.) Proceedings of The 6th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 205, pp. 1992–2003. PMLR (14–18 Dec 2023), https://proceedings.mlr.press/v205/vankadari23a.html
  • [35] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [36] Wang, J., Lin, C., Nie, L., Huang, S., Zhao, Y., Pan, X., Ai, R.: Weatherdepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions. arXiv preprint arXiv:2310.05556 (2023)
  • [37] Wang, K., Zhang, Z., Yan, Z., Li, X., Xu, B., Li, J., Yang, J.: Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 16055–16064 (October 2021)
  • [38] Xian, K., Cao, Z., Shen, C., Lin, G.: Towards robust monocular depth estimation: A new baseline and benchmark. International Journal of Computer Vision pp. 1–19 (2024)
  • [39] Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. arXiv:2401.10891 (2024)
  • [40] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models (2023)
  • [41] Yin, W., Liu, Y., Shen, C.: Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(10), 7282–7295 (2021)
  • [42] Yuan, W., Gu, X., Dai, Z., Zhu, S., Tan, P.: Newcrfs: Neural window fully-connected crfs for monocular depth estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)
  • [43] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3836–3847 (October 2023)
  • [44] Zhao, C., Tang, Y., Sun, Q.: Unsupervised monocular depth estimation in highly complex environments. IEEE Transactions on Emerging Topics in Computational Intelligence 6(5), 1237–1246 (2022). https://doi.org/10.1109/TETCI.2022.3182360
  • [45] Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S.: Monovit: Self-supervised monocular depth estimation with a vision transformer. In: 2022 International Conference on 3D Vision (3DV). pp. 668–678. IEEE (2022)
  • [46] Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)