
Monocular Depth Estimation Using ZoeDepth: Our Experience

Bhaskar Bose
11 min read · May 1, 2023


Often seen as a challenging problem, monocular depth estimation involves predicting the depth of each pixel given a single input image. It has multiple applications in autonomous driving, robotics, navigation and even 3D reconstruction. However, an interesting application that came our way during a recent challenge prompted us to explore this method.

We used the ZoeDepth method in the recently concluded AICrowd monocular depth estimation challenge, hosted by AICrowd and Amazon Prime Air. This article covers the theoretical and practical concepts behind the method and gives a detailed account of its performance on a custom dataset.

The challenge page is here: https://www.aicrowd.com/challenges/scene-understanding-for-autonomous-drone-delivery-suadd-23/problems/mono-depth-perception

The dataset (SUADD'23 monocular depth estimation) is here: https://www.aicrowd.com/challenges/scene-understanding-for-autonomous-drone-delivery-suadd-23/problems/mono-depth-perception/dataset_files

Table of Contents —

1. Monocular Depth Estimation

2. An Overview of MiDaS

3. The ZoeDepth architecture

4. Custom Training on the SUADD’23 Dataset

5. Results and Observations

Monocular Depth Estimation

At first glance, the task may seem similar to most segmentation tasks: we simply need to assign a depth value to each pixel of an input image. An important difference, however, is that unlike in semantic segmentation we don't have a fixed set of classes. The range of depth values differs from image to image, so we can treat depth estimation as a per-pixel regression task.
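To make the distinction concrete, here is a toy sketch (purely illustrative, not from the challenge code) contrasting the two kinds of outputs:

    import numpy as np

    h, w, num_classes = 4, 4, 3
    seg_logits = np.random.rand(num_classes, h, w)
    seg_map = seg_logits.argmax(axis=0)        # segmentation: one class id per pixel, from a fixed set
    depth_map = np.random.rand(h, w) * 80.0    # depth: one continuous value per pixel, range varies per image
    print(seg_map.dtype, depth_map.dtype)      # integer labels vs floating-point regression targets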

The depth values per pixel can either be in a physical unit (cm, meters) or in relative units. But why have different types of units for the same task? Let's find out.

Metric Depth Estimation (MDE)

Generally considered to be the more popular branch, this version of monocular depth estimation involves providing depth values in an absolute physical unit (e.g. meters).

Fig 1: Depth values change depending on how close objects are. Photo taken from paper https://arxiv.org/pdf/1810.01849.pdf

Intuitively, this may seem more helpful, as the model outputs depth in actual units that reflect how far objects are from the camera. As a result, many applications can use these values directly to make decisions. For example, a robotic arm can use these depth values to estimate where objects are, and since the scale is definitive (meters), it can make accurate movements to pick up various objects.

Relative Depth Estimation (RDE)

Though not as popular, relative depth estimation is an equally important task. Relative depth maps contain per-pixel depth values that are consistent only relative to each other. This means there is no definitive scale: the values only indicate how close or far away objects are compared to one another, and they provide no information about an object's actual distance from the camera.

Fig 2 Relative depth estimation. Photo taken from paper https://www.researchgate.net/publication/305082047_Light_field_completion_using_focal_stack_propagation

From a quick comparison, it may seem that metric depth maps provide more useful information; however, there is a key drawback. When training on the MDE task, it is very difficult to get models to generalize across different scenes, especially when datasets have very different min_depth to max_depth ranges. Take, for example, the NYU-Depth V2 and KITTI datasets: NYU-Depth V2 is an indoor dataset while KITTI is an outdoor one, so the former has a much smaller range of depth values than the latter. If we train a model on both datasets combined, it won't be able to accurately capture the scale information of two completely different domains.

Now what about Relative depth estimation?

Well, for this task you face no such problem. Since there is no scale and all depth values are relative, you can combine datasets and expect the model to generalize well. The catch is that the predicted values carry no absolute meaning, precisely because there is no scale!
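As a concrete illustration, any metric depth map can be turned into a relative one by removing its scale and shift (this is essentially what MiDaS-style relative-depth training assumes). A minimal sketch, assuming a float depth array where 0 marks invalid pixels:

    import numpy as np

    def to_relative(depth_m: np.ndarray) -> np.ndarray:
        # Strip scale and shift from a metric depth map (median / mean-absolute-deviation normalization).
        valid = depth_m > 0
        shift = np.median(depth_m[valid])
        scale = np.mean(np.abs(depth_m[valid] - shift))
        rel = np.zeros_like(depth_m, dtype=np.float32)
        rel[valid] = (depth_m[valid] - shift) / max(scale, 1e-6)
        return rel   # values are only meaningful relative to each other, not in meters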

The solution proposed by ZoeDepth is to integrate both RDE and MDE by first performing relative depth estimation pre-training and then finetuning on metric depth estimation datasets (NYU-Depth V2 and KITTI). To understand this, we first need an overview of MiDaS.

An Overview of MiDaS

MiDaS is a very popular model for monocular depth estimation and is often used as a baseline; its code can be found in the MiDaS repository (https://github.com/isl-org/MiDaS).

The original implementation of MiDaS uses a convolutional backbone; the newer releases, however, use transformer models, namely the DPT model.

Fig 3 : Architecture of the DPT (Dense Prediction Transformer) model, following an encoder-decoder structure. The transformer (ViT) is the basic building block of the encoder here. Image taken from the DPT paper: https://arxiv.org/abs/2103.13413
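If you just want to try MiDaS for relative depth, the repository exposes its models through torch.hub. A minimal sketch, assuming the DPT_Large entry point and a hypothetical example.jpg input (check the repo README for the currently available model names):

    import torch
    import numpy as np
    from PIL import Image

    midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")               # DPT-based MiDaS model
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform
    midas.eval()

    img = np.array(Image.open("example.jpg").convert("RGB"))             # hypothetical input image
    with torch.no_grad():
        prediction = midas(transform(img))                               # relative (inverse) depth, shape (1, H', W')
    print(prediction.shape)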

The use of DPT by MiDaS for monocular depth estimation gives significantly improved performance over the convolutional backbone used originally. According to the DPT paper:

Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network.

The MiDaS repo also provides details on how well its different models perform relative to each other:

Fig 4 : A performance overview of the different transformer models used by MiDaS. Here dpt_beit_large_384 is the backbone used for finetuning by ZoeDepth_NK. Image taken from the repository: https://github.com/isl-org/MiDaS

Now you might be wondering: why are we going through MiDaS if the subject of this article is ZoeDepth? That's because ZoeDepth builds on the MiDaS depth estimation framework while adding a series of new head modules as well as a new training routine.

To understand exactly how ZoeDepth differs, let's go to the next section.

The ZoeDepth Architecture

Fig 5 : The architecture of the ZoeDepth model. Note that ZoeDepth uses the MiDaS depth estimation framework while adding a new head module (the Metric Bins Module) that helps in the calculation of metric depth. Image taken from the ZoeDepth paper: https://arxiv.org/abs/2302.12288

Pre-training and Finetuning

ZoeDepth attempts to combine the two tasks of relative depth estimation and metric depth estimation. According to the paper:

We propose the first approach that combines both worlds (Relative and Metric depth estimation), leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

The 12 datasets on which the pre-training is done are HRWSI, BlendedMVS, ReDWeb, DIML-Indoor, 3D Movies, MegaDepth, WSVD, TartanAir, ApolloScape, IRS, KITTI and NYU Depth V2. Following this, the model is finetuned on the metric depth estimation task, for which it primarily uses NYU Depth V2 and KITTI.
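The pretrained checkpoints can be loaded directly from the repository via torch.hub. A minimal sketch, assuming the ZoeD_NK entry point and the infer_pil helper exposed by the repo (entry points may change, so check the README):

    import torch
    from PIL import Image

    # ZoeD_NK: pre-trained on 12 relative-depth datasets, finetuned on NYU + KITTI.
    zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)
    zoe.eval()

    img = Image.open("example.jpg").convert("RGB")   # hypothetical input image
    depth = zoe.infer_pil(img)                       # numpy array of per-pixel metric depth (meters)
    print(depth.shape, depth.min(), depth.max())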

Metric Bins Module

Another novel feature that the authors introduced is the metric bins module, which acts as a head for predicting the metric depth per pixel. We didn't use this feature for our competition, as the task at hand was based on relative depth estimation. However, to explain this feature, let's look at the diagram below.

Fig 6 : Diagrammatic representation of the metric bins module. The last layer (the one with a spatial dimension of 1/32 of the input image) is the bottleneck layer that provides the bin centers (blue vertical lines), while the other decoder layers provide the attractors (green circles) that pull the bin centers to their final positions. Note that there should be 64 bins, but for representational purposes only 6 lines are shown. Image taken from the ZoeDepth paper: https://arxiv.org/abs/2302.12288

The metric bins module is connected to the MiDaS decoder and predicts the bin centers used for metric depth prediction. The actual bin centers are predicted by the bottleneck layer (the last layer in the diagram above), while the other decoder layers provide attractors that pull the bin centers towards their final positions. Each bin center is adjusted as c′ᵢ = cᵢ + Δcᵢ, where Δcᵢ is an attractor-based adjustment controlled by two hyperparameters, α and γ, and nₐ is the number of attractor points (green circles) per decoder layer.

Each decoder layer (except the last) outputs nₐ attractor points {aₖ : k = 1, …, nₐ} using an MLP at each pixel position. Substituting these values into the attractor formula gives Δcᵢ, from which we obtain the updated bin center locations at each decoder layer.
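To make the attractor mechanism concrete, here is a toy sketch of an inverse-attractor style update, Δcᵢ = Σₖ (aₖ − cᵢ) / (1 + α·|aₖ − cᵢ|^γ), which is our reading of the paper's adjustment; the exact functional form and the values of α and γ should be taken from the paper and code:

    import torch

    def attract(bin_centers, attractors, alpha=300.0, gamma=2.0):
        # bin_centers: (B, N_bins, H, W), attractors: (B, n_a, H, W); alpha/gamma are placeholder values.
        diff = attractors.unsqueeze(2) - bin_centers.unsqueeze(1)        # (B, n_a, N_bins, H, W)
        delta = (diff / (1 + alpha * diff.abs() ** gamma)).sum(dim=1)    # sum the pull of all attractors
        return bin_centers + delta                                       # c'_i = c_i + delta_c_i

    centers = torch.rand(1, 64, 12, 16)        # 64 bin centers per pixel (hypothetical feature-map size)
    attractors = torch.rand(1, 16, 12, 16)     # 16 attractors from the first decoder layer
    print(attract(centers, attractors).shape)  # torch.Size([1, 64, 12, 16])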

According to the paper, the authors use Nₜₒₜₐₗ = 64 bins and {16, 8, 4, 1} attractors for the first four decoder layers. Remember that the bin centers are denoted by the blue vertical lines in Fig 6, and 64 of them are predicted for each pixel. These 64 values per pixel now need to be combined into a single depth value per pixel. This is done by a probability-weighted sum of the bin centers, d = Σᵢ pᵢ·cᵢ, where pᵢ is the probability value assigned to each bin center and cᵢ is the location of that bin center.
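A minimal sketch of this final step (illustrative shapes, not the repo's exact code): softmax the per-pixel bin logits into probabilities, then take the probability-weighted sum of the bin centers.

    import torch

    def bins_to_depth(bin_logits, bin_centers):
        # bin_logits, bin_centers: (B, N_bins, H, W)
        probs = torch.softmax(bin_logits, dim=1)      # p_i per pixel, sums to 1 over the bins
        return (probs * bin_centers).sum(dim=1)       # d = sum_i p_i * c_i, shape (B, H, W)

    logits = torch.rand(1, 64, 12, 16)
    centers = torch.rand(1, 64, 12, 16) * 10.0        # hypothetical bin centers in meters
    print(bins_to_depth(logits, centers).shape)       # torch.Size([1, 12, 16])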

Loss Function

The paper mentions that the scale-invariant log (SILog) loss was used. With gᵢ = log d̂ᵢ − log dᵢ computed over the T valid pixels, it takes the form L = α·sqrt( (1/T)·Σᵢ gᵢ² − (λ/T²)·(Σᵢ gᵢ)² ), where λ controls the degree of scale invariance and α scales the loss value.

There were other loss functions implemented in the code as well, such as the L1 loss (otherwise known as the mean absolute error loss), a scale-and-shift-invariant loss and a domain classifier loss (which implemented a cross entropy loss for ZoeDepth_NK). I did experiment with the other losses; however, apart from the domain classifier loss, none of them stabilized. The SILog loss given above stabilized well, and I will come back to it in the results section.
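For reference, here is a minimal PyTorch version of the SILog loss as described above (the repo has its own implementation; λ = 0.85 and the factor of 10 below are common defaults, not necessarily the exact values used):

    import torch

    def silog_loss(pred, target, lam=0.85, eps=1e-6):
        # Scale-invariant log loss computed over valid (target > 0) pixels.
        mask = target > 0
        g = torch.log(pred[mask] + eps) - torch.log(target[mask] + eps)
        dg = (g ** 2).mean() - lam * g.mean() ** 2
        return 10.0 * torch.sqrt(dg)

    pred = torch.rand(1, 1, 12, 16) * 10 + 0.1
    target = torch.rand(1, 1, 12, 16) * 10
    print(silog_loss(pred, target))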

Custom Training on the SUADD'23 Dataset

The task focused on relative depth estimation (RDE) with the images containing aerial views as seen from drones flying at different heights. We used a 90:10 train-test split and kept the image spatial dimensions unchanged. Since we purposely kept a high resolution, we had to use a small batch size during training. We also kept most of the hyperparameters unchanged during the initial training.

Fig 7 : An example of an image that is part of the dataset. Taken from Suadd’23 dataset — https://www.aicrowd.com/challenges/scene-understanding-for-autonomous-drone-delivery-suadd-23/problems/mono-depth-perception/dataset_files

If you want to train on your own dataset, make sure your ground truth labels are in relative depth map form (as in Fig 2). Our data was very similar to the NYU Depth V2 dataset: the depth values were given as uint16 values, with invalid pixels represented by 0.
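For reference, this is roughly how such ground truth files can be read and masked (a sketch; the file name is hypothetical, and any scale you divide by depends on how your dataset encodes depth):

    import numpy as np
    from PIL import Image

    gt = np.array(Image.open("depth_0001.png"))   # hypothetical uint16 depth map
    valid = gt > 0                                # 0 marks invalid / missing depth
    depth = gt.astype(np.float32)
    depth[~valid] = 0.0
    print(gt.dtype, float(valid.mean()))          # uint16, fraction of valid pixels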

Fig 8 : This config can be found in config.py; make sure to set your data_path, gt_path, data_path_eval and gt_path_eval. The nyudepthv2_train/test files also need to be modified to point to your data. Photo taken by Author.

The config.py file contains a set of datasets that the ZoeDepth model can train on; by default it trains on the NYU dataset. Ideally we should have created our own dictionary of config variables for our SUADD dataset, but we chose not to and instead modified the NYU dictionary with our updated file names.

Once you're done configuring your variables in the config.py file, you can also check the train_mono.py/train_mix.py files to change config variables there. After doing that, run the training command.
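In our case the command looked roughly like the following (this is our best recollection of the repo's CLI; the exact flags are documented in the ZoeDepth README, so double-check there):

    python train_mono.py -m zoedepth -d nyu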

This will run the ZoeDepth model with a single head. Do remember that the head isn't really necessary for this task, as we are not doing metric depth estimation; however, the training will update the head parameters regardless.

If you want to run training with ZoeDepth_NK, one additional change is required. In place of the data loader import on line 30 of train_mix.py, we must import a different data loader; simply add the import shown below near the top.
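Assuming the repository's current layout, the import we added looked like this (adjust the module path if the repo has been restructured):

    from zoedepth.data.data_mono import DepthDataLoader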

We now need to use this new data loader instead. Change the data loader construction to something like the following.
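Roughly, the construction we ended up with was (variable names follow our copy of train_mix.py and may differ in yours):

    train_loader = DepthDataLoader(config, "train").data
    test_loader = DepthDataLoader(config, "online_eval").data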

We replaced the MixedNYUKITTI data loader with the DepthDataLoader. The reason is that the ZoeDepth_NK training script expects two different datasets (NYU and KITTI), one for each of its two heads. Since we want to finetune on a single dataset, we use the DepthDataLoader instead.

After making this change, run the following command to start training ZoeDepth_NK using a single head.
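Again, this is our best recollection of the CLI (verify the flags against the README):

    python train_mix.py -m zoedepth_nk -d nyu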

And that is how you start the training! Now let's look at the results in the next section.

Results and Observations

We initially trained the ZoeDepth model keeping most of the hyperparameters intact; the loss function used here was the SILog loss, which converged well.

Graph of the SILog training loss of ZoeDepth trained for 20 epochs. Photo taken by Author.

In addition, the test SILog loss, which was evaluated every quarter of an epoch, converged equally well.

Graph of the SILog test loss of ZoeDepth trained for 20 epochs. Photo taken by Author.

To give more detail on the metrics measured, the following graphs depict the absolute relative error and the root mean square error. These also converged as expected.

Graph of abs_rel error and rmse of ZoeDepth trained for 20 epochs. Photo taken by Author.

By default, the model is configured to use the AdamW optimizer and a OneCycleLR scheduler during training of both the ZoeDepth and ZoeDepth_NK models. During our tuning step, we did experiment with different optimizers and schedulers; however, we found no improvement over the default configuration.

You can try this too by going to the base_trainer.py file and changing the optimizer, scheduler and/or other hyperparameters in the BaseTrainer() class. The ZoeDepth and ZoeDepth_NK trainers inherit from this base class, meaning the training routine is mostly the same for both models.
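As an illustration of the kind of change you can make there (a simplified sketch with placeholder hyperparameter values, not the repo's actual BaseTrainer code), the default pairing looks conceptually like this:

    import torch

    def build_optimizer_and_scheduler(model, lr=1e-4, epochs=20, steps_per_epoch=1000):
        # Default setup: AdamW paired with a OneCycleLR schedule stepped every iteration.
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=steps_per_epoch)
        return optimizer, scheduler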

However, one interesting difference we observed in the training code of ZoeDepth and ZoeDepth_NK is the use of a cross-entropy-based domain classifier loss for ZoeDepth_NK. I didn't quite understand the use of this loss function, and when using it for training, the loss converged after only a few steps.

Graph of Domain Classifier loss. Photo taken by Author.

Since it converged early on, it didn't seem to have any effect after a few training steps, meaning that only the SILog loss was effectively driving the optimization. I don't fully understand why this happened; if you have any insights, do write in the comments.

We also tried including other losses, such as the L1 loss and a scale-and-shift-invariant loss; however, there was no improvement over the original SILog loss. For our training, the SILog loss turned out to be the best choice.

Another config variable we tuned was the max_depth variable found in config.py.

Fig 9: NYU dataset as described in config.py. Photo taken by Author.

The max_depth variable serves as an upper bound on depth values, meaning that values above 10 predicted by ZoeDepth were clipped. However, this default was chosen for NYU Depth V2, which is an indoor dataset.
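For context, the dataset entry we edited looked roughly like this (field names follow the NYU entry as we remember it; treat the exact keys and paths as placeholders):

    # Illustrative only: the kind of entry we modified in config.py.
    nyu_config = {
        "dataset": "nyu",
        "min_depth": 1e-3,
        "max_depth": 10,                          # default upper bound, chosen for indoor NYU scenes
        "data_path": "path/to/suadd/train/images",
        "gt_path": "path/to/suadd/train/depth",
        "data_path_eval": "path/to/suadd/val/images",
        "gt_path_eval": "path/to/suadd/val/depth",
    }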

We increased the max_depth value, and it did lead to slightly better performance on our data. We hypothesize that this is because our dataset is an outdoor one, where the range of relative depth values is much higher. By increasing max_depth, we relaxed the model's constraints and allowed it to predict larger values, leading to better results.

In short, understanding the config variables helped us tune the model more effectively, and we improved our performance by adjusting the right ones.

Concluding Notes

This being our first competition on monocular depth estimation, we learnt a lot through continuous experimentation and reading. The paper explained the model architecture and method brilliantly, and for this I sincerely thank the authors of the paper and repo for making things so easy to understand.

I hope this article served its purpose of explaining the task and model used. Thank you for reading.

References

Vision Transformers for Dense Prediction (DPT): https://arxiv.org/abs/2103.13413

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer (MiDaS): https://arxiv.org/abs/1907.01341v3

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth: https://arxiv.org/abs/2302.12288
