
Monocular Depth Estimation using U-Net

A complete walk-through of depth estimation with implementation.

Bala Manikandan
6 min read · Jul 12, 2021

For a long time we have relied on plain spatial images, and while that has done millions of good, it has one little constraint: an image cannot locate itself in the real world, which is essential for recent complex applications like Elon Musk's cars, Jeff's robots, or even Boston Dynamics' Spot dogs. So it is time to move on and look into something else, something more advanced and more suitable. When I just sit and ponder on that, my inner voice says "Increase the dimensions", echoing twice or thrice. I don't complain about that, because that's how we do machine learning, right?

In fact, someone actually thought about it and increased the dimension of the spatial image in an atypical way. I believe 'dimension' is not quite the right word: they encoded how far each object is placed from the camera as pixel values, whether in meters, inches, or whatever unit we choose. The image is 2-dimensional with 'x, y' axes, and the pixel values represent the 3rd dimension, the 'z' axis.

There are stereo cameras specifically designed for this, with two lenses side by side like human eyes. But most cameras in our world have a single lens, so we are going to tackle this problem on the software side; more exactly, we will imitate the functionality of a stereo camera. And AI does a damn good job at that.

Our main task is to create a depth map from an RGB image, similar to the one produced by a stereo camera. This is an active area of research with more than 100 papers released every year; in this article we focus on the paper titled "High Quality Monocular Depth Estimation via Transfer Learning". The paper adopts the simple and powerful "U-Net" architecture for this task and, going by the title, part of the architecture is a "DenseNet-169" pre-trained on the ImageNet database. We will reproduce the methodology, set up an experiment as proposed in the paper, and see through the results it generates.

The scope of this article is to discuss the three main parts of this system in detail: the data pipeline, the model architecture, and the training pipeline. Also, for the best experience (and fun, and more fun), I would advise running the cells in the accompanying Kaggle or Colab notebooks.

Data pipeline:

We will be using the NYU-Depth-V2 dataset used in the official paper, which consists of indoor scenes that describe objects well in the context of the depth-estimation problem. Instead of looking at a particular object in the image, we are more interested in where the objects are placed within the entire scene. The dataset consists of video frames from different rooms like bedrooms, kitchens, study rooms, offices and so on. A sample image pair is shown below.

Sample data | with plasma red color map

This dataset consists of only indoor images. Did you notice that I mentioned "see through the results" earlier? I thought of training the architecture on this indoor dataset and assessing its generalization capability by passing in an outdoor image or video, because why not?!

Some dataset stats: it contains 50,688 pairs for training, which we split 80:20 for validation, plus a reserved set of 640 pairs for testing. As the paper proposes, we keep the RGB images at 640x480 and the depth maps at 320x240. We normalize the pairs to the range [0, 1] and augment pairs with horizontal flips with 0.5 probability.

The paper advises using a batch size of 8, so the custom data generator class produces data tuples of shape (8, 480, 640, 3) for images and (8, 240, 320, 1) for depth maps.
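To make the shapes above concrete, here is a minimal sketch of such a generator built on tf.keras.utils.Sequence. The loaders load_rgb and load_depth are hypothetical placeholders for whatever file-reading and resizing you use; only the batching, normalization assumption, and paired horizontal flip are shown.

```python
import numpy as np
import tensorflow as tf

class DepthDataGenerator(tf.keras.utils.Sequence):
    """Sketch: yields batches of (8, 480, 640, 3) images and (8, 240, 320, 1) depth maps."""

    def __init__(self, rgb_paths, depth_paths, batch_size=8, augment=True):
        self.rgb_paths, self.depth_paths = rgb_paths, depth_paths
        self.batch_size, self.augment = batch_size, augment

    def __len__(self):
        return len(self.rgb_paths) // self.batch_size

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        images, depths = [], []
        for rgb_path, depth_path in zip(self.rgb_paths[sl], self.depth_paths[sl]):
            rgb = load_rgb(rgb_path)        # hypothetical loader -> (480, 640, 3), scaled to [0, 1]
            depth = load_depth(depth_path)  # hypothetical loader -> (240, 320, 1), scaled to [0, 1]
            if self.augment and np.random.rand() < 0.5:
                # Flip image and depth map together so the pair stays consistent
                rgb, depth = rgb[:, ::-1], depth[:, ::-1]
            images.append(rgb)
            depths.append(depth)
        return np.stack(images), np.stack(depths)
```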

Model architecture:

U-Net was first proposed for localizing abnormalities in medical images in this paper. It is capable of pixel-level localization and of distinguishing unique patterns. It has a 'U'-shaped structure in which one part is termed the encoder and the other the decoder, and each block of the decoder is connected to the corresponding block of the encoder through skip connections. These connections give it richer latent representations than its traditional counterpart: it uses information from the input space and all of its intermediate representations to map the sample onto a well-defined latent space and then computes the output from it. Skip connections have been experimentally validated in many pieces of research to mitigate the degradation problem.

Source: Paper

As in the paper, we use the "DenseNet-169" architecture as our encoder with pre-trained weights. The decoder starts with a 1x1 convolutional layer with the same number of output channels as the output of the truncated encoder, followed by 4 upsampling blocks, each comprising a bilinear upsampling layer followed by two 3x3 convolutional layers with the number of output filters set to half the number of input filters. In each block, the output of the corresponding pooling layer from the encoder is concatenated with the output of the upsampling layer. Finally, each convolutional layer is paired with a Batch Normalization layer, and each upsampling block, except for the last one, is followed by a leaky ReLU activation function.
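The sketch below builds this encoder-decoder in Keras following the description above, not the exact official code. The skip-connection layer names ('conv1/relu', 'pool1', 'pool2_pool', 'pool3_pool') are taken from the Keras DenseNet-169 implementation and are assumptions worth double-checking, as is the leaky ReLU slope of 0.2.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_depth_unet(input_shape=(480, 640, 3)):
    # DenseNet-169 encoder pre-trained on ImageNet, truncated before the classifier
    encoder = tf.keras.applications.DenseNet169(
        include_top=False, weights='imagenet', input_shape=input_shape)

    # Intermediate encoder outputs used as skip connections (names assumed from Keras)
    skip_names = ['conv1/relu', 'pool1', 'pool2_pool', 'pool3_pool']
    skips = [encoder.get_layer(n).output for n in skip_names]

    x = encoder.output                              # 15x20x1664 feature maps
    x = layers.Conv2D(1664, 1, padding='same')(x)   # 1x1 conv, same channel count

    def up_block(x, skip, filters, apply_act=True):
        x = layers.UpSampling2D(interpolation='bilinear')(x)  # bilinear 2x upsampling
        x = layers.Concatenate()([x, skip])                    # merge with encoder skip
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        if apply_act:                                          # leaky ReLU after every block but the last
            x = layers.LeakyReLU(0.2)(x)
        return x

    filters = 1664
    for i, skip in enumerate(reversed(skips)):      # deepest skip first
        filters //= 2                               # 832 -> 416 -> 208 -> 104 output filters
        x = up_block(x, skip, filters, apply_act=(i < 3))

    depth = layers.Conv2D(1, 3, padding='same', name='depth_out')(x)  # 240x320x1 depth map
    return Model(encoder.input, depth)

model = build_depth_unet()
```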

This is a deep architecture with 1664 feature maps in its last encoder layer. On a Colab Pro instance with a V100 GPU and 25 GB of RAM, each epoch takes around 60 minutes. For a smaller architecture with no more than 512 feature maps, I encourage you to refer to the implementation in this repo. Experimenting with both architectures, the training and validation metrics land in a similar range, but the generalization capability differs: the deeper model performs much better, with about 10% higher accuracy on the held-out test set, and its depth maps are sharper.

Training Pipeline:

I start with the most decisive factor for the outcome of this problem, the loss function. We use the custom loss function as proposed in the paper, combining three measures: the structural similarity between the gray-scale rendered ground-truth and predicted maps, the F1-score of the Sobel-gradient maps thresholded at 0.5 (to get sharp edges), and the cosine distance between the ground-truth and predicted maps. By doing this, we make the network learn the intra- and inter-relationships of the maps while preserving strong, sharp edges.
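Below is a minimal sketch of such a combined loss in TensorFlow, following the three terms described above rather than the exact code in the repo. The term weights, the soft (sigmoid) thresholding used to keep the F1 term differentiable, and the epsilon values are all assumptions.

```python
import tensorflow as tf

def depth_loss(y_true, y_pred, w_ssim=1.0, w_edges=1.0, w_cos=0.1):
    """Sketch of the combined loss: SSIM + edge F1 + cosine distance (weights assumed)."""
    # 1) Structural similarity between ground-truth and predicted depth maps
    ssim_term = tf.reduce_mean(1.0 - tf.image.ssim(y_true, y_pred, max_val=1.0))

    # 2) Edge agreement: Sobel gradient magnitude, softly thresholded at 0.5,
    #    compared with an F1 score so the term stays differentiable
    def edge_map(x):
        g = tf.image.sobel_edges(x)                              # (B, H, W, 1, 2)
        mag = tf.sqrt(tf.reduce_sum(tf.square(g), axis=-1) + 1e-8)
        return tf.sigmoid(50.0 * (mag - 0.5))                    # soft threshold around 0.5
    e_true, e_pred = edge_map(y_true), edge_map(y_pred)
    tp = tf.reduce_sum(e_true * e_pred)
    precision = tp / (tf.reduce_sum(e_pred) + 1e-8)
    recall = tp / (tf.reduce_sum(e_true) + 1e-8)
    edge_term = 1.0 - 2.0 * precision * recall / (precision + recall + 1e-8)

    # 3) Cosine distance between the flattened ground-truth and predicted maps
    flat_true = tf.reshape(y_true, [tf.shape(y_true)[0], -1])
    flat_pred = tf.reshape(y_pred, [tf.shape(y_pred)[0], -1])
    cos_term = tf.reduce_mean(
        1.0 + tf.keras.losses.cosine_similarity(flat_true, flat_pred))

    return w_ssim * ssim_term + w_edges * edge_term + w_cos * cos_term
```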

We will use an Adam optimizer with a weight decay of 1e-6 and a starting learning rate of 0.0001 with polynomial decay, plus some checkpoints to keep track of progress. Finally, we set all the encoder layers to be trainable and use the famous model.fit() to start training for 10 epochs. And now, lean back and enjoy those pretty little numbers changing while the progress bar pushes itself right, maybe with popcorn!
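A sketch of that training setup is shown below. The end learning rate of the polynomial schedule and the checkpoint filename are assumptions, and AdamW is used here for the weight-decay behaviour (it ships with recent TensorFlow releases); train_generator and val_generator are the data generators from the data pipeline section.

```python
import tensorflow as tf

# Learning rate: start at 1e-4 and decay polynomially over the 10 epochs
steps_per_epoch = len(train_generator)
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-4,
    decay_steps=steps_per_epoch * 10,
    end_learning_rate=1e-5)              # end value is an assumption

optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=1e-6)

model.trainable = True                   # keep the entire encoder trainable
model.compile(optimizer=optimizer, loss=depth_loss)

# Checkpoint the best model by validation loss
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'depth_unet_best.h5', monitor='val_loss', save_best_only=True)

model.fit(train_generator, validation_data=val_generator,
          epochs=10, callbacks=[checkpoint])
```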

After around 12 hours of training, I got around 95% validation and 85% test set accuracy. Looking at the test set predictions, I could see these,

Colormap: Plasma Red

which are pretty decent, with sharp and tight edges. Also, as mentioned earlier, I passed in a video recorded outdoors and got something like this,

Colormap: OpenCv JET

Immediately we can notice the flickering between frames. That is because this model is designed for prediction on a single image, so I ran the model on each frame of the video and combined the results, and there is some prediction variation between frames. For a smoother transition, we would have to look at models that take the temporal domain into consideration.
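For reference, the frame-by-frame inference loop can look like the sketch below, using OpenCV and the JET colormap mentioned in the caption. The video filename is a placeholder, and the per-frame min-max normalization before colormapping is an assumption for visualization only.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture('outdoor_walk.mp4')     # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Resize to the model's input size and scale to [0, 1]
    rgb = cv2.cvtColor(cv2.resize(frame, (640, 480)), cv2.COLOR_BGR2RGB) / 255.0
    depth = model.predict(rgb[np.newaxis, ...])[0, ..., 0]   # (240, 320) depth map

    # Normalize each predicted map independently, then apply the JET colormap
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
    vis = cv2.applyColorMap((d * 255).astype(np.uint8), cv2.COLORMAP_JET)
    cv2.imshow('depth', vis)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```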

The code is available on the Colab or Kaggle platforms. Some references include the official paper and its implementation.

Thanks for taking the time to visit this line. :)
