(Review) High Quality Monocular Depth Estimation via Transfer Learning
Hey, hey, heeeey, hold on to your hats folks! I've got the inside scoop on a seriously intriguing paper, but before we dive into the nitty-gritty, let me pique your curiosity with a question: have you ever wondered how computers manage to turn a plain flat image into a mesmerizing 3D wonderland?
Actually, it's a mystery to most of us.
Well, I've got a paper for you that tackles the building block of that challenge: Monocular Depth Estimation!
1- Introduction
In computer vision, understanding a 3D scene from a 2D image the way humans do requires estimating pixel-level depths. This intricate task means gauging the distance of every single pixel, much as our brains decipher the world around us! As depicted in the provided figure, yellow pixels correspond to objects in close proximity, whereas bluish tones indicate objects positioned at a greater distance.
But how are computers able to perform this so smoothly and with such high accuracy? Sure, they employ convolutional neural networks, or CNNs (the digital brainiacs that help computers "see" depth). But, and here's the twist, it's not all smooth sailing: sometimes their depth maps turn out a tad blurry, not crystal clear like our vision.
This is where the authors of this paper rolled up their sleeves and took up the mantle. They've concocted a simpler yet more effective solution: a turbocharged convolutional neural network (CNN) with a streamlined architecture, underpinned by transfer learning.
This innovation sets the stage for a remarkable leap forward. The key lies in its ability to tap into the reservoir of knowledge stored within pretrained models, seamlessly integrating it into the current model's framework.
So, this paper proposes three main contributions:
- a transfer learning-based network architecture for more accurate depth estimations, enhancing object boundary capture while using fewer parameters and iterations
- a corresponding loss function, learning strategy, and data augmentation technique to expedite learning
- a new testing dataset of photorealistic synthetic indoor scenes with precise ground truth, allowing for robust evaluation of depth-estimating CNNs
2- Related Work
Reconstructing 3D scenes from RGB images is a challenging endeavor, compounded by issues like incomplete coverage and intricate materials. Traditional approaches often require special hardware or multiple vantage points, but recent CNN techniques excel at generating real-time depth maps from individual images.
Some of the latest advancements are:
- Multi-View Stereo Reconstruction: This involves using CNNs to handle scenarios where we have pairs of images or sequences of frames
- Transfer learning: this methodology reuses the parameters of a pretrained model, originally designed for classification, in the encoder. The focus is on preserving higher spatial resolution, leading to sharper depth estimations, especially when combined with skip connections.
- Encoder-decoder networks: these networks play a big role in many vision tasks and show promise in both supervised (with labeled data) and unsupervised (without labeled data) depth estimation. They are a key part of the ongoing improvements in this field.
3- Proposed Method
3.1. Network architecture
The proposed depth estimation network employs an encoder-decoder architecture, as shown in Fig. 2. The encoder is a DenseNet-169 pre-trained on ImageNet that converts the input RGB image into a feature vector. This vector then passes through up-sampling layers to generate the final depth map at half the input resolution. The decoder comprises these up-sampling layers together with skip connections; unlike recent methods, Batch Normalization and other advanced layers have been omitted.
Note: the experiments show that a basic decoder comprising a 2× bilinear upsampling followed by two standard convolutional layers achieves excellent results.
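As a rough sketch (not the authors' exact code), one such up-sampling block could look like this in Keras; the filter count, the leaky-ReLU slope, and the function name are illustrative assumptions on my part:

```python
import tensorflow as tf
from tensorflow.keras import layers

def up_block(x, skip, filters):
    # 2x bilinear upsampling, then concatenate the encoder's skip feature map
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    x = layers.Concatenate()([x, skip])
    # two standard 3x3 convolutions (no Batch Normalization, per the paper)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return x
```

Stacking a few of these blocks on top of the DenseNet-169 features, each consuming the matching encoder activation as `skip`, is all the decoder needs.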
3.2. Loss function
They aim to create a loss function that strikes a balance between two critical aspects:
- first, minimizing the difference between the depth values ŷ predicted by the network and the ground-truth depth values y
- second, penalizing distortions in high-frequency details within the depth image, which often correspond to object boundaries in the scene
Thus, a weighted sum of three loss terms is adopted:
L(y, ŷ) = λ Ldepth(y, ŷ) + Lgrad(y, ŷ) + LSSIM(y, ŷ)
The first loss term, Ldepth, is the point-wise L1 loss defined on the depth values: it computes the absolute difference between the predicted and ground-truth depth at each point in the image, and the sum of these differences is normalized by the total number of points n.
The second loss term, Lgrad, is the L1 loss defined over the image gradient g of the depth image: it computes the absolute differences between the x and y components of the gradients of the predicted and ground-truth depth maps.
Lastly, LSSIM incorporates the Structural Similarity index (SSIM), a common metric for image reconstruction tasks that captures the structural and textural similarity between two images.
These three loss terms are combined into the overall loss function L(y, ŷ) as the weighted sum given above.
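A minimal TensorFlow sketch of this combined loss might look as follows; the λ value, the max-depth constant, and the rescaling of the SSIM term into [0, 1] are assumptions of mine, not verbatim from the paper:

```python
import tensorflow as tf

def total_loss(y_true, y_pred, lam=0.1, max_depth=10.0):
    # L_depth: point-wise L1 loss on the depth values
    l_depth = tf.reduce_mean(tf.abs(y_pred - y_true))
    # L_grad: L1 loss over the x and y image gradients of the depth maps
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    l_grad = tf.reduce_mean(tf.abs(dy_p - dy_t) + tf.abs(dx_p - dx_t))
    # L_SSIM: structural dissimilarity, rescaled into [0, 1]
    l_ssim = tf.reduce_mean(
        (1.0 - tf.image.ssim(y_true, y_pred, max_val=max_depth)) / 2.0)
    # weighted sum: L = lam * L_depth + L_grad + L_SSIM
    return lam * l_depth + l_grad + l_ssim
```

Note how the gradient and SSIM terms both push the network toward crisp object boundaries, while the L1 term keeps the absolute depths honest.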
Now, here is the remaining problem we need to fix with these loss terms: they can grow too large when we're dealing with large ground-truth depth values. To handle this, there's a smart trick: instead of using the depth values directly, we use their reciprocal. A new target depth map y is defined by taking the reciprocal of the original depth map Yorig, i.e. y = m/Yorig, where m is the maximum depth in the scene.
3.3. Augmentation Policy
They've only applied horizontal flipping to images (with a 0.5 probability), because other geometric transformations might not make sense in terms of depth interpretation. Vertical flipping, for instance, applied to an image of an indoor scene may not contribute to learning the expected statistical properties (e.g. the geometry of floors and ceilings).
While image rotation could help, they've avoided it since it can create invalid data.
In terms of color, swapping color channels such as red and green (with a 0.25 probability) enhances performance effectively.
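The policy above can be sketched in NumPy like this (the function name and RNG handling are my own; the 0.5 and 0.25 probabilities come from the paper):

```python
import numpy as np

def augment(image, depth, rng=None):
    rng = rng or np.random.default_rng()
    # horizontal flip with probability 0.5, applied to image and depth alike
    if rng.random() < 0.5:
        image, depth = image[:, ::-1], depth[:, ::-1]
    # swap color channels with probability 0.25 (image only; depth is unaffected)
    if rng.random() < 0.25:
        image = image[..., rng.permutation(3)]
    return image, depth
```

The key design point is that geometric changes must be mirrored in the depth map, while photometric changes must not touch it.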
4- Experiments
4.1. Datasets
4.1.1. NYU Depth v2: an indoor-scenes dataset of 640 × 480 images; it contains 120K training samples and 654 testing samples. The model is trained on a 50K subset with inpainted missing depth values, and predictions are at half resolution. For the testing phase, depths are predicted for the full test images and upscaled by 2× to evaluate against the ground truth.
4.1.2. KITTI: an outdoor-scenes dataset with 1241 × 376 stereo images and corresponding 3D laser scans. The model is trained on a 26K subset with inpainted missing depth values, and images are scaled to fit the encoder architecture.
For the testing phase, the output depth image is scaled and averaged with the prediction for its mirror image to produce the final output.
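The test-time mirror trick amounts to this (a sketch; `model` here stands for any function mapping an image to a depth map):

```python
import numpy as np

def predict_with_mirror(model, image):
    # average the normal prediction with the un-flipped prediction
    # obtained from the horizontally mirrored input
    pred = model(image)
    pred_mirror = model(image[:, ::-1])[:, ::-1]
    return 0.5 * (pred + pred_mirror)
```

This is a cheap form of test-time augmentation: two forward passes, and the averaging tends to smooth out flip-dependent errors.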
4.2. Implementation Details
- implemented using TensorFlow and trained on four NVIDIA TITAN Xp GPUs
- the encoder is based on DenseNet-169 pretrained on ImageNet, while decoder weights are randomly initialized
- the ADAM optimizer is employed with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999
- training batch size = 8
- training on the NYU Depth v2 dataset → 1 million iterations over 20 hours
- for the KITTI dataset → 300K iterations over 9 hours
4.3. Evaluation
4.3.1. Quantitative Evaluation: the method is quantitatively compared against the state of the art using six standard metrics
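The six metrics are not spelled out in this summary, but depth-estimation papers conventionally use the threshold accuracies δ < 1.25, 1.25², 1.25³, the absolute relative error, RMSE, and the log10 error; a sketch under that assumption:

```python
import numpy as np

def depth_metrics(gt, pred):
    # threshold accuracies: fraction of pixels whose ratio to the ground
    # truth stays below 1.25, 1.25^2 and 1.25^3
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)     # absolute relative error
    rmse = np.sqrt(np.mean((gt - pred) ** 2))     # root mean squared error
    log10 = np.mean(np.abs(np.log10(gt / pred)))  # mean log10 error
    return d1, d2, d3, abs_rel, rmse, log10
```

Higher is better for the δ accuracies; lower is better for the three error terms.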
4.3.2. Qualitative Results: three experiments are conducted to assess result quality on the NYU Depth v2 test set
- the first experiment uses a perception-based metric, measuring structural similarity (mSSIM) between ground truth and predicted depth maps
- the second experiment evaluates edge formation by comparing gradient magnitude images and computing F1 scores
- the third experiment calculates mean cosine distance between normal maps extracted from predicted and ground truth depth images
As can be seen, the proposed approach produces depth estimations of higher quality, where depth edges better match those of the ground truth, and with significantly fewer artifacts.
4.4. Comparing Performance
- the depth estimation networkās performance is compared to state-of-the-art on the NYU Depth v2 dataset (Tab. 1) and the KITTI dataset (Tab. 2).
- on NYU Depth v2, the model outperforms previous state-of-the-art in most metrics, using fewer parameters, training iterations, and data
- scaling predictions to account for scene scale errors aligns with state-of-the-art results on all metrics
- on NYU Depth v2, the method surpasses state-of-the-art based on defined quality metrics (Tab. 3), considering methods with pre-trained models and code
- on the KITTI dataset, the method ranks second best on the standard metrics; however, due to the nature of the sparse depth maps, the learning process doesn't converge well for such data
despite rankings, visual results (Fig. 3) demonstrate superior depth map quality compared to state-of-the-art
4.5. Ablation Studies
(analyzing the model's components and highlighting the effects of changes)
- encoder depth : using DenseNet-201 yields lower validation loss, but itās slow and memory-intensive
- decoder depth: halving the decoder's features reduces performance and creates instability
- color augmentation: disabling color channel swapping leads to a significant performance drop and to overfitting (color augmentation's impact on neural networks is a potential future research area)
Variations of the standard model's components were tested via validation loss on the NYU Depth v2 dataset (750K iterations):
- pre-trained model:
- compared effects of encoder initialization: random weights vs. pre-trained ImageNet
- a scratch-initialized encoder led to significantly higher validation loss (purple)
- skip connections:
- explored removing the skip connections between encoder and decoder layers
- the skip-less model performed worse, with higher validation loss (green)
- batch size:
- analyzed the impact of different batch sizes on performance
- validation losses for batch sizes 2 and 16 were compared to batch size 8 in the standard model (red and blue); batch size 8 exhibited the best performance while keeping training time reasonable
4.6. Generalizing to Other Datasets:
- Unreal-1K, a new dataset with photo-realistic indoor scenes and accurate ground-truth depths, has been introduced; it was sampled from Unreal Engine renderings
- they compared the NYU Depth v2-trained model to two other supervised methods trained on the same dataset; the aim is to assess how well models generalize across different data distributions
- quantitative and qualitative comparisons show the methodās superiority in terms of average errors and mSSIM.
5- Conclusion
Here is a revised set of takeaways extracted from this insightful paper:
- a Convolutional neural network for depth map estimation from single RGB images has been introduced in this paper
- leveraged recent advances in network architecture and pre-trained models
- the constructed encoder initialized with meaningful weights outperforms costly multistage depth networks
- achieved state-of-the-art performance on NYU Depth v2 and a new Unreal-1K dataset
- the main focus of this paper is on producing higher-quality depth maps that accurately capture object boundaries
- transfer learning will leverage future advances in backbone architecture designs to elevate the overall performance of the encoder-decoder network
- the proposed architecture can pave the way for embedding quality depth estimation in compact devices
And finally, I want to express my sincere gratitude to each and every one of you for taking the time to read my blog. Your engagement and interest are truly appreciated.
Thank you very much! ✨
References :
- High Quality Monocular Depth Estimation via Transfer Learning
Ibraheem Alhashim (KAUST) , Peter Wonka (KAUST)