(Review) High Quality Monocular Depth Estimation via Transfer Learning
Hey, hey, heeeey, hold on to your hats folks! I've got the inside scoop on a seriously intriguing paper, but before we dive into the nitty-gritty, let me pique your curiosity with a question: have you ever wondered how computers manage to turn a plain flat image into a mesmerizing 3D wonderland?
Actually, it's a mystery to most of us.
Well, I've got a paper for you that tackles the building block of that challenge: Monocular Depth Estimation!
1- Introduction
In computer vision, understanding a 3D scene from a 2D image the way humans do requires estimating pixel-level depths. This intricate task means gauging the distance of every single pixel, much as our brains decipher the world around us! As depicted in the provided figure, yellow pixels correspond to objects in close proximity, whereas bluish tones indicate objects positioned at a greater distance.
But how are computers able to perform this so smoothly and with such high accuracy? Sure, they employ convolutional neural networks, or CNNs (the digital brainiacs that help computers "see" depth). But, and here's the twist, it's not all smooth sailing: sometimes their depth maps turn out a tad blurry, not crystal clear like our vision.
This is where the authors of this paper rolled up their sleeves and took up the mantle. They've concocted a simpler yet more effective solution: a turbocharged convolutional neural network (CNN) with a streamlined architecture, underpinned by transfer learning.
This innovation sets the stage for a remarkable leap forward. The key lies in its ability to tap into the reservoir of knowledge stored within pretrained models, seamlessly integrating it into the current model's framework.
So, this paper proposes three main contributions:
- a transfer learning-based network architecture for more accurate depth estimations, enhancing object boundary capture while using fewer parameters and iterations
- a corresponding loss function, learning strategy, and data augmentation technique to expedite learning
- a new testing dataset of photorealistic synthetic indoor scenes with precise ground truth, allowing for robust evaluation of depth-estimating CNNs
2- Related Work
Reconstructing 3D scenes from RGB images is a challenging endeavor, compounded by issues like incomplete coverage and intricate materials. Traditional approaches often require special hardware or multiple vantage points, but recent CNN techniques excel at generating real-time depth maps from individual images.
Some of the latest advancements are:
- Multi-View Stereo Reconstruction: This involves using CNNs to handle scenarios where we have pairs of images or sequences of frames
- Transfer learning: this methodology reuses the parameters of a pretrained model, originally designed for classification, in the encoder. The focus is on preserving higher spatial resolution, leading to sharper depth estimations, especially when combined with skip connections.
- Encoder-decoder networks: these networks play a big role in many vision tasks and show promise in both supervised (with labeled data) and unsupervised (without labeled data) depth estimation. They are a key part of the ongoing improvements in this field.
3- Proposed Method
3.1. Network architecture
The proposed depth estimation network employs an encoder-decoder architecture, as shown in Fig. 2. The encoder is a DenseNet-169 pre-trained on ImageNet that converts the input RGB image into a feature vector. This vector then passes through up-sampling layers to generate the final depth map at half the input resolution. The decoder comprises these up-sampling layers together with skip connections; unlike recent methods, Batch Normalization and other advanced layers have been omitted.
Note: the experiments show that a basic decoder comprising a 2× bilinear upsampling followed by two standard convolutional layers achieves excellent results.
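As a rough sketch (not the authors' exact code), one such up-sampling block could look like this in Keras; the filter count, the leaky-ReLU slope, and the function name are illustrative assumptions on my part:

```python
import tensorflow as tf
from tensorflow.keras import layers

def up_block(x, skip, filters):
    # 2x bilinear upsampling, then concatenate the encoder's skip feature map
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    x = layers.Concatenate()([x, skip])
    # two standard 3x3 convolutions (no Batch Normalization, per the paper)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return x
```

Stacking a few of these blocks on top of the DenseNet-169 features, each consuming the matching encoder activation as `skip`, is all the decoder needs.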
3.2. Loss function
They aim to create a loss function that strikes a balance between two critical aspects:
- first, minimizing the difference between the depth values ŷ predicted by the network and the ground-truth depth values y
- second, penalizing distortions in high-frequency details within the depth image, which often correspond to object boundaries in the scene
Thus, a weighted sum of three loss terms is adopted:
L(y, ŷ) = λ Ldepth(y, ŷ) + Lgrad(y, ŷ) + LSSIM(y, ŷ)
The first loss term, Ldepth, is the point-wise L1 loss defined on the depth values: it computes the absolute difference between the predicted and ground-truth depth at each point in the image, and the sum of these differences is normalized by the total number of points n.
The second loss term, Lgrad, is the L1 loss defined over the image gradient g of the depth image: it computes the absolute differences between the x and y components of the gradients of the predicted and ground-truth depth maps.
Lastly, LSSIM incorporates the Structural Similarity index (SSIM), a common metric for image reconstruction tasks that captures the structural and textural similarity between two images.
These three loss terms are combined into the overall loss function L(y, ŷ) as the weighted sum given above.
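A minimal TensorFlow sketch of this combined loss might look as follows; the λ value, the max-depth constant, and the rescaling of the SSIM term into [0, 1] are assumptions of mine, not verbatim from the paper:

```python
import tensorflow as tf

def total_loss(y_true, y_pred, lam=0.1, max_depth=10.0):
    # L_depth: point-wise L1 loss on the depth values
    l_depth = tf.reduce_mean(tf.abs(y_pred - y_true))
    # L_grad: L1 loss over the x and y image gradients of the depth maps
    dy_t, dx_t = tf.image.image_gradients(y_true)
    dy_p, dx_p = tf.image.image_gradients(y_pred)
    l_grad = tf.reduce_mean(tf.abs(dy_p - dy_t) + tf.abs(dx_p - dx_t))
    # L_SSIM: structural dissimilarity, rescaled into [0, 1]
    l_ssim = tf.reduce_mean(
        (1.0 - tf.image.ssim(y_true, y_pred, max_val=max_depth)) / 2.0)
    # weighted sum: L = lam * L_depth + L_grad + L_SSIM
    return lam * l_depth + l_grad + l_ssim
```

Note how the gradient and SSIM terms both push the network toward crisp object boundaries, while the L1 term keeps the absolute depths honest.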
Now, here is the remaining problem we need to fix with these loss terms: they can grow too large when we're dealing with large ground-truth depth values. To handle this, there's a smart trick: instead of using the depth values directly, we use their reciprocal. A new target depth map y is defined by taking the reciprocal of the original depth map Yorig, i.e. y = m/Yorig, where m is the maximum depth in the scene.
3.3. Augmentation Policy
They've only applied horizontal flipping to images (with a 0.5 probability), because other geometric transformations might not make sense in terms of depth interpretation. Vertical flipping, for instance, applied to an image of an indoor scene may not contribute to learning the expected statistical properties (e.g. the geometry of floors and ceilings).
While image rotation could help, they've avoided it since it can create invalid data.
In terms of color, swapping color channels such as red and green (with a 0.25 probability) enhances performance effectively.
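The policy above can be sketched in NumPy like this (the function name and RNG handling are my own; the 0.5 and 0.25 probabilities come from the paper):

```python
import numpy as np

def augment(image, depth, rng=None):
    rng = rng or np.random.default_rng()
    # horizontal flip with probability 0.5, applied to image and depth alike
    if rng.random() < 0.5:
        image, depth = image[:, ::-1], depth[:, ::-1]
    # swap color channels with probability 0.25 (image only; depth is unaffected)
    if rng.random() < 0.25:
        image = image[..., rng.permutation(3)]
    return image, depth
```

The key design point is that geometric changes must be mirrored in the depth map, while photometric changes must not touch it.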
4- Experiments
4.1. Datasets
4.1.1. NYU Depth v2: an indoor-scenes dataset of 640 × 480 images; it contains 120K training samples and 654 testing samples. The model is trained on a 50K subset with inpainted missing depth values, and predictions are at half resolution. For the testing phase, depths are predicted for the full test images and upscaled by 2× to evaluate against the ground truth.
4.1.2. KITTI: an outdoor-scenes dataset with 1241 × 376 stereo images and corresponding 3D laser scans. The model is trained on a 26K subset with inpainted missing depth values, and images are scaled to fit the encoder architecture.
For the testing phase, the output depth image is scaled and averaged with the prediction for its mirror image to produce the final output.
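The test-time mirror trick amounts to this (a sketch; `model` here stands for any function mapping an image to a depth map):

```python
import numpy as np

def predict_with_mirror(model, image):
    # average the normal prediction with the un-flipped prediction
    # obtained from the horizontally mirrored input
    pred = model(image)
    pred_mirror = model(image[:, ::-1])[:, ::-1]
    return 0.5 * (pred + pred_mirror)
```

This is a cheap form of test-time augmentation: two forward passes, and the averaging tends to smooth out flip-dependent errors.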
4.2. Implementation Details
- implemented using TensorFlow and trained on four NVIDIA TITAN Xp GPUs
- the encoder is based on DenseNet-169 pretrained on ImageNet, while decoder weights are randomly initialized
- the ADAM optimizer is employed with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999
- training batch size = 8
- training on the NYU Depth v2 dataset → 1 million iterations over 20 hours
- for the KITTI dataset → 300K iterations over 9 hours
4.3. Evaluation
4.3.1. Quantitative Evaluation: the method is quantitatively compared against the state of the art using six standard metrics
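The six metrics are not spelled out in this summary, but depth-estimation papers conventionally use the threshold accuracies δ < 1.25, 1.25², 1.25³, the absolute relative error, RMSE, and the log10 error; a sketch under that assumption:

```python
import numpy as np

def depth_metrics(gt, pred):
    # threshold accuracies: fraction of pixels whose ratio to the ground
    # truth stays below 1.25, 1.25^2 and 1.25^3
    thresh = np.maximum(gt / pred, pred / gt)
    d1, d2, d3 = [(thresh < 1.25 ** k).mean() for k in (1, 2, 3)]
    abs_rel = np.mean(np.abs(gt - pred) / gt)     # absolute relative error
    rmse = np.sqrt(np.mean((gt - pred) ** 2))     # root mean squared error
    log10 = np.mean(np.abs(np.log10(gt / pred)))  # mean log10 error
    return d1, d2, d3, abs_rel, rmse, log10
```

Higher is better for the δ accuracies; lower is better for the three error terms.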
4.3.2. Qualitative Results: three experiments are conducted to assess result quality on the NYU Depth v2 test set
- the first experiment uses a perception-based metric, measuring structural similarity (mSSIM) between ground truth and predicted depth maps
- the second experiment evaluates edge formation by comparing gradient magnitude images and computing F1 scores
- the third experiment calculates mean cosine distance between normal maps extracted from predicted and ground truth depth images
As can be seen, the proposed approach produces depth estimations of higher quality, where depth edges better match those of the ground truth, and with significantly fewer artifacts.
4.4. Comparing Performance
- the depth estimation networkās performance is compared to state-of-the-art on the NYU Depth v2 dataset (Tab. 1) and the KITTI dataset (Tab. 2).
- on NYU Depth v2, the model outperforms previous state-of-the-art in most metrics, using fewer parameters, training iterations, and data
- scaling predictions to account for scene scale errors aligns with state-of-the-art results on all metrics
- on NYU Depth v2, the method surpasses state-of-the-art based on defined quality metrics (Tab. 3), considering methods with pre-trained models and code
- on the KITTI dataset, the method ranks second best on the standard metrics; however, due to the nature of the sparse depth maps, the learning process doesn't converge well for such data
despite rankings, visual results (Fig. 3) demonstrate superior depth map quality compared to state-of-the-art
4.5. Ablation Studies
(analyzing the model's components and highlighting the effects of changes)
- encoder depth : using DenseNet-201 yields lower validation loss, but itās slow and memory-intensive
- decoder depth: halving the decoder's features reduces performance and creates instability
- color augmentation: disabling color channel swapping leads to a significant performance drop and to overfitting (color augmentation's impact on neural networks is a potential future research area)
Variations of the standard model's components were tested via validation loss on the NYU Depth v2 dataset (750K iterations):
- pre-trained model:
- compared effects of encoder initialization: random weights vs. pre-trained ImageNet
- a scratch-initialized encoder led to significantly higher validation loss (purple)
- skip connections:
- explored removing the skip connections between encoder and decoder layers
- the skip-less model performed worse, with higher validation loss (green)
- batch size:
- analyzed the impact of different batch sizes on performance
- validation losses for batch sizes 2 and 16 were compared to batch size 8 in the standard model (red and blue); batch size 8 exhibited the best performance while keeping training time reasonable
4.6. Generalizing to Other Datasets:
- Unreal-1K, a new dataset with photo-realistic indoor scenes and accurate ground-truth depths, has been introduced; it was sampled from Unreal Engine renderings
- they compared the NYU Depth v2-trained model to two other supervised methods trained on the same dataset; the aim is to assess how well models generalize across different data distributions
- quantitative and qualitative comparisons show the methodās superiority in terms of average errors and mSSIM.
5- Conclusion
Here is a revised set of takeaways extracted from this insightful paper:
- a Convolutional neural network for depth map estimation from single RGB images has been introduced in this paper
- leveraged recent advances in network architecture and pre-trained models
- the constructed encoder initialized with meaningful weights outperforms costly multistage depth networks
- achieved state-of-the-art performance on NYU Depth v2 and a new Unreal-1K dataset
- the main focus of this paper is on producing higher-quality depth maps that accurately capture object boundaries
- transfer learning will leverage future advances in backbone architecture designs to elevate the overall performance of the encoder-decoder network
- the proposed architecture can pave the way for embedding quality depth estimation in compact devices
And finally, I want to express my sincere gratitude to each and every one of you for taking the time to read my blog. Your engagement and interest are truly appreciated.
Thank you very much! ✨
References :
- High Quality Monocular Depth Estimation via Transfer Learning
Ibraheem Alhashim (KAUST) , Peter Wonka (KAUST)