Depth Reconstruction With Deep Neural Networks (Part 2)
Neural Networks
Roy Shilkrot, PhD
Module 5
Introduction
Depth Regression
● Direct Depth Regression
● Encoder Architectures
Direct Depth Regression
● End-to-end training
● Direct gradient flow
● Specialized features tuned to the end-goal application
● End-to-end data, instead of part-by-part data
● Straightforward implementation (see the sketch below)
[Figure: input → NN predictor → estimated disparity]
Problems:
● Harder to train
● Very deep network
● May require auxiliary supervision
● Requires more data
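To make "direct regression" concrete, here is a minimal sketch assuming PyTorch; the architecture, layer sizes, and names are illustrative, not taken from any cited paper:

```python
import torch
import torch.nn as nn

class DirectDepthRegressor(nn.Module):
    """Illustrative end-to-end regressor: image in, depth map out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),  # single depth channel
        )

    def forward(self, x):
        return self.net(x)

# Gradients flow directly from the depth loss back through every layer.
model = DirectDepthRegressor()
image = torch.randn(1, 3, 128, 416)   # dummy RGB input
target = torch.rand(1, 1, 128, 416)   # dummy ground-truth depth
loss = nn.functional.l1_loss(model(image), target)
loss.backward()                       # end-to-end gradient flow
```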
Direct Depth Regression
● Direct encoder (E)
● Encoder-decoder ("hourglass")
[Figure: encoder E → latent vector → decoder]
Garg et al [2016] use a CNN encoder to predict depth for one view, then use the
other view to supervise an auto-encoder to predict / refine the depth map.
More Encoder Architectures
Gan et al [2018] propose to use an affinity layer in the encoder to model relationships between pixels. They also start with a coarse prediction of depth using an encoder network.
Eigen et al [2014]
Results
[Figure: coarse prediction, fine prediction, and ground truth]
Garg et al [2016]
Results
Gan et al [2018]
Results
[Figure: qualitative comparison of input, ground truth, Eigen ‘14, Garg ‘16, Godard ‘16, and Gan ‘18]
Conclusion
Module 5
Module 6
● Encoder-Decoder Architectures
● Decoder Architectures
● Stacking
● Direct encoder (E)
● Encoder-decoder ("hourglass"; see the sketch below)
[Figure: encoder E → latent vector → decoder]
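A minimal sketch of the hourglass idea, assuming PyTorch; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Encoder compresses the image to a latent code; decoder expands it back."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                        # E
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                        # D
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        latent = self.encoder(x)     # low-resolution latent representation
        return self.decoder(latent)  # depth map at input resolution

depth = Hourglass()(torch.randn(1, 3, 64, 64))  # -> shape (1, 1, 64, 64)
```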
Decoder Architectures
Recall Garg et al [2016]: they use fully-convolutional layers with skip connections to decode the latent representation (and coarse depth map) into a refined depth map.
[Figure: latent vector → decoder]
Decoder Architectures
Fischer et al [2015] use a convolutional decoder to refine the coarse depth map to higher resolution.
Stacking Hourglass Networks
Stacking encoder-decoder networks is a common practice in recent deep vision work,
such as human pose estimation and semantic segmentation. In the context of depth
reconstruction, stacking networks acts as a refinement method.
[Figure: three stacked encoder-decoder (E → latent → D) blocks, refining coarse to fine]
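A hedged sketch of stacked refinement in PyTorch; the stage design (each stage sees the image plus the previous estimate) and all layer sizes are illustrative assumptions, not a published architecture:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One encoder-decoder (E-D) block; input = image + previous estimate."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class StackedHourglass(nn.Module):
    """Chain of E-D blocks; each refines the previous depth estimate."""
    def __init__(self, num_stacks=3):
        super().__init__()
        self.stages = nn.ModuleList([Stage() for _ in range(num_stacks)])

    def forward(self, image):
        b, _, h, w = image.shape
        depth = torch.zeros(b, 1, h, w, device=image.device)
        estimates = []
        for stage in self.stages:
            # coarse-to-fine: condition each stage on the previous estimate
            depth = stage(torch.cat([image, depth], dim=1))
            estimates.append(depth)  # intermediate outputs enable auxiliary losses
        return estimates

outputs = StackedHourglass()(torch.randn(1, 3, 64, 64))  # list of 3 estimates
```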
Stacked Hourglass Architectures
Ilg et al [2016] stack several copies of FlowNet (Fischer et al [2015]), each network refining the previous estimate.
[Figure: coarse estimate → stacked refinement]
Stacked Hourglass Architectures
Ummenhofer et al [2016] use a chain of encoder-decoder networks solving different tasks, with three main components: a bootstrap net, an iterative net, and a refinement net. The iterative net is applied recursively to successively refine the previous estimates. The last component is a single encoder-decoder network that generates the final upsampled and refined depth map.
Joint Task Learning
Depth estimation and many other visual image-understanding problems, such as segmentation, semantic labelling, and scene parsing, are strongly correlated and mutually beneficial.
Leveraging the complementary properties of these tasks, we may solve them jointly so that one boosts the performance of another.
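A minimal sketch of joint task learning, assuming PyTorch; the shared encoder, the two heads, and the class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared encoder with one head per task (depth + segmentation)."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        self.seg_head = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, x):
        features = self.encoder(x)  # features shared by both tasks
        return self.depth_head(features), self.seg_head(features)

net = MultiTaskNet()
depth, seg = net(torch.randn(1, 3, 64, 64))
# A joint loss (e.g., L1 on depth + cross-entropy on segmentation) lets each
# task regularize the other through the shared encoder.
```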
Results
Zhang et al [2018]
Results
[Figure: qualitative comparison of ground truth, Eigen ‘14, Xu ‘17, and Zhang ‘18]
Conclusion
Module 6
Module 7
● Training Loss Functions
● Datasets
● Evaluation Metrics
Training Depth Reconstruction Networks
[Figure: input → predictor → prediction, compared against ground truth]
Loss Function
Many loss functions for depth reconstruction are made from two terms: a data term and a regularization term, $\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda\,\mathcal{L}_{\text{reg}}$. Two common data terms, for predicted depth $\hat{d}_p$ and ground truth $d_p$ over $N$ pixels:
L2: $\mathcal{L}_2 = \frac{1}{N}\sum_p \left(\hat{d}_p - d_p\right)^2$
mAD: $\mathcal{L}_{\text{mAD}} = \frac{1}{N}\sum_p \left|\hat{d}_p - d_p\right|$
Data Term Losses
[Figure: L1, L2, and smooth-L1 loss curves]
Regularization Loss
● smoothness
● left-right consistency
● maximum depth
● scale-invariant gradient loss (see the sketch below)
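A minimal sketch of common data terms and a smoothness regularizer, assuming PyTorch; the function names, the weight `lam`, and the specific smoothness formulation are illustrative:

```python
import torch
import torch.nn.functional as F

def l2_loss(pred, gt):
    return ((pred - gt) ** 2).mean()      # L2 data term

def mad_loss(pred, gt):
    return (pred - gt).abs().mean()       # mean absolute difference (L1)

def smooth_l1(pred, gt):
    return F.smooth_l1_loss(pred, gt)     # quadratic near zero, linear elsewhere

def smoothness_loss(pred):
    """Regularizer penalizing depth gradients (one common formulation)."""
    dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs().mean()
    dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs().mean()
    return dx + dy

def total_loss(pred, gt, lam=0.1):
    # data term + weighted regularization term
    return smooth_l1(pred, gt) + lam * smoothness_loss(pred)
```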
Augmenting Natural Images
[Figure: original vs. augmented image]
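One common way to enlarge a natural-image training set is photometric jitter; a minimal torchvision sketch, where the chosen transforms and parameters are illustrative:

```python
import torchvision.transforms as T

# Photometric augmentations only; geometric ones (flips, crops) must be
# applied consistently to the image and its depth map, and horizontal flips
# swap the left/right roles in a stereo pair.
augment = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.RandomGrayscale(p=0.1),
    T.ToTensor(),
])
```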
Datasets:
● Cityscapes
● KITTI 2015
● MegaDepth
● NYU2
● SUN3D
● Monkaa
Evaluation Metrics
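Commonly reported metrics in the depth-estimation literature (popularized by Eigen et al [2014]) include absolute relative error, squared relative error, RMSE, and threshold accuracy. A minimal NumPy sketch; the function name and dict layout are illustrative:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-evaluation metrics over valid ground-truth pixels."""
    pred, gt = pred.flatten(), gt.flatten()
    valid = gt > 0                          # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = {f"delta{i}": np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse, **deltas)
```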
Conclusion
Overview of Methods
[Figure: three pipelines compared. Traditional: extract features → match → aggregate cost → estimated disparity. Hybrid: learned components inside the classical pipeline → estimated disparity. Direct regression: NN predictor → estimated disparity.]
Future Research