
Depth Reconstruction With Deep Neural Networks (Part 2)


Depth Reconstruction With Deep Neural Networks
Roy Shilkrot, PhD
Module 5
Introduction
Depth Regression
Direct Depth Regression
Encoder Architectures
Direct Depth Regression

Instead of trying to match features across images, directly regress disparity/depth from the input images or their learned features. There is no direct notion of descriptor matching; this is a learned, view-based representation for depth reconstruction from predefined viewpoints.

Goal: learn a predictor f that predicts a depth map D from an input I.

[Figure: NN predictor mapping the input to an estimated disparity]
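To make the goal concrete, here is a minimal PyTorch sketch of such a predictor f (the layer sizes and counts are illustrative assumptions, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class DirectDepthRegressor(nn.Module):
    """Minimal predictor f: image I -> dense disparity/depth map D."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),  # one output channel: disparity/depth
        )

    def forward(self, image):
        return self.net(image)

f = DirectDepthRegressor()
depth = f(torch.randn(1, 3, 128, 128))  # -> (1, 1, 128, 128)
```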
Direct Depth Regression
Advantages:

● End-to-end training
● Direct gradient flow
● Specialized features tuned to the end-goal application
● End-to-end data, instead of part-by-part data
● Straightforward implementation

Problems:

● Harder to train
● Very deep networks
● May require auxiliary supervision
● Requires more data
Direct Depth Regression

Two major network design approaches:

● Direct Encoder
● Encoder-Decoder (“hourglass”)

Usually implemented with convolutional layers, and a mixture of other layers, e.g. residual blocks, upsampling, and SPP.

[Figure: encoder (E) vs. encoder-decoder (E, latent, D), each producing an estimated disparity]
Direct Depth Encoder Architectures
An encoder that attempts to directly predict the disparity/depth map using a convolutional network. Often used in depth estimation from a single image. The output is in most cases a coarse, low-resolution depth map.
Direct Depth Encoder Architectures
Eigen et al’s [2014] breakthrough work is an end-to-end depth encoder: a CNN that ends with a 1x4096 feature vector, from which a fully connected layer predicts a 64x64 coarse depth map.

They also predict a refined disparity map from additional convolutions and the coarse prediction.
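A hedged sketch of this coarse-prediction design in PyTorch (the 1x4096 vector and 64x64 output follow the slide; the convolutional layers are illustrative assumptions, not Eigen et al’s exact network):

```python
import torch
import torch.nn as nn

class CoarseDepthEncoder(nn.Module):
    """Encoder ending in a 1x4096 vector, decoded to a 64x64 coarse depth map
    by a single fully connected layer (in the spirit of Eigen et al [2014])."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),            # -> 256 x 4 x 4 = 4096
        )
        self.fc = nn.Linear(4096, 64 * 64)      # regress the coarse map

    def forward(self, image):
        feat = self.features(image).flatten(1)  # (B, 4096)
        return self.fc(feat).view(-1, 1, 64, 64)
```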
More Encoder Architectures

Garg et al [2016] use a CNN encoder to predict depth for one view, then use the
other view to supervise an auto-encoder to predict / refine the depth map.
More Encoder Architectures
Gan et al [2018] propose to use an affinity layer in the encoder to model relationships between pixels. They also start with a coarse prediction of depth using an encoder network.

Affinity is the correlation between the absolute features of two image pixels. Since the absolute features represent the local appearance of image locations, such as edges and textures, the correlation operation can effectively model the appearance similarities between these pixels.
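A rough PyTorch sketch of the affinity idea: the correlation (dot product) between each pixel's feature vector and its neighbours in a small window. The window size and implementation details are assumptions, not Gan et al's exact layer:

```python
import torch
import torch.nn.functional as F

def local_affinity(feat, radius=1):
    """Dot-product correlation between each pixel's feature vector and its
    neighbours in a (2r+1)x(2r+1) window; one affinity channel per offset."""
    B, C, H, W = feat.shape
    pad = F.pad(feat, [radius] * 4)  # pad W and H by the window radius
    affinities = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[:, :, dy:dy + H, dx:dx + W]
            affinities.append((feat * shifted).sum(dim=1, keepdim=True))
    return torch.cat(affinities, dim=1)  # (B, (2r+1)^2, H, W)
```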
More Encoder Architectures
Laina et al [2016] modify the ResNet50 architecture to remove fully connected layers.

[Figure: ResNet50-based encoder]

Results: Eigen et al [2014]
[Figure: coarse prediction, fine prediction, and ground truth]
Results: Garg et al [2016]
[Figure: qualitative depth predictions]
Results: Gan et al [2018]
[Figure: input, ground truth, and predictions from Eigen ’14, Garg ’16, Godard ’16, and Gan ’18]
Conclusion
Module 5
Module 6
Encoder-Decoder Architectures
Decoder Architectures
Stacking
Joint Task Learning
Direct Depth Regression

Two major network design approaches:

● Direct Encoder
● Encoder-Decoder (“hourglass”)

We’ve seen typical Encoders with very minimal Decoders.

[Figure: encoder-decoder (E, latent, D) producing an estimated disparity]
Encoder-Decoder Strategy
The encoder-decoder strategy is widely used throughout computer vision applications for deep networks. Such networks learn a “bottleneck” latent representation of the input, and use unpooling/deconvolutions to unpack the compact representation back to a full-scale output. They also rely on skip connections to give the decoding layers hints from similar-scale encoding layers.
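A minimal PyTorch sketch of the hourglass pattern with a single skip connection (depths and channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Encoder-decoder with one skip connection: the encoder compresses the
    input to a bottleneck latent, the decoder unpacks it back to full
    resolution, and the skip gives the decoder a same-scale hint."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                # same-scale encoder features
        latent = self.down(skip)          # bottleneck latent
        decoded = self.up(latent)         # deconvolve back to full scale
        return self.head(decoded + skip)  # skip-connection hint
```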
Decoder Architectures
Recall Laina et al [2016]: they use up-convolutions to decode the latent space. Their intuition is that in unpooling, 75% of the resulting feature maps contain zeros, so the following convolution mostly operates on zeros, which can be avoided. They therefore propose a convolutional layer that avoids zero multiplications and is faster.

[Figure: latent vector followed by the up-convolutional decoder]
Decoder Architectures
Recall Garg et al [2016]: they use fully-convolutional layers with skip connections to decode the latent representation (and coarse depth map) into a refined depth map.

[Figure: decoder operating on the latent vector]
Decoder Architectures
Fischer et al [2015] use a convolutional decoder to refine the coarse depth map to a higher resolution.

[Figure: convolutional decoder]
Stacking Hourglass Networks
Stacking encoder-decoder networks is a common practice in recent deep vision work,
such as human pose estimation and semantic segmentation. In the context of depth
reconstruction, stacking networks acts as a refinement method.

[Figure: three stacked encoder-decoder (E-D) blocks with latent bottlenecks, refining the estimate from coarse to fine]
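A sketch of the stacking pattern, reusing the Hourglass module from the earlier sketch (real stacks, e.g. in pose estimation, also forward intermediate predictions and features between stages; this outline omits that):

```python
import torch.nn as nn

class StackedHourglass(nn.Module):
    """Chain of hourglass modules; each stage re-estimates the map, and the
    intermediate outputs can be supervised as auxiliary (coarse) targets."""
    def __init__(self, num_stacks=3):
        super().__init__()
        self.stages = nn.ModuleList(Hourglass() for _ in range(num_stacks))

    def forward(self, x):
        predictions = []
        for stage in self.stages:
            predictions.append(stage(x))  # supervise every stage if desired
        return predictions                # last element is the finest estimate
```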
Stacked Hourglass Architectures
Ilg et al [2016] used FlowNet (Fischer et al [2015]) a number of times, stacked.

[Figure: stacked networks going from a coarse estimate to a refinement]
Stacked Hourglass Architectures
Ummenhofer et al [2016] use a chain of encoder-decoder networks solving different tasks, with three main components: a bootstrap net, an iterative net, and a refinement net. The iterative net is applied recursively to successively refine the previous estimates. The last component is a single encoder-decoder network that generates the final upsampled and refined depth map.

[Figure: coarse estimates of optical flow, depth, and camera motion, followed by refinement]
Joint Task Learning
Depth estimation and many other visual image understanding problems, such as segmentation, semantic labelling, and scene parsing, are strongly correlated and mutually beneficial.

Leveraging the complementary properties of these tasks, we may solve them jointly so that one boosts the performance of another.

[Figure: a single NN predictor producing disparity, semantic segmentation, and scene parsing outputs]
Joint Task Architectures
Eigen et al [2014] predict a depth map as well as surface normals and semantic labels (e.g. “floor”, “structure”, “furniture” and “props”).

Combining depth and normals is a mutually beneficial task, since normals are dependent on local structure derived from depth.

The semantic labels (e.g. “floor” and “wall”) are also heavily influenced by the depth and normals.
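A minimal sketch of such a joint-task network in PyTorch: a shared backbone with separate heads for depth, normals, and semantic labels. The backbone and class count are assumptions for illustration, not Eigen et al's architecture:

```python
import torch
import torch.nn as nn

class JointTaskHeads(nn.Module):
    """Shared encoder with three task heads (depth, surface normals,
    semantic labels); each task's loss back-propagates into the backbone."""
    def __init__(self, num_classes=4):  # e.g. floor/structure/furniture/props
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        self.normals_head = nn.Conv2d(64, 3, 3, padding=1)        # (nx, ny, nz)
        self.labels_head = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, image):
        feat = self.backbone(image)
        return self.depth_head(feat), self.normals_head(feat), self.labels_head(feat)
```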
Joint Task Architectures
Zhang et al [2018] use an encoder-decoder network to predict both depth and semantic segmentation. Semantic segmentation and depth estimation results have many common patterns, e.g., they can both reveal object edges, boundaries, or layouts.
Results: Ummenhofer et al [2016]
[Figure: qualitative results]

Results: Zhang et al [2018]
[Figure: ground truth vs. Eigen ’14, Xu ’17, and Zhang ’18]
Conclusion
Module 6
Module 7
Training and Datasets
Loss Functions
Data Term Losses
Regularization Term Losses
Data and Augmentation
Datasets
Evaluation Metrics
Training Depth Reconstruction Networks

Optimization problem (solved with a gradient descent solver):

θ* = argmin_θ Σ_i L(f_θ(I_i), D_i)

where I_i is the input, f_θ the predictor, f_θ(I_i) the prediction, and D_i the ground truth.
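A minimal sketch of one gradient-descent step on this objective, reusing the DirectDepthRegressor sketch from Module 5 and an L1 data term (both are illustrative choices, not a prescribed recipe):

```python
import torch

model = DirectDepthRegressor()  # the predictor f_theta (sketch above)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(image, gt_depth):
    optimizer.zero_grad()
    pred = model(image)                                  # prediction f_theta(I)
    loss = torch.nn.functional.l1_loss(pred, gt_depth)   # any loss from below
    loss.backward()                                      # direct gradient flow
    optimizer.step()
    return loss.item()
```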
Loss Function

Many loss functions for depth reconstruction are made from two terms, a data term and a regularization term:

L(D_pred, D_gt) = L_data(D_pred, D_gt) + λ · L_reg(D_pred)

where D_pred is the prediction and D_gt the ground truth.
Data Term Losses

The data term measures the error between the ground truth and the estimated depth, summed over pixels p:

L2: L_data = Σ_p (D_pred(p) − D_gt(p))²

mAD (mean absolute difference): L_data = (1/N) Σ_p |D_pred(p) − D_gt(p)|
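These data terms are a few lines each in PyTorch; a sketch (smooth L1 is included here since the following slide compares it to L1 and L2):

```python
import torch

def l2_loss(pred, gt):
    """L2 data term: mean squared difference over all pixels."""
    return ((pred - gt) ** 2).mean()

def mad_loss(pred, gt):
    """mAD / L1 data term: mean absolute difference over all pixels."""
    return (pred - gt).abs().mean()

def smooth_l1_loss(pred, gt, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large residuals."""
    diff = (pred - gt).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()
```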
Data Term Losses

[Figure: L1, L2, and smooth L1 loss curves over the prediction/ground-truth residual; Zhou et al 2017]
Data Term Losses

Obtaining 3D ground truth data is very expensive. Some techniques use a reprojection error that allows for unsupervised learning.

If the estimated disparity/depth map is as close as possible to the ground truth, then the discrepancy between the reference image and the other image reprojected using the estimated depth map is also minimized.
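A hedged PyTorch sketch of this reprojection loss for a rectified stereo pair: warp the right image into the left view with the estimated disparity, then compare it to the reference image. Normalization details and the L1 photometric term are assumptions:

```python
import torch
import torch.nn.functional as F

def reprojection_loss(left, right, disparity):
    """Photometric reprojection error: no depth ground truth needed.
    left/right: (B, C, H, W) images, disparity: (B, 1, H, W) in pixels."""
    B, _, H, W = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1).clone()
    # Shift x-coordinates by the disparity, converted to normalized units.
    grid[..., 0] = grid[..., 0] - 2.0 * disparity.squeeze(1) / W
    warped = F.grid_sample(right, grid, align_corners=True)
    return (left - warped).abs().mean()
```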
Regularization Losses

We can make assumptions about the disparity/depth map and incorporate them into the regularization term. Examples of constraints include:

● smoothness,
● left-right consistency,
● maximum depth,
● scale-invariant gradient loss.
Regularization Loss

Smoothness can be measured using the magnitude of the first- or second-order gradient of the estimated disparity/depth map:

1st order: L_reg = Σ_p (|∂x D_pred(p)| + |∂y D_pred(p)|)    2nd order: the same with second derivatives.
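A first-order smoothness term is a few lines in PyTorch; a sketch, with the combined data-plus-regularization loss shown as a comment (the 0.1 weight is an arbitrary assumption):

```python
def smoothness_loss(depth):
    """First-order smoothness: mean magnitude of the horizontal and vertical
    gradients of the estimated disparity/depth map (B, 1, H, W)."""
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    return dx.mean() + dy.mean()

# Total loss = data term + weighted regularization term, e.g.:
# loss = mad_loss(pred, gt) + 0.1 * smoothness_loss(pred)
```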


Training Data
Annotated data in the form of natural images and their corresponding depth maps is very challenging to obtain.

Most stereo reconstruction algorithms require pairs of stereo images (real or synthesized), captured with calibrated cameras, and their corresponding disparity/depth information as ground truth.

The disparity/depth information can be in the form of maps at the same or a lower resolution than the input images.

Some works overcome the need for ground-truth depth information by training their deep networks without 3D supervision.
Data Augmentation
To augment training datasets, one can apply to the existing datasets some geometric and photometric transformations, e.g., translation, rotation, and scaling, as well as additive Gaussian noise and changes in brightness, contrast, gamma, and color.

Although some transformations are similarity-preserving, they still enrich the datasets.

One advantage of this approach is that it reduces the network’s generalization error.
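A sketch of such an augmentation in PyTorch. The parameter ranges are arbitrary assumptions; note that geometric transforms must be applied consistently to the image and its depth map, and flipping a disparity map between stereo views needs extra care beyond this sketch:

```python
import torch

def augment(image, depth):
    """Photometric + geometric augmentation for an (image, depth) pair.
    image is assumed to be a float tensor in [0, 1]."""
    # Photometric: brightness, contrast, gamma, additive Gaussian noise.
    image = image * torch.empty(1).uniform_(0.8, 1.2)                               # brightness
    image = (image - image.mean()) * torch.empty(1).uniform_(0.8, 1.2) + image.mean()  # contrast
    image = image.clamp(0, 1) ** torch.empty(1).uniform_(0.8, 1.2)                  # gamma
    image = image + 0.02 * torch.randn_like(image)                                  # noise
    # Geometric: random horizontal flip, applied to both tensors.
    if torch.rand(1) < 0.5:
        image, depth = image.flip(-1), depth.flip(-1)
    return image.clamp(0, 1), depth
```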
Using Synthesized 3D Models

One approach to generate image-depth annotations is to synthetically render 2D and 2.5D views from 3D CAD models, from various (random) viewpoints, poses, and lighting conditions.

The models can also be overlaid with random textures.
Augmenting Natural Images
Another approach is to synthesize training data by overlaying images rendered from large 3D model collections on top of real images, producing natural image / 3D shape or scene pairs.

[Figure: original vs. augmented image]
Datasets
[Figure: sample frames from Cityscapes, MegaDepth, KITTI 2015, Flying Things, MPI Sintel, NYU2, SUN3D, and Monkaa]
Evaluation Metrics

The most commonly used quantitative metrics for evaluating the performance of depth estimation algorithms include:

● Geometric error
● Percentage of erroneous pixels
● L1 and L2 relative difference
● RMSE and logRMSE
● “Bad Pixel” (D1): ratio of disparity errors above a threshold [KITTI]
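Sketches of three of these metrics in PyTorch (the 3-pixel D1 threshold follows KITTI's convention; KITTI additionally requires the error to exceed 5% of the true disparity, which is omitted here for brevity):

```python
import torch

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    return ((pred - gt) ** 2).mean().sqrt()

def log_rmse(pred, gt, eps=1e-6):
    """RMSE in log-depth space; eps guards against log(0)."""
    return ((pred.clamp_min(eps).log() - gt.clamp_min(eps).log()) ** 2).mean().sqrt()

def bad_pixel_ratio(pred_disp, gt_disp, threshold=3.0):
    """D1-style metric: fraction of pixels whose disparity error exceeds a threshold."""
    return ((pred_disp - gt_disp).abs() > threshold).float().mean()
```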
Conclusion
Module 7
Module 8
Conclusions and Future
Future Research Directions
Conclusion
Overview of Methods
[Figure: the three families of methods: traditional (feature extraction, cost aggregation, matching), hybrid, and direct regression with an NN predictor, each producing an estimated disparity]
Future Research

Input: Most techniques cannot handle high-resolution input, or require calibrated images.

This is mainly due to computation and memory requirements, and current hardware limitations.

Developing lighter deep architectures remains desirable, especially for mobile platforms.
Future Research

Accuracy: Although refinement modules can improve the resolution of the estimated depth maps, it is still small compared to the resolution of the input images. As such, deep learning techniques find it difficult to recover small details, e.g., vegetation and hair.

Also, most of the techniques discretize the depth range. Although some methods can achieve sub-pixel accuracy, changing the depth range or the discretization frequency requires retraining the networks.

Another issue is that accuracy, in general, varies across depth ranges. Some recent works, e.g., [74], have tried to address this problem, but it remains open and challenging since it is highly related to the data-bias issue and to the type of loss functions used to train the network. Accuracy of existing methods is also affected by complex scenarios, e.g., occlusions, highly cluttered scenes, and objects with complex material properties.
Future Research
Performance: Complex deep networks are very expensive in terms of memory requirements.

Memory footprint is a major issue when dealing with high-resolution images and when aiming to reconstruct high-resolution depth maps.

While this can be mitigated by using multi-scale and part-based reconstruction techniques, it can result in high computation time.
Future Research
Training: Deep learning techniques rely heavily on the availability of training datasets annotated with ground-truth labels. Obtaining ground-truth labels for depth reconstruction is very expensive.

Existing techniques mitigate this problem either by designing loss functions that do not require 3D annotations, or by using domain adaptation and transfer learning strategies.

Domain adaptation techniques have recently been attracting more attention since, with these techniques, one can train with synthetic data, which is easier to obtain than real-world data.
Future Research
Data bias and generalization: Most of the recent deep learning-based depth reconstruction techniques have been trained and tested on publicly available benchmarks.

While this gives an indication of their performance, it is not yet clear how they generalize to completely unseen images from a completely different category.

We expect to see the emergence of large datasets, similar to ImageNet but for 3D reconstruction.
Final Remarks
We have seen many aspects of depth reconstruction using deep learning techniques.

These techniques are achieving acceptable results, and some recent developments even compete, in terms of accuracy, with traditional techniques.

Since 2014, more than 100 papers on the topic have been published in major computer vision and machine learning conferences and journals.

We have entered a new era where data-driven and machine learning techniques play a central role in image-based depth reconstruction.
