
Depth Reconstruction With Deep Neural Networks (Part 2)


Depth Reconstruction With Deep Neural Networks
Roy Shilkrot, PhD
Module 5
Introduction
Depth Regression
Direct Depth Regression
Encoder Architectures
Direct Depth Regression

Instead of trying to match features across images, directly regress disparity/depth from the input images or their learned features. There is no direct notion of descriptor matching; this is a learned, view-based representation for depth reconstruction from predefined viewpoints.

Goal: learn a predictor f that predicts a depth map D from an input I.

[Figure: NN predictor mapping the input to an estimated disparity]
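To make the goal concrete, here is a minimal PyTorch sketch of such a predictor f (the layer sizes and counts are illustrative assumptions, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class DirectDepthRegressor(nn.Module):
    """Minimal predictor f: image I -> dense disparity/depth map D."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),  # one output channel: disparity/depth
        )

    def forward(self, image):
        return self.net(image)

f = DirectDepthRegressor()
depth = f(torch.randn(1, 3, 128, 128))  # -> (1, 1, 128, 128)
```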
Direct Depth Regression
Advantages:

● End-to-end training
● Direct gradient flow
● Specialized features tuned to the end-goal application
● End-to-end data, instead of part-by-part data
● Straightforward implementation

Problems:

● Harder to train
● Very deep networks
● May require auxiliary supervision
● Requires more data
Direct Depth Regression

Two major network design approaches:

● Direct Encoder
● Encoder-Decoder (“hourglass”)

Usually implemented with convolutional layers, and a mixture of other layers, e.g. residual blocks, upsampling, and SPP.

[Figure: encoder (E) vs. encoder-decoder (E, latent, D), each producing an estimated disparity]
Direct Depth Encoder Architectures
An encoder that attempts to directly predict the disparity/depth map using a convolutional network. Often used in depth estimation from a single image. The output is in most cases a coarse, low-resolution depth map.
Direct Depth Encoder Architectures
Eigen et al’s [2014] breakthrough work is an end-to-end depth encoder: a CNN that ends with a 1x4096 feature vector, from which a fully connected layer predicts a 64x64 coarse depth map.

They also predict a refined disparity map from additional convolutions and the coarse prediction.
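A hedged sketch of this coarse-prediction design in PyTorch (the 1x4096 vector and 64x64 output follow the slide; the convolutional layers are illustrative assumptions, not Eigen et al’s exact network):

```python
import torch
import torch.nn as nn

class CoarseDepthEncoder(nn.Module):
    """Encoder ending in a 1x4096 vector, decoded to a 64x64 coarse depth map
    by a single fully connected layer (in the spirit of Eigen et al [2014])."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(4),            # -> 256 x 4 x 4 = 4096
        )
        self.fc = nn.Linear(4096, 64 * 64)      # regress the coarse map

    def forward(self, image):
        feat = self.features(image).flatten(1)  # (B, 4096)
        return self.fc(feat).view(-1, 1, 64, 64)
```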
More Encoder Architectures

Garg et al [2016] use a CNN encoder to predict depth for one view, then use the
other view to supervise an auto-encoder to predict / refine the depth map.
More Encoder Architectures
Gan et al [2018] propose to use an affinity layer in the encoder to model relationships between pixels. They also start with a coarse prediction of depth using an encoder network.

Affinity is the correlation between the absolute features of two image pixels. Since the absolute features represent the local appearance of image locations, such as edges and textures, the correlation operation can effectively model the appearance similarities between these pixels.
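A rough PyTorch sketch of the affinity idea: the correlation (dot product) between each pixel's feature vector and its neighbours in a small window. The window size and implementation details are assumptions, not Gan et al's exact layer:

```python
import torch
import torch.nn.functional as F

def local_affinity(feat, radius=1):
    """Dot-product correlation between each pixel's feature vector and its
    neighbours in a (2r+1)x(2r+1) window; one affinity channel per offset."""
    B, C, H, W = feat.shape
    pad = F.pad(feat, [radius] * 4)  # pad W and H by the window radius
    affinities = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = pad[:, :, dy:dy + H, dx:dx + W]
            affinities.append((feat * shifted).sum(dim=1, keepdim=True))
    return torch.cat(affinities, dim=1)  # (B, (2r+1)^2, H, W)
```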
More Encoder Architectures
Laina et al [2016] modify the ResNet50 architecture to remove fully connected layers.

[Figure: ResNet50-based encoder]

Results: Eigen et al [2014]
[Figure: coarse prediction, fine prediction, and ground truth]
Results: Garg et al [2016]
[Figure: qualitative depth predictions]
Results: Gan et al [2018]
[Figure: input, ground truth, and predictions from Eigen ’14, Garg ’16, Godard ’16, and Gan ’18]
Conclusion
Module 5
Module 6
Encoder-Decoder Architectures
Decoder Architectures
Stacking
Joint Task Learning
Direct Depth Regression

Two major network design approaches:

● Direct Encoder
● Encoder-Decoder (“hourglass”)

We’ve seen typical Encoders with very minimal Decoders.

[Figure: encoder-decoder (E, latent, D) producing an estimated disparity]
Encoder-Decoder Strategy
The encoder-decoder strategy is widely used throughout computer vision applications for deep networks. Such networks learn a “bottleneck” latent representation of the input, and use unpooling/deconvolutions to unpack the compact representation back to a full-scale output. They also rely on skip connections to give the decoding layers hints from similar-scale encoding layers.
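A minimal PyTorch sketch of the hourglass pattern with a single skip connection (depths and channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Hourglass(nn.Module):
    """Encoder-decoder with one skip connection: the encoder compresses the
    input to a bottleneck latent, the decoder unpacks it back to full
    resolution, and the skip gives the decoder a same-scale hint."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                # same-scale encoder features
        latent = self.down(skip)          # bottleneck latent
        decoded = self.up(latent)         # deconvolve back to full scale
        return self.head(decoded + skip)  # skip-connection hint
```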
Decoder Architectures
Recall Laina et al [2016]: they use up-convolutions to decode the latent space. Their intuition is that in unpooling, 75% of the resulting feature maps contain zeros, so the following convolution mostly operates on zeros, which can be avoided. They therefore propose a convolutional layer that avoids zero multiplications and is faster.

[Figure: latent vector followed by the up-convolutional decoder]
Decoder Architectures
Recall Garg et al [2016]: they use fully-convolutional layers with skip connections to decode the latent representation (and coarse depth map) into a refined depth map.

[Figure: decoder operating on the latent vector]
Decoder Architectures
Fischer et al [2015] use a convolutional decoder to refine the coarse depth map to a higher resolution.

[Figure: convolutional decoder]
Stacking Hourglass Networks
Stacking encoder-decoder networks is a common practice in recent deep vision work,
such as human pose estimation and semantic segmentation. In the context of depth
reconstruction, stacking networks acts as a refinement method.

[Figure: three stacked encoder-decoder (E-D) blocks with latent bottlenecks, refining the estimate from coarse to fine]
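A sketch of the stacking pattern, reusing the Hourglass module from the earlier sketch (real stacks, e.g. in pose estimation, also forward intermediate predictions and features between stages; this outline omits that):

```python
import torch.nn as nn

class StackedHourglass(nn.Module):
    """Chain of hourglass modules; each stage re-estimates the map, and the
    intermediate outputs can be supervised as auxiliary (coarse) targets."""
    def __init__(self, num_stacks=3):
        super().__init__()
        self.stages = nn.ModuleList(Hourglass() for _ in range(num_stacks))

    def forward(self, x):
        predictions = []
        for stage in self.stages:
            predictions.append(stage(x))  # supervise every stage if desired
        return predictions                # last element is the finest estimate
```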
Stacked Hourglass Architectures
Ilg et al [2016] used FlowNet (Fischer et al [2015]) a number of times, stacked.

[Figure: stacked networks going from a coarse estimate to a refinement]
Stacked Hourglass Architectures
Ummenhofer et al [2016] use a chain of encoder-decoder networks solving different tasks, with three main components: a bootstrap net, an iterative net, and a refinement net. The iterative net is applied recursively to successively refine the previous estimates. The last component is a single encoder-decoder network that generates the final upsampled and refined depth map.

[Figure: coarse estimates of optical flow, depth, and camera motion, followed by refinement]
Joint Task Learning
Depth estimation and many other visual image understanding problems, such as segmentation, semantic labelling, and scene parsing, are strongly correlated and mutually beneficial.

Leveraging the complementary properties of these tasks, we may solve them jointly so that one boosts the performance of another.

[Figure: a single NN predictor producing disparity, semantic segmentation, and scene parsing outputs]
Joint Task Architectures
Eigen et al [2014] predict a depth map as well as surface normals and semantic labels (e.g. “floor”, “structure”, “furniture” and “props”).

Combining depth and normals is a mutually beneficial task, since normals are dependent on local structure derived from depth.

The semantic labels (e.g. “floor” and “wall”) are also heavily influenced by the depth and normals.
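A minimal sketch of such a joint-task network in PyTorch: a shared backbone with separate heads for depth, normals, and semantic labels. The backbone and class count are assumptions for illustration, not Eigen et al's architecture:

```python
import torch
import torch.nn as nn

class JointTaskHeads(nn.Module):
    """Shared encoder with three task heads (depth, surface normals,
    semantic labels); each task's loss back-propagates into the backbone."""
    def __init__(self, num_classes=4):  # e.g. floor/structure/furniture/props
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)
        self.normals_head = nn.Conv2d(64, 3, 3, padding=1)        # (nx, ny, nz)
        self.labels_head = nn.Conv2d(64, num_classes, 3, padding=1)

    def forward(self, image):
        feat = self.backbone(image)
        return self.depth_head(feat), self.normals_head(feat), self.labels_head(feat)
```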
Joint Task Architectures
Zhang et al [2018] use an encoder-decoder network to predict both depth and semantic segmentation. Semantic segmentation and depth estimation results have many common patterns, e.g., they can both reveal object edges, boundaries, or layouts.
Results: Ummenhofer et al [2016]
[Figure: qualitative results]

Results: Zhang et al [2018]
[Figure: ground truth vs. Eigen ’14, Xu ’17, and Zhang ’18]
Conclusion
Module 6
Module 7
Training and Datasets
Loss Functions
Data Term Losses
Regularization Term Losses
Data and Augmentation
Datasets
Evaluation Metrics
Training Depth Reconstruction Networks

Optimization problem (solved with a gradient descent solver):

θ* = argmin_θ Σ_i L(f_θ(I_i), D_i)

where I_i is the input, f_θ the predictor, f_θ(I_i) the prediction, and D_i the ground truth.
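A minimal sketch of one gradient-descent step on this objective, reusing the DirectDepthRegressor sketch from Module 5 and an L1 data term (both are illustrative choices, not a prescribed recipe):

```python
import torch

model = DirectDepthRegressor()  # the predictor f_theta (sketch above)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(image, gt_depth):
    optimizer.zero_grad()
    pred = model(image)                                  # prediction f_theta(I)
    loss = torch.nn.functional.l1_loss(pred, gt_depth)   # any loss from below
    loss.backward()                                      # direct gradient flow
    optimizer.step()
    return loss.item()
```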
Loss Function

Many loss functions for depth reconstruction are made from two terms, a data term and a regularization term:

L(D_pred, D_gt) = L_data(D_pred, D_gt) + λ · L_reg(D_pred)

where D_pred is the prediction and D_gt the ground truth.
Data Term Losses

The data term measures the error between the ground truth and the estimated depth, summed over pixels p:

L2: L_data = Σ_p (D_pred(p) − D_gt(p))²

mAD (mean absolute difference): L_data = (1/N) Σ_p |D_pred(p) − D_gt(p)|
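These data terms are a few lines each in PyTorch; a sketch (smooth L1 is included here since the following slide compares it to L1 and L2):

```python
import torch

def l2_loss(pred, gt):
    """L2 data term: mean squared difference over all pixels."""
    return ((pred - gt) ** 2).mean()

def mad_loss(pred, gt):
    """mAD / L1 data term: mean absolute difference over all pixels."""
    return (pred - gt).abs().mean()

def smooth_l1_loss(pred, gt, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large residuals."""
    diff = (pred - gt).abs()
    return torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()
```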
Data Term Losses

[Figure: L1, L2, and smooth L1 loss curves over the prediction/ground-truth residual; Zhou et al 2017]
Data Term Losses

Obtaining 3D ground truth data is very expensive. Some techniques use a reprojection error that allows for unsupervised learning.

If the estimated disparity/depth map is as close as possible to the ground truth, then the discrepancy between the reference image and the other image reprojected using the estimated depth map is also minimized.
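A hedged PyTorch sketch of this reprojection loss for a rectified stereo pair: warp the right image into the left view with the estimated disparity, then compare it to the reference image. Normalization details and the L1 photometric term are assumptions:

```python
import torch
import torch.nn.functional as F

def reprojection_loss(left, right, disparity):
    """Photometric reprojection error: no depth ground truth needed.
    left/right: (B, C, H, W) images, disparity: (B, 1, H, W) in pixels."""
    B, _, H, W = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1).clone()
    # Shift x-coordinates by the disparity, converted to normalized units.
    grid[..., 0] = grid[..., 0] - 2.0 * disparity.squeeze(1) / W
    warped = F.grid_sample(right, grid, align_corners=True)
    return (left - warped).abs().mean()
```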
Regularization Losses

We can make assumptions about the disparity/depth map and incorporate them into the regularization term. Examples of constraints include:

● smoothness,
● left-right consistency,
● maximum depth,
● scale-invariant gradient loss.
Regularization Loss

Smoothness can be measured using the magnitude of the first- or second-order gradient of the estimated disparity/depth map:

1st order: L_reg = Σ_p (|∂x D_pred(p)| + |∂y D_pred(p)|)    2nd order: the same with second derivatives.
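A first-order smoothness term is a few lines in PyTorch; a sketch, with the combined data-plus-regularization loss shown as a comment (the 0.1 weight is an arbitrary assumption):

```python
def smoothness_loss(depth):
    """First-order smoothness: mean magnitude of the horizontal and vertical
    gradients of the estimated disparity/depth map (B, 1, H, W)."""
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    return dx.mean() + dy.mean()

# Total loss = data term + weighted regularization term, e.g.:
# loss = mad_loss(pred, gt) + 0.1 * smoothness_loss(pred)
```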


Training Data
Annotated data in the form of natural images and their corresponding depth maps is very challenging to obtain.

Most stereo reconstruction algorithms require pairs of stereo images (real or synthesized), captured with calibrated cameras, and their corresponding disparity/depth information as ground truth.

The disparity/depth information can be in the form of maps at the same or a lower resolution than the input images.

Some works overcome the need for ground-truth depth information by training their deep networks without 3D supervision.
Data Augmentation
To augment training datasets, one can apply to the existing datasets some geometric and photometric transformations, e.g., translation, rotation, and scaling, as well as additive Gaussian noise and changes in brightness, contrast, gamma, and color.

Although some transformations are similarity-preserving, they still enrich the datasets.

One advantage of this approach is that it reduces the network’s generalization error.
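A sketch of such an augmentation in PyTorch. The parameter ranges are arbitrary assumptions; note that geometric transforms must be applied consistently to the image and its depth map, and flipping a disparity map between stereo views needs extra care beyond this sketch:

```python
import torch

def augment(image, depth):
    """Photometric + geometric augmentation for an (image, depth) pair.
    image is assumed to be a float tensor in [0, 1]."""
    # Photometric: brightness, contrast, gamma, additive Gaussian noise.
    image = image * torch.empty(1).uniform_(0.8, 1.2)                               # brightness
    image = (image - image.mean()) * torch.empty(1).uniform_(0.8, 1.2) + image.mean()  # contrast
    image = image.clamp(0, 1) ** torch.empty(1).uniform_(0.8, 1.2)                  # gamma
    image = image + 0.02 * torch.randn_like(image)                                  # noise
    # Geometric: random horizontal flip, applied to both tensors.
    if torch.rand(1) < 0.5:
        image, depth = image.flip(-1), depth.flip(-1)
    return image.clamp(0, 1), depth
```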
Using Synthesized 3D Models

One approach to generate image-depth annotations is to synthetically render 2D and 2.5D views from 3D CAD models, from various (random) viewpoints, poses, and lighting conditions.

The models can also be overlaid with random textures.
Augmenting Natural Images
Another approach is to synthesize training data by overlaying images rendered from large 3D model collections on top of real images, producing natural image / 3D shape or scene pairs.

[Figure: original vs. augmented image]
Datasets
[Figure: sample frames from Cityscapes, MegaDepth, KITTI 2015, Flying Things, MPI Sintel, NYU2, SUN3D, and Monkaa]
Evaluation Metrics

The most commonly used quantitative metrics for evaluating the performance of depth estimation algorithms include:

● Geometric error
● Percentage of erroneous pixels
● L1 and L2 relative difference
● RMSE and logRMSE
● “Bad Pixel” (D1): ratio of disparity errors above a threshold [KITTI]
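Sketches of three of these metrics in PyTorch (the 3-pixel D1 threshold follows KITTI's convention; KITTI additionally requires the error to exceed 5% of the true disparity, which is omitted here for brevity):

```python
import torch

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    return ((pred - gt) ** 2).mean().sqrt()

def log_rmse(pred, gt, eps=1e-6):
    """RMSE in log-depth space; eps guards against log(0)."""
    return ((pred.clamp_min(eps).log() - gt.clamp_min(eps).log()) ** 2).mean().sqrt()

def bad_pixel_ratio(pred_disp, gt_disp, threshold=3.0):
    """D1-style metric: fraction of pixels whose disparity error exceeds a threshold."""
    return ((pred_disp - gt_disp).abs() > threshold).float().mean()
```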
Conclusion
Module 7
Module 8
Conclusions and Future
Future Research Directions
Conclusion
Overview of Methods
[Figure: the three families of methods: traditional (feature extraction, cost aggregation, matching), hybrid, and direct regression with an NN predictor, each producing an estimated disparity]
Future Research

Input: Most techniques cannot handle high-resolution input, or require calibrated images.

This is mainly due to computation and memory requirements, and current hardware limitations.

Developing lighter deep architectures remains desirable, especially for mobile platforms.
Future Research

Accuracy: Although refinement modules can improve the resolution of the estimated depth maps, it is still small compared to the resolution of the input images. As such, deep learning techniques find it difficult to recover small details, e.g., vegetation and hair.

Also, most of the techniques discretize the depth range. Although some methods can achieve sub-pixel accuracy, changing the depth range or the discretization frequency requires retraining the networks.

Another issue is that accuracy, in general, varies across depth ranges. Some recent works, e.g., [74], have tried to address this problem, but it remains open and challenging since it is highly related to the data-bias issue and to the type of loss functions used to train the network. Accuracy of existing methods is also affected by complex scenarios, e.g., occlusions, highly cluttered scenes, and objects with complex material properties.
Future Research
Performance: Complex deep networks are very expensive in terms of memory requirements.

Memory footprint is a major issue when dealing with high-resolution images and when aiming to reconstruct high-resolution depth maps.

While this can be mitigated by using multi-scale and part-based reconstruction techniques, it can result in high computation time.
Future Research
Training: Deep learning techniques rely heavily on the availability of training datasets annotated with ground-truth labels. Obtaining ground-truth labels for depth reconstruction is very expensive.

Existing techniques mitigate this problem either by designing loss functions that do not require 3D annotations, or by using domain adaptation and transfer learning strategies.

Domain adaptation techniques have recently been attracting more attention since, with these techniques, one can train with synthetic data, which is easier to obtain than real-world data.
Future Research
Data bias and generalization: Most of the recent deep learning-based depth reconstruction techniques have been trained and tested on publicly available benchmarks.

While this gives an indication of their performance, it is not yet clear how they generalize to completely unseen images from a completely different category.

We expect to see the emergence of large datasets, similar to ImageNet but for 3D reconstruction.
Final Remarks
We have seen many aspects of depth reconstruction using deep learning techniques.

These techniques are achieving acceptable results, and some recent developments even compete, in terms of accuracy, with traditional techniques.

Since 2014, more than 100 papers on the topic have been published in major computer vision and machine learning conferences and journals.

We have entered a new era where data-driven and machine learning techniques play a central role in image-based depth reconstruction.
