CONFIDENTIAL
A Beginner's Guide to
Monocular Depth Estimation
Ryo Takahashi
Introduction
● Q1: What is monocular depth estimation (mono-depth)?
○ A1: technology that produces depth maps from a single camera
● Q2: Why is mono-depth so popular nowadays?
○ A2-1: LiDAR is not always available, due to cost among other factors
○ A2-2: the self-supervised learning scheme invented by Zhou et al. is technically amazing !!
TRI is working on mono-depth, too [link]
Evolution of mono-depth
● SfMLearner [Zhou (Google intern) et al., CVPR '17]
○ the pioneer of recent unsupervised depth-learning work
○ has performed comparably with LiDAR-supervised models
● struct2depth [Google Brain, AAAI '19]
○ has performed comparably with stereo-depth methods
● Depth from Videos in the Wild [Google AI, arXiv:1904.04998]
○ demonstrated training a mono-depth model on YouTube videos
○ learns the camera intrinsic parameters
● PackNet [TRI, arXiv:1905.02693]
○ predicts depth in metric units (e.g. metres)
○ achieved SOTA as of May 2019 (arXiv:1905)
Backbone of mono-depth
● Multiview geometry
○ basic setup: two cameras viewing the same 3D point P

$$z' p' = K R K^{-1} z p + K t \;\Leftrightarrow\; z' K^{-1} p' = R\,z K^{-1} p + t \;\Leftrightarrow\; z' x' = R\,z x + t \;\Leftrightarrow\; X' = R X + t$$

where:
K : camera intrinsics
p : pixel coordinates (homogeneous)
x : normalized coordinates (x = K^{-1} p)
X : camera coordinates (X = z x)
(a NumPy sketch of this chain follows the reading list below)
● Recommended materials
○ "Camera Calibration and 3D Reconstruction" in the OpenCV docs [1, 2]
○ "ディジタル画像処理" (Digital Image Processing) [amazon.co.jp]
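To make the chain above concrete, here is a minimal NumPy sketch (not from the slides) that back-projects a pixel with known depth in camera 1 and reprojects it into camera 2. The function name and the sample intrinsics, rotation, and translation are illustrative assumptions.

```python
import numpy as np

def reproject(p, z, K, R, t):
    """Map pixel p = (u, v) with depth z in camera 1 to camera 2.

    Follows z' p' = K R K^{-1} z p + K t step by step:
    pixel -> normalized coords -> 3D camera coords -> rigid motion -> pixel.
    """
    p_h = np.array([p[0], p[1], 1.0])     # homogeneous pixel coordinates
    x = np.linalg.inv(K) @ p_h            # normalized coordinates  x = K^{-1} p
    X = z * x                             # camera coordinates      X = z x
    X2 = R @ X + t                        # camera-2 coordinates    X' = R X + t
    p2_h = K @ X2                         # unnormalized pixel      z' p' = K X'
    return p2_h[:2] / p2_h[2], p2_h[2]    # pixel p' and its depth z'

# Toy example: 5-degree yaw and a small lateral translation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(np.deg2rad(5)), np.sin(np.deg2rad(5))
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([0.1, 0.0, 0.0])
p2, z2 = reproject((320.0, 240.0), z=10.0, K=K, R=R, t=t)
print(p2, z2)
```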
SfMLearner
● Simultaneously train the depth and pose (ego-motion) prediction networks
while synthesizing the target view from source views
○ view synthesis loss: see the reconstructed formulas below
○ challenges
  ○ moving objects
  ○ occlusion / disocclusion
  ○ non-Lambertian surfaces
○ remedy: output an explainability mask Ê_s
  ○ encourage non-zero values with a regularizer
  ○ but allow slack on overly challenging pixels
  ○ is this approach really that smart?
[Paper Fig. 2 (excerpt): the depth network takes only the target view I_t and outputs a per-pixel depth map D̂_t; the pose network takes the target view and nearby source views (e.g., I_{t-1}, I_{t+1}) and outputs the relative camera poses T̂_{t→t-1}, T̂_{t→t+1}. Both outputs are used to inverse-warp the source views onto the target view, and a photometric reconstruction loss trains the networks, so the whole framework learns from videos without depth or pose labels.]

A target pixel p_t (homogeneous coordinates) is projected onto the source view by

$$p_s \sim K \,\hat{T}_{t \to s}\, \hat{D}_t(p_t)\, K^{-1} p_t$$

and, since p_s is continuous, bilinear interpolation of I_s at p_s gives the warped image Î_s(p_t). The explainability-weighted view synthesis loss (paper Eq. 3) is

$$\mathcal{L}_{vs} = \sum_{\langle I_1, \dots, I_N \rangle \in S} \sum_{p} \hat{E}_s(p)\, \big| I_t(p) - \hat{I}_s(p) \big|$$

Since Ê_s has no direct supervision, this alone has a trivial solution (Ê_s ≡ 0), so a regularization term L_reg(Ê_s) minimizes the cross-entropy with the constant label 1 at each pixel: the network may discount factors it cannot model, but is pushed to explain as much as it can (a NumPy sketch of this loss follows below). To overcome gradient locality (gradients come only from the four neighbors of I(p_s)), a smoothness term penalizes the L1 norm of second-order gradients of the predicted depth. The final objective, over image scales l and source images s, is

$$\mathcal{L}_{final} = \sum_{l} \mathcal{L}^{l}_{vs} + \lambda_s \mathcal{L}^{l}_{smooth} + \lambda_e \sum_{s} \mathcal{L}_{reg}(\hat{E}^{l}_{s})$$

where:
K : camera intrinsics
T̂_{t→s} = [R t; 0ᵀ 1] : predicted ego-motion
Ê_s : per-pixel soft (explainability) mask
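Below is a minimal NumPy sketch of the explainability-weighted loss above. The function name, array shapes, and the regularization weight are illustrative assumptions, not the paper's code (which sums over multiple source views and scales).

```python
import numpy as np

def explainability_weighted_loss(I_t, I_s_warped, E_s, lambda_e=0.2):
    """Per-pixel L1 photometric error, down-weighted by the mask E_s (Eq. 3),
    plus a cross-entropy term against the constant label 1 so the network
    cannot trivially set E_s = 0 everywhere.

    I_t, I_s_warped : (H, W, 3) target image and inverse-warped source image
    E_s             : (H, W) predicted explainability mask in (0, 1)
    """
    photo = np.abs(I_t - I_s_warped).mean(axis=-1)   # (H, W) L1 error
    l_vs = np.mean(E_s * photo)                      # masked view synthesis loss
    l_reg = -np.mean(np.log(E_s + 1e-6))             # pulls E_s toward 1
    return l_vs + lambda_e * l_reg
```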
struct2depth
● Explicitly model the 3D motions of moving objects together with ego-motion
Ego-motion is estimated with all potentially moving objects masked out (O_0(S) keeps only the background of segmentation map S, so V is the static-scene mask):

$$V = O_0(S_1) \odot O_0(S_2) \odot O_0(S_3), \qquad E_{1\to2},\, E_{2\to3} = \psi_E(I_1 \odot V,\; I_2 \odot V,\; I_3 \odot V)$$

(paper excerpt) To model object motion, we first apply the ego-motion estimate to obtain the warped sequences (Î_{1→2}, I_2, Î_{3→2}) and (Ŝ_{1→2}, S_2, Ŝ_{3→2}), where the effect of ego-motion has been removed. Assuming that depth and ego-motion estimates are correct, misalignments within the image sequence are caused only by moving objects. Outlines of potentially moving objects are provided by an off-the-shelf algorithm (He et al. 2017), similar to prior work that uses optical flow (Yang et al. 2018a), not trained on either of the datasets of interest. For every object instance i in the image, its motion estimate is computed as

$$M^{(i)}_{1\to2},\, M^{(i)}_{2\to3} = \psi_M\big(\hat{I}_{1\to2} \odot O_i(\hat{S}_{1\to2}),\; I_2 \odot O_i(S_2),\; \hat{I}_{3\to2} \odot O_i(\hat{S}_{3\to2})\big) \tag{3}$$

Note that while M^{(i)}_{1→2}, M^{(i)}_{2→3} ∈ ℝ⁶ represent object motions, they in fact model how the camera would have had to move in order to explain the object's appearance, rather than the object motion directly. The actual 3D motion vectors are obtained by tracking the voxel movements before and after the object-movement transform in the respective region. An inverse-warping operation then moves the objects according to the predicted motions. The final warping result combines the individual warps from moving objects Î^{(i)} with the ego-motion warp Î:

$$\hat{I}^{(F)}_{1\to2} = \underbrace{\hat{I}_{1\to2} \odot V}_{\text{gradient w.r.t. } E} \;+\; \sum_{i=1}^{N} \underbrace{\hat{I}^{(i)}_{1\to2} \odot O_i(S_2)}_{\text{gradient w.r.t. } M} \tag{4}$$
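A minimal sketch of the composition in Eq. 4, assuming the per-object warped images and binary instance masks have already been computed; the function and variable names are illustrative, not from the struct2depth code.

```python
import numpy as np

def compose_final_warp(I_ego, V, I_obj, O_masks):
    """struct2depth Eq. 4: use the ego-motion warp on the static background
    (mask V) and each object's own warp inside its instance mask.

    I_ego   : (H, W, 3) source image warped by ego-motion only
    V       : (H, W) binary mask of the static background
    I_obj   : list of (H, W, 3) images warped by each object's motion
    O_masks : list of (H, W) binary instance masks
    """
    I_full = I_ego * V[..., None]                 # gradients flow to ego-motion E
    for I_i, O_i in zip(I_obj, O_masks):
        I_full = I_full + I_i * O_i[..., None]    # gradients flow to object motion M
    return I_full
```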
Core idea: compute a 6-DoF motion vector for each object instance
Technical advantage: weights can be updated online !!
Evaluation:
・notable improvement over SfMLearner !!
・competitive with stereo-depth and combined methods
One more thing:
Depth from Videos in the Wild [1/2]
● Train a mono-depth model on YouTube8M
○ challenge: videos come from many different cameras
○ approach: learn the camera intrinsics, including lens distortion
○ issue: what if $z' p' = K R K^{-1} z p + K t$ holds with an incorrect $\tilde{K}$?
  ○ i.e., can $K R K^{-1} = \tilde{K} R \tilde{K}^{-1}$ hold with $\tilde{K} \neq K$?
  ○ proof: if $R \neq I$, then $\tilde{K} = K$ (see the paper; a numeric check follows below)
○ evaluation: train/val with "Quadcopter" videos
  ○ the learned intrinsics' errors are within a few pixels
  ○ but motion and intrinsics are correlated…
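A quick numeric check of the claim above, not from the paper: for a tilt rotation (about the x-axis, so R ≠ I), the motion-induced warp K R K^{-1} changes when the focal length is wrong, so an incorrect K̃ cannot reproduce it. The values are illustrative; note that for a pure roll about the optical axis the focal length would cancel out, so a generic rotation is needed.

```python
import numpy as np

def induced_warp(K, R):
    """Rotation-induced homography K R K^{-1} (the t = 0 case of the slide 4 chain)."""
    return K @ R @ np.linalg.inv(K)

K_true = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_bad = np.array([[400.0, 0.0, 320.0], [0.0, 400.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(np.deg2rad(5)), np.sin(np.deg2rad(5))
R_tilt = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])  # x-axis tilt

diff = np.abs(induced_warp(K_true, R_tilt) - induced_warp(K_bad, R_tilt)).max()
print(diff)  # clearly nonzero: a wrong focal length is detectable from rotation
```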
Depth from Videos in the Wild [2/2]
● Super-readable code is available in google-research.git
○ you can study diverse training tricks from the authors' comments [see also here]
PackNet [1/2]
● Directly fit depth maps to a unit of measurement
○ inherent drawback of mono-depth: scale ambiguity
○ existing works scale depth maps with LiDAR ground truth
○ PackNet instead leverages the camera's velocity
○ main contribution: 3D packing and unpacking blocks
  ○ traditional upsampling fails to preserve enough detail
to recover accurate depth
  ○ PackNet:
  ∙ Space2Depth: don't squash spatial dimensions; hold them by folding into channels (sketch below)
  ∙ 3D Conv: compress the folded features while keeping key spatial details
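A minimal NumPy version of the Space2Depth idea referenced above: each r×r spatial block is folded into channels, so downsampling discards nothing, unlike pooling or strided convolution; PackNet's packing block then compresses the folded tensor with a 3D convolution. The shapes and block size are illustrative assumptions.

```python
import numpy as np

def space_to_depth(x, r=2):
    """(H, W, C) -> (H/r, W/r, r*r*C): fold r x r blocks into channels.
    Invertible (a depth_to_space reverses it), so spatial detail is retained."""
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)   # split spatial dims into blocks
    x = x.transpose(0, 2, 1, 3, 4)           # group each block's offsets together
    return x.reshape(H // r, W // r, r * r * C)

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
print(space_to_depth(x).shape)  # (2, 2, 4): half resolution, no information lost
```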
PackNet [2/2]
● TRI has achieved SOTA !!
Summary and Discussion
● Summary
○ self-supervised monocular depth learning has evolved remarkably over the past 3-4 years
○ its backbone is traditional geometric computer vision, so let's study the classics again !!
○ TRI has competitive technical assets in this area
● Discussion