CONFIDENTIAL
A Beginner's Guide to
Monocular Depth Estimation
Ryo Takahashi
Introduction
● Q1: What is monocular depth estimation (mono-depth)?
○ A1: technology that produces depth maps from a single camera
● Q2: Why is mono-depth so popular nowadays?
○ A2-1: LiDAR is not always available, due to cost among other factors
○ A2-2: the self-supervised learning scheme invented by Zhou et al. is technically amazing !!
TRI is working on mono-depth, too [link]
Evolution of mono-depth
● SfMLearner [Zhou (Google intern) et al., CVPR '17]
○ the pioneer of recent unsupervised depth-learning work
○ has performed comparably with LiDAR-supervised models
● struct2depth [Google Brain, AAAI '19]
○ has performed comparably with stereo-depth methods
● Depth from Videos in the Wild [Google AI, arXiv:1904.04998]
○ demonstrated training a mono-depth model on YouTube videos
○ learns the camera intrinsic parameters
● PackNet [TRI, arXiv:1905.02693]
○ predicts depth in metric units (e.g. metres)
○ achieved SOTA as of May 2019 (arXiv:1905)
Backbone of mono-depth
● Multiview geometry
○ basic setup: two cameras viewing the same 3D point P

$$z' p' = K R K^{-1} z p + K t \;\Leftrightarrow\; z' K^{-1} p' = R\,z K^{-1} p + t \;\Leftrightarrow\; z' x' = R\,z x + t \;\Leftrightarrow\; X' = R X + t$$

where:
K : camera intrinsics
p : pixel coordinates (homogeneous)
x : normalized coordinates (x = K^{-1} p)
X : camera coordinates (X = z x)
(a NumPy sketch of this chain follows the reading list below)
● Recommended materials
○ "Camera Calibration and 3D Reconstruction" in the OpenCV docs [1, 2]
○ "ディジタル画像処理" (Digital Image Processing) [amazon.co.jp]
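To make the chain above concrete, here is a minimal NumPy sketch (not from the slides) that back-projects a pixel with known depth in camera 1 and reprojects it into camera 2. The function name and the sample intrinsics, rotation, and translation are illustrative assumptions.

```python
import numpy as np

def reproject(p, z, K, R, t):
    """Map pixel p = (u, v) with depth z in camera 1 to camera 2.

    Follows z' p' = K R K^{-1} z p + K t step by step:
    pixel -> normalized coords -> 3D camera coords -> rigid motion -> pixel.
    """
    p_h = np.array([p[0], p[1], 1.0])     # homogeneous pixel coordinates
    x = np.linalg.inv(K) @ p_h            # normalized coordinates  x = K^{-1} p
    X = z * x                             # camera coordinates      X = z x
    X2 = R @ X + t                        # camera-2 coordinates    X' = R X + t
    p2_h = K @ X2                         # unnormalized pixel      z' p' = K X'
    return p2_h[:2] / p2_h[2], p2_h[2]    # pixel p' and its depth z'

# Toy example: 5-degree yaw and a small lateral translation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(np.deg2rad(5)), np.sin(np.deg2rad(5))
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([0.1, 0.0, 0.0])
p2, z2 = reproject((320.0, 240.0), z=10.0, K=K, R=R, t=t)
print(p2, z2)
```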
SfMLearner
● Simultaneously train the depth and pose (ego-motion) prediction networks
while synthesizing the target view from source views
○ view synthesis loss: see the reconstructed formulas below
○ challenges
  ○ moving objects
  ○ occlusion / disocclusion
  ○ non-Lambertian surfaces
○ remedy: output an explainability mask Ê_s
  ○ encourage non-zero values with a regularizer
  ○ but allow slack on overly challenging pixels
  ○ is this approach really that smart?
[Paper Fig. 2 (excerpt): the depth network takes only the target view I_t and outputs a per-pixel depth map D̂_t; the pose network takes the target view and nearby source views (e.g., I_{t-1}, I_{t+1}) and outputs the relative camera poses T̂_{t→t-1}, T̂_{t→t+1}. Both outputs are used to inverse-warp the source views onto the target view, and a photometric reconstruction loss trains the networks, so the whole framework learns from videos without depth or pose labels.]

A target pixel p_t (homogeneous coordinates) is projected onto the source view by

$$p_s \sim K \,\hat{T}_{t \to s}\, \hat{D}_t(p_t)\, K^{-1} p_t$$

and, since p_s is continuous, bilinear interpolation of I_s at p_s gives the warped image Î_s(p_t). The explainability-weighted view synthesis loss (paper Eq. 3) is

$$\mathcal{L}_{vs} = \sum_{\langle I_1, \dots, I_N \rangle \in S} \sum_{p} \hat{E}_s(p)\, \big| I_t(p) - \hat{I}_s(p) \big|$$

Since Ê_s has no direct supervision, this alone has a trivial solution (Ê_s ≡ 0), so a regularization term L_reg(Ê_s) minimizes the cross-entropy with the constant label 1 at each pixel: the network may discount factors it cannot model, but is pushed to explain as much as it can (a NumPy sketch of this loss follows below). To overcome gradient locality (gradients come only from the four neighbors of I(p_s)), a smoothness term penalizes the L1 norm of second-order gradients of the predicted depth. The final objective, over image scales l and source images s, is

$$\mathcal{L}_{final} = \sum_{l} \mathcal{L}^{l}_{vs} + \lambda_s \mathcal{L}^{l}_{smooth} + \lambda_e \sum_{s} \mathcal{L}_{reg}(\hat{E}^{l}_{s})$$

where:
K : camera intrinsics
T̂_{t→s} = [R t; 0ᵀ 1] : predicted ego-motion
Ê_s : per-pixel soft (explainability) mask
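Below is a minimal NumPy sketch of the explainability-weighted loss above. The function name, array shapes, and the regularization weight are illustrative assumptions, not the paper's code (which sums over multiple source views and scales).

```python
import numpy as np

def explainability_weighted_loss(I_t, I_s_warped, E_s, lambda_e=0.2):
    """Per-pixel L1 photometric error, down-weighted by the mask E_s (Eq. 3),
    plus a cross-entropy term against the constant label 1 so the network
    cannot trivially set E_s = 0 everywhere.

    I_t, I_s_warped : (H, W, 3) target image and inverse-warped source image
    E_s             : (H, W) predicted explainability mask in (0, 1)
    """
    photo = np.abs(I_t - I_s_warped).mean(axis=-1)   # (H, W) L1 error
    l_vs = np.mean(E_s * photo)                      # masked view synthesis loss
    l_reg = -np.mean(np.log(E_s + 1e-6))             # pulls E_s toward 1
    return l_vs + lambda_e * l_reg
```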
struct2depth
● Explicitly model the 3D motions of moving objects together with ego-motion
Ego-motion is estimated with all potentially moving objects masked out (O_0(S) keeps only the background of segmentation map S, so V is the static-scene mask):

$$V = O_0(S_1) \odot O_0(S_2) \odot O_0(S_3), \qquad E_{1\to2},\, E_{2\to3} = \psi_E(I_1 \odot V,\; I_2 \odot V,\; I_3 \odot V)$$

(paper excerpt) To model object motion, we first apply the ego-motion estimate to obtain the warped sequences (Î_{1→2}, I_2, Î_{3→2}) and (Ŝ_{1→2}, S_2, Ŝ_{3→2}), where the effect of ego-motion has been removed. Assuming that depth and ego-motion estimates are correct, misalignments within the image sequence are caused only by moving objects. Outlines of potentially moving objects are provided by an off-the-shelf algorithm (He et al. 2017), similar to prior work that uses optical flow (Yang et al. 2018a), not trained on either of the datasets of interest. For every object instance i in the image, its motion estimate is computed as

$$M^{(i)}_{1\to2},\, M^{(i)}_{2\to3} = \psi_M\big(\hat{I}_{1\to2} \odot O_i(\hat{S}_{1\to2}),\; I_2 \odot O_i(S_2),\; \hat{I}_{3\to2} \odot O_i(\hat{S}_{3\to2})\big) \tag{3}$$

Note that while M^{(i)}_{1→2}, M^{(i)}_{2→3} ∈ ℝ⁶ represent object motions, they in fact model how the camera would have had to move in order to explain the object's appearance, rather than the object motion directly. The actual 3D motion vectors are obtained by tracking the voxel movements before and after the object-movement transform in the respective region. An inverse-warping operation then moves the objects according to the predicted motions. The final warping result combines the individual warps from moving objects Î^{(i)} with the ego-motion warp Î:

$$\hat{I}^{(F)}_{1\to2} = \underbrace{\hat{I}_{1\to2} \odot V}_{\text{gradient w.r.t. } E} \;+\; \sum_{i=1}^{N} \underbrace{\hat{I}^{(i)}_{1\to2} \odot O_i(S_2)}_{\text{gradient w.r.t. } M} \tag{4}$$
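A minimal sketch of the composition in Eq. 4, assuming the per-object warped images and binary instance masks have already been computed; the function and variable names are illustrative, not from the struct2depth code.

```python
import numpy as np

def compose_final_warp(I_ego, V, I_obj, O_masks):
    """struct2depth Eq. 4: use the ego-motion warp on the static background
    (mask V) and each object's own warp inside its instance mask.

    I_ego   : (H, W, 3) source image warped by ego-motion only
    V       : (H, W) binary mask of the static background
    I_obj   : list of (H, W, 3) images warped by each object's motion
    O_masks : list of (H, W) binary instance masks
    """
    I_full = I_ego * V[..., None]                 # gradients flow to ego-motion E
    for I_i, O_i in zip(I_obj, O_masks):
        I_full = I_full + I_i * O_i[..., None]    # gradients flow to object motion M
    return I_full
```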
Core idea: compute a 6-DoF motion vector for each object instance
Technical advantage: weights can be updated online !!
Evaluation:
・notable improvement over SfMLearner !!
・competitive with stereo-depth and combined methods
One more thing:
Depth from Videos in the Wild [1/2]
● Train a mono-depth model on YouTube8M
○ challenge: videos come from many different cameras
○ approach: learn the camera intrinsics, including lens distortion
○ issue: what if $z' p' = K R K^{-1} z p + K t$ holds with an incorrect $\tilde{K}$?
  ○ i.e., can $K R K^{-1} = \tilde{K} R \tilde{K}^{-1}$ hold with $\tilde{K} \neq K$?
  ○ proof: if $R \neq I$, then $\tilde{K} = K$ (see the paper; a numeric check follows below)
○ evaluation: train/val with "Quadcopter" videos
  ○ the learned intrinsics' errors are within a few pixels
  ○ but motion and intrinsics are correlated…
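A quick numeric check of the claim above, not from the paper: for a tilt rotation (about the x-axis, so R ≠ I), the motion-induced warp K R K^{-1} changes when the focal length is wrong, so an incorrect K̃ cannot reproduce it. The values are illustrative; note that for a pure roll about the optical axis the focal length would cancel out, so a generic rotation is needed.

```python
import numpy as np

def induced_warp(K, R):
    """Rotation-induced homography K R K^{-1} (the t = 0 case of the slide 4 chain)."""
    return K @ R @ np.linalg.inv(K)

K_true = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_bad = np.array([[400.0, 0.0, 320.0], [0.0, 400.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(np.deg2rad(5)), np.sin(np.deg2rad(5))
R_tilt = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])  # x-axis tilt

diff = np.abs(induced_warp(K_true, R_tilt) - induced_warp(K_bad, R_tilt)).max()
print(diff)  # clearly nonzero: a wrong focal length is detectable from rotation
```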
Depth from Videos in the Wild [2/2]
● Super-readable code is available in google-research.git
○ you can study diverse training tricks from the authors' comments [see also here]
PackNet [1/2]
● Directly fit depth maps to a unit of measurement
○ inherent drawback of mono-depth: scale ambiguity
○ existing works scale depth maps with LiDAR ground truth
○ PackNet instead leverages the camera's velocity
○ main contribution: 3D packing and unpacking blocks
  ○ traditional upsampling fails to preserve enough detail
to recover accurate depth
  ○ PackNet:
  ∙ Space2Depth: don't squash spatial dimensions; hold them by folding into channels (sketch below)
  ∙ 3D Conv: compress the folded features while keeping key spatial details
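A minimal NumPy version of the Space2Depth idea referenced above: each r×r spatial block is folded into channels, so downsampling discards nothing, unlike pooling or strided convolution; PackNet's packing block then compresses the folded tensor with a 3D convolution. The shapes and block size are illustrative assumptions.

```python
import numpy as np

def space_to_depth(x, r=2):
    """(H, W, C) -> (H/r, W/r, r*r*C): fold r x r blocks into channels.
    Invertible (a depth_to_space reverses it), so spatial detail is retained."""
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)   # split spatial dims into blocks
    x = x.transpose(0, 2, 1, 3, 4)           # group each block's offsets together
    return x.reshape(H // r, W // r, r * r * C)

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
print(space_to_depth(x).shape)  # (2, 2, 4): half resolution, no information lost
```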
PackNet [2/2]
● TRI has achieved SOTA !!
Summary and Discussion
● Summary
○ self-supervised monocular depth learning has evolved remarkably over the past 3-4 years
○ its backbone is traditional geometric computer vision, so let's study the classics again !!
○ TRI has competitive technical assets in this area
● Discussion