Depth Estimation From Stereo Images Using Deep Learning

Feb 28, 2023

Introduction:

In stereo vision, two images of the same scene point are triangulated to recover depth, and the depth (distance) can be computed with the following formula:

D = (f × B) / disparity, where

f = focal length, which can be found through camera calibration.
B = baseline, i.e. the physical distance between the two cameras.
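
As a quick worked example of the formula above, here is a minimal sketch. The focal length (700 px) and baseline (0.12 m) are made-up values for a hypothetical stereo rig, and the function name is only illustrative. The focal length is assumed to be expressed in pixels so that the result comes out in the same unit as the baseline.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres from a disparity in pixels: D = f * B / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: f = 700 px, B = 0.12 m (assumed values, not from a real camera).
for d in (84.0, 42.0, 21.0):
    print(f"disparity {d:5.1f} px -> depth {depth_from_disparity(d, 700.0, 0.12):.2f} m")
```

Because depth is inversely proportional to disparity, halving the disparity doubles the estimated depth, which is also why small disparity errors matter most for far-away objects.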

What is Disparity?

Finding disparity is a correspondence problem: for a given point (x, y) on an epipolar line of the left image, we look for the matching point (x - d, y) in the right image, where d is the disparity.

Now the question is: is it possible to match a single point in the left image with another point in the right image? Obviously, it is difficult, because a single pixel carries too little information. So normally the surroundings of the pixel are taken into consideration. That is why a window of some size (this value affects the disparity measurement) is taken around the pixel and compared with windows in the right image for a similarity match.
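
To make the window idea concrete, here is a minimal sketch of window matching for a single pixel. It assumes rectified grayscale images stored as NumPy arrays and uses the sum of absolute differences as the similarity measure. The function name, window size and synthetic test pair are illustrative assumptions, not part of any library.

```python
import numpy as np


def match_pixel_sad(left, right, x, y, half_win=2, max_disp=16):
    """Return the disparity d in [0, max_disp] whose right-image window centred
    at (x - d, y) has the lowest SAD against the left window centred at (x, y)."""
    y0, y1 = y - half_win, y + half_win + 1
    ref = left[y0:y1, x - half_win:x + half_win + 1].astype(np.float64)
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        xr = x - d
        if xr - half_win < 0:
            break  # the candidate window would leave the image
        cand = right[y0:y1, xr - half_win:xr + half_win + 1].astype(np.float64)
        cost = np.abs(ref - cand).sum()
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic rectified pair: every left pixel (x, y) appears at (x - 5, y) on the right.
rng = np.random.default_rng(0)
left = rng.random((32, 64))
right = np.roll(left, -5, axis=1)
print(match_pixel_sad(left, right, x=40, y=16))
```

A larger half_win makes the match more robust in textureless regions but blurs disparity edges, which is the trade-off behind the remark that the window size affects the measurement.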

There are two broad approaches to measuring disparity:

Traditional approach
Deep learning based approach

1. Traditional Approach:

Fig.3- Different traditional approaches for solving the disparity problem

Fronto-Parallel Window based approach:

The assumptions made by the fronto-parallel window based approach, and where they break down, are as follows:

Block matching assumes that all pixels inside the window are displaced by the same disparity d, which is not true all the time.

Fig.4- The black patch is not displaced by a constant amount within the given window.

Slanted surfaces deform perspectively when the viewpoint changes.

In local stereo matching, only a window of a given size is matched between the left and right images, using a measure such as SSD (Sum of Squared Differences), SAD (Sum of Absolute Differences) or NCC (Normalized Cross-Correlation). In global stereo matching, an energy function defined over the whole image is minimized instead. Semi-Global Block Matching (SGBM) is a mix of the two. As this article is about the deep learning based approach, I am not giving a detailed analysis of the traditional approach.
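
For reference, the three window measures mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration on two small patches; the function names are my own, and the NCC shown here is the zero-mean variant, which is an assumption about which flavour is meant.

```python
import numpy as np


def sad(a, b):
    """Sum of Absolute Differences: lower means more similar."""
    return np.abs(a - b).sum()


def ssd(a, b):
    """Sum of Squared Differences: lower means more similar."""
    return ((a - b) ** 2).sum()


def ncc(a, b, eps=1e-8):
    """Zero-mean Normalized Cross-Correlation: close to 1 means very similar."""
    a0 = a - a.mean()
    b0 = b - b.mean()
    return (a0 * b0).sum() / (np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()) + eps)


patch = np.array([[1.0, 2.0], [3.0, 4.0]])
brighter = 2.0 * patch + 10.0  # same structure, different gain and offset

print(sad(patch, brighter), ssd(patch, brighter), ncc(patch, brighter))
```

SAD and SSD report a large difference for the brighter patch, while NCC stays close to 1, which is why NCC is often preferred when the two cameras differ in exposure or gain.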

Slanted Window based approach:

The paper PatchMatch Stereo by M. Bleyer, C. Rhemann and C. Rother (https://www.microsoft.com/en-us/research/publication/patchmatch-stereo-stereo-matching-with-slanted-support-windows/) uses slanted support windows to solve the disparity matching problem.

A summary of the paper:

Common local stereo methods match support windows at integer-valued disparities. The implicit assumption that pixels within the support region have constant disparity does not hold for slanted surfaces and leads to a bias towards reconstructing fronto-parallel surfaces. This work overcomes this bias by estimating an individual 3D plane at each pixel onto which the support region is projected.

The constant-disparity assumption breaks down in two cases:

1. The support window contains pixels that lie on a different surface than the center pixel.
2. The window captures a surface that is slanted, i.e., not fronto-parallel.
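
The difference between a fronto-parallel and a slanted window can be illustrated with the plane parameterisation used in PatchMatch Stereo, where the disparity of a pixel (x, y) on a plane is d = a·x + b·y + c. The sketch below only evaluates that expression. The function name and the plane parameters are made-up values for illustration, and the PatchMatch search over planes is not shown.

```python
def plane_disparity(x, y, a, b, c):
    """Disparity of pixel (x, y) lying on the disparity plane d = a*x + b*y + c."""
    return a * x + b * y + c


# Fronto-parallel window: a = b = 0, so every pixel in the window shares one disparity.
fronto = [plane_disparity(x, 10, 0.0, 0.0, 12.0) for x in range(3)]

# Slanted window: disparity changes across the window (sub-pixel values are possible).
slanted = [plane_disparity(x, 10, 0.5, 0.0, 12.0) for x in range(3)]

print(fronto)
print(slanted)
```

Setting a = b = 0 recovers the classic constant-disparity window, so the slanted formulation can be seen as a generalisation of the fronto-parallel one.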

An implementation is shared at https://github.com/ibergonzani/patch-match-stereo.

In general, the steps followed to measure disparity in the traditional approach are given below:

2. Deep Learning Based Approach:

Before jumping into the deep learning based approach, one should first understand how the problem was solved in the traditional approach. It is then easier to understand how deep learning methods try to solve it.

I have tried to summarize how deep learning methods solve the disparity problem in the figure below.

In matching cost computation, the left-side window is multiplied with every candidate window on the right-hand side, one per disparity value, along the epipolar line. Stacking the results forms a 3D cost volume (most papers call it a 4D cost volume, because they also keep a feature dimension). 3D convolution is then applied to the cost volume, because we want to determine the matching disparity, which lies along the z axis (i.e. the disparity axis). This is why 3D convolution is required in the cost volume aggregation step. In the final step, different loss functions are used to optimize the whole process.
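
The cost volume construction described above can be sketched with NumPy. This is a simplified illustration, not the implementation of any particular network: it assumes feature maps of shape (C, H, W) that are already L2-normalised along the channel axis, uses a plain dot product as the matching score, and replaces the learned 3D-convolution aggregation with a simple argmax over the disparity axis. The function name and the synthetic features are assumptions.

```python
import numpy as np


def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Build a (max_disp, H, W) volume where entry [d, y, x] is the dot product
    of feat_l[:, y, x] and feat_r[:, y, x - d]. Positions with x < d stay 0."""
    _, h, w = feat_l.shape
    volume = np.zeros((max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d, so only x >= d is valid.
        volume[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :w - d]).sum(axis=0)
    return volume


# Synthetic unit-norm features with a true disparity of 3 pixels.
rng = np.random.default_rng(0)
feat_l = rng.standard_normal((8, 4, 32))
feat_l /= np.linalg.norm(feat_l, axis=0, keepdims=True)
feat_r = np.roll(feat_l, -3, axis=2)

volume = correlation_cost_volume(feat_l, feat_r, max_disp=8)
disparity = volume.argmax(axis=0)  # highest similarity along the disparity axis
print(volume.shape, disparity[0, 3:8])
```

Since this volume stores similarity rather than cost, the best match is the maximum along the disparity axis. Networks that concatenate left and right features instead of correlating them keep the channel dimension as well, which gives the 4D volume mentioned above.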

The main issue with the above approach is that 3D convolution needs a lot of memory as well as computation power. Researchers have therefore looked at how to minimize the number of 3D convolutions required, or how to avoid 3D convolution completely during the cost aggregation step.
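
A rough back-of-the-envelope calculation shows why memory becomes a concern. The sizes below are assumptions chosen for illustration (64 feature channels, 48 disparity levels, a 135 x 240 feature map, float32 values) and do not describe any specific network.

```python
def cost_volume_bytes(channels, max_disp, height, width, bytes_per_value=4):
    """Memory needed to store one cost volume tensor of shape (C, D, H, W)."""
    return channels * max_disp * height * width * bytes_per_value


# Assumed sizes: features at reduced resolution (135 x 240), 48 disparity levels.
four_d = cost_volume_bytes(64, 48, 135, 240)   # volume that keeps 64 feature channels
three_d = cost_volume_bytes(1, 48, 135, 240)   # single-channel correlation volume

print(f"4D volume: {four_d / 2**20:.1f} MiB, 3D volume: {three_d / 2**20:.1f} MiB")
```

This counts a single tensor only. Every 3D convolution layer produces further activations of comparable size, and training also has to keep them for backpropagation, so the real footprint is a multiple of this figure.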

State-of-the-Art (SOTA) Deep Learning Architectures for Disparity Estimation:

Here is the list of all important deep learning based architectures:

Fig.8- Deep Learning based Architectures

Broadly, if you analyze the architectures from 2016 until now, you will find that deep learning based methods fall into 3 categories.

Deep Learning Approach-1(Uses 3D convolution for Cost Aggregation Step):

Fig.9- GCNet(https://arxiv.org/abs/1703.04309)

Fig.10- Number of 3D convolutions used in GCNet, StereoNet and GANet. Here AA = Adaptive Aggregation Network.

Deep Learning Approach-2(Uses minimum 3D convolution for cost aggregation steps):

Here, the cost volume computation is optimized so that fewer 3D convolutions are required at the cost aggregation stage. Please check the profiling data in Fig.12 for more information on GwcNet and BGNet.
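
As one example of a more compact cost volume, GwcNet uses group-wise correlation: the feature channels are split into groups and one correlation value is computed per group. The sketch below shows only that operation for a single candidate disparity, under the assumption that the right features have already been shifted by that disparity. The function name and tensor sizes are illustrative, and the rest of the network is not shown.

```python
import numpy as np


def groupwise_correlation(feat_l, feat_r_shifted, num_groups):
    """Split C channels into num_groups groups and return the mean channel-wise
    product per group. Inputs are (C, H, W); output is (num_groups, H, W)."""
    c, h, w = feat_l.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    prod = (feat_l * feat_r_shifted).reshape(num_groups, c // num_groups, h, w)
    return prod.mean(axis=1)


rng = np.random.default_rng(0)
feat_l = rng.standard_normal((32, 4, 5))
feat_r = rng.standard_normal((32, 4, 5))  # stands in for right features shifted by d

gwc = groupwise_correlation(feat_l, feat_r, num_groups=8)
full = (feat_l * feat_r).mean(axis=0)     # ordinary single-channel correlation
print(gwc.shape, np.allclose(gwc.mean(axis=0), full))
```

With 8 groups the volume has 8 channels per disparity level instead of the full concatenated feature dimension, while still carrying more information than a single correlation value.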

Deep Learning Approach-3(Uses No 3D convolution for Cost Aggregation Step):

RAFT-Stereo uses iterative updates, which are basically a recurrent neural network, in place of the cost aggregation step. CRE-Stereo is based on the same iterative update idea, but adds a few additional steps in order to improve the disparity measurement.
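
In RAFT-Stereo, as I read the paper, the recurrent updates repeatedly look up values from a correlation volume computed between each left pixel and all right pixels on the same row. The sketch below shows only that correlation volume, with random features standing in for the output of a feature encoder. The learned recurrent update itself is not shown, and the function name and shapes are assumptions.

```python
import numpy as np


def row_correlation_volume(feat_l, feat_r):
    """feat_l, feat_r: (H, W, C) feature maps of a rectified pair.
    Returns (H, W, W) where out[i, j, k] = dot(feat_l[i, j], feat_r[i, k])."""
    return np.einsum("ijc,ikc->ijk", feat_l, feat_r)


rng = np.random.default_rng(0)
feat_l = rng.standard_normal((4, 16, 8))
feat_r = rng.standard_normal((4, 16, 8))

corr = row_correlation_volume(feat_l, feat_r)
print(corr.shape)
```

Because matching is restricted to the same row (the epipolar line of a rectified pair), the volume has shape H x W x W and is built with dot products alone, without any 3D convolution.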

My Implementation:

I have shared my implementation at https://github.com/satya15july/depth_estimation_stereo_images. I have incorporated the deep learning based architectures mentioned below in my repo. Please have a look and let me know if you have any questions.

Please check the profiling and inference data which I have collected using my repo.

Application of Disparity:

Another use of disparity is 3D image reconstruction, which is used mainly in stereo vision based applications in robotics, AR/VR, drones, etc.
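
As a minimal sketch of 3D reconstruction from disparity, each pixel can be back-projected to a 3D point using the depth formula from the introduction. The code assumes a simple pinhole camera with focal length in pixels and principal point (cx, cy). These parameters, the function name and the tiny disparity map are illustrative assumptions.

```python
import numpy as np


def disparity_to_points(disparity, focal_px, baseline_m, cx, cy):
    """Back-project a disparity map (H, W) to an (N, 3) array of XYZ points,
    using only the pixels whose disparity is greater than zero."""
    v, u = np.nonzero(disparity > 0)           # row (v) and column (u) indices
    z = focal_px * baseline_m / disparity[v, u]  # depth: Z = f * B / d
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return np.stack([x, y, z], axis=1)


# Tiny example: one valid pixel at the principal point with a disparity of 50 px.
disp = np.zeros((4, 6))
disp[2, 3] = 50.0
points = disparity_to_points(disp, focal_px=500.0, baseline_m=0.1, cx=3.0, cy=2.0)
print(points)
```

Pixels with zero or negative disparity are skipped because they would map to infinite or invalid depth.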

Conclusion:

I hope this article gives you a clear understanding of how different deep learning based architectures measure disparity from stereo images and how that helps in measuring depth. Thank you for reading this article.

If you like the content, claps are appreciated. Follow me for more informative content on Computer Vision and Deep Learning.

Reach me at

LinkedIn: www.linkedin.com/in/satya1507
Mail-Id: satya15july@outlook.com