
Monocular Depth Estimation

Elven Kim
4 min read · Aug 19, 2023


Buying a 3D camera is an expensive affair. Of course, we could buy two cheap cameras and estimate depth with the stereo camera technique, but there are five general approaches to estimating depth without a stereo camera.

(a) Monocular Depth Estimation: This technique relies on a single camera and uses computer vision algorithms to estimate depth. Popular methods include supervised learning approaches, such as monocular depth prediction with deep convolutional neural networks (CNNs) trained on large datasets (working code for this approach is given in the appendix).

(b) Focus-based Depth Estimation: This method estimates depth from the variation of focus in an image. By analyzing the sharpness or blur of different regions in an image, it is possible to estimate the depth of objects within the scene (a minimal sketch of this idea appears after this list).

(c) Motion-based Depth Estimation: This approach estimates depth from the motion observed between successive frames in a video. By analyzing the optical flow or motion vectors, depth can be recovered using techniques such as structure from motion or visual odometry.

(d) LiDAR or Time-of-Flight (ToF) Sensors: LiDAR (Light Detection and Ranging) sensors and ToF cameras emit laser or infrared light and measure the time the light takes to bounce back from objects in the scene. That round-trip time can be used to estimate depth directly.

(e) Depth from Defocus: This technique estimates depth by analyzing the defocus blur in an image. By capturing multiple images of the same scene with different focus settings, depth can be estimated based on the amount of defocus blur in each image.
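To make (b) and (e) concrete, here is a minimal sketch of a focus-measure approach, assuming a focal stack of images of the same scene taken at known focus distances. The file names, focus distances, and patch size below are hypothetical; the variance of the Laplacian is a standard sharpness score, and the focus setting that maximizes it for a patch indicates that patch's depth plane.

# Depth-from-focus sketch illustrating ideas (b)/(e).
# Assumptions: a focal stack 'focus_00.jpg'..'focus_03.jpg' captured at
# the (hypothetical) focus distances below; the sharpest slice per
# patch approximates that patch's depth.
import cv2
import numpy as np

focus_distances_cm = [10, 20, 30, 40]  # assumed focus settings
stack = [cv2.imread(f'focus_{i:02d}.jpg', cv2.IMREAD_GRAYSCALE)
         for i in range(len(focus_distances_cm))]

def sharpness(patch):
    # Variance of the Laplacian: a common focus measure.
    return cv2.Laplacian(patch, cv2.CV_64F).var()

h, w = stack[0].shape
step = 32  # patch size in pixels (assumed)
depth_map = np.zeros((h // step, w // step))

for r in range(0, h - step + 1, step):
    for c in range(0, w - step + 1, step):
        # Score this patch in every focus slice and keep the
        # focus distance of the sharpest one.
        scores = [sharpness(img[r:r + step, c:c + step]) for img in stack]
        depth_map[r // step, c // step] = focus_distances_cm[int(np.argmax(scores))]

print(depth_map)

This is only an illustration of the idea; real depth-from-focus/defocus pipelines add calibration and interpolate between focus slices for finer depth resolution.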

We will focus on technique (a): monocular depth estimation.

It is a computer vision task that involves predicting the depth information of a scene from a single image. In other words, it is the process of estimating the distance of objects in a scene from a single camera viewpoint.

Depth estimation is a crucial step towards inferring scene geometry from 2D images. The goal in monocular depth estimation is to predict a depth value for each pixel, given only a single RGB image as input. The Keras example in the references shows an approach to building a depth estimation model with a convnet and simple loss functions.
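Since "simple loss functions" do a lot of the work in such models, here is a hedged sketch of one classic example, the scale-invariant log loss of Eigen et al. (2014), written in PyTorch to match the appendix code. This is an illustrative choice, not the specific loss used by MiDaS or the Keras tutorial.

# A "simple loss" sketch for training a monocular depth convnet:
# the scale-invariant log loss (Eigen et al., 2014). Illustrative
# only; not the loss used by MiDaS or the Keras example.
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    # Work in log-depth so relative errors are penalized evenly
    # across near and far objects.
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    # The second term forgives a global scale offset between
    # prediction and ground truth.
    return (d ** 2).mean() - lam * d.mean() ** 2

# Hypothetical usage with a batch of 4 random depth maps:
pred = torch.rand(4, 1, 64, 64) + 0.1
gt = torch.rand(4, 1, 64, 64) + 0.1
print(scale_invariant_loss(pred, gt))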

The task is challenging because it requires the model to understand the complex relationships between objects in a scene and the corresponding depth cues, which are affected by factors such as lighting conditions, occlusion, and texture.

It is hard in principle because depth information is lost when a 3D scene is projected onto a 2D image, unlike with stereo cameras or depth sensors. Monocular depth estimation has many applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics.

Here I take four photos of a yellow cylinder, at distances from 18 cm down to 9 cm from the base. A nice property of monocular depth estimation is that it suppresses all the background noise.

While the cylinder itself shows no obvious colour change, the background colour gets bluer as the distance gets shorter.

This suggests that the colour intensity in the output encodes distance, so we can use it for depth estimation; a sketch after the four captions below shows one way to verify this numerically.

(1) Distance of cylinder from base = 18cm

(2) Distance of cylinder from base = 15cm

(3) Distance of cylinder from base = 12cm

(4) Distance of cylinder from base = 9cm
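One way to check the observation numerically is to run MiDaS (see the appendix) on the four photos and compare the mean prediction over the cylinder region. The file names and the region box below are hypothetical placeholders for the actual photos:

# Sketch: check that the MiDaS prediction over the cylinder region
# changes monotonically with the true distance. File names and the
# bounding box are hypothetical; adjust to the actual photos.
import cv2
import torch

midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')
midas.eval()
transform = torch.hub.load('intel-isl/MiDaS', 'transforms').small_transform

for name, dist_cm in [('cyl_18.jpg', 18), ('cyl_15.jpg', 15),
                      ('cyl_12.jpg', 12), ('cyl_09.jpg', 9)]:
    img = cv2.cvtColor(cv2.imread(name), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode='bicubic', align_corners=False).squeeze()
    # Assumed bounding box around the cylinder (y0:y1, x0:x1).
    region = pred[100:300, 200:400]
    # MiDaS predicts relative inverse depth: larger value = closer.
    print(name, dist_cm, float(region.mean()))

Because MiDaS outputs relative inverse depth, the region mean should increase as the cylinder moves closer to the camera.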

It’s important to note that while these techniques can provide depth estimation without using a stereo camera, they may have limitations in terms of accuracy and robustness depending on the specific application and conditions.

Appendix: code for monocular depth estimation

(1) MiDaS

(2) Keras

The code below is taken from MiDaS (reference 1).

# Import dependencies
import cv2
import torch
import matplotlib.pyplot as plt

# Download the MiDaS model (small, fast variant) from torch hub
midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')
midas.to('cpu')
midas.eval()

# Input transformation pipeline matching the small model
transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
transform = transforms.small_transform

# Hook into the webcam with OpenCV
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Transform input for MiDaS (OpenCV frames are BGR; the model expects RGB)
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    imgbatch = transform(img).to('cpu')

    # Make a prediction and resize it back to the frame resolution
    with torch.no_grad():
        prediction = midas(imgbatch)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode='bicubic',
            align_corners=False
        ).squeeze()

        output = prediction.cpu().numpy()

    print(output)
    plt.imshow(output)
    cv2.imshow('CV2Frame', frame)
    plt.pause(0.00001)

    # Press 'q' to quit the loop
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
plt.show()
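
A note on interpreting the output: MiDaS predicts relative inverse depth, so larger values correspond to closer surfaces, and the numbers are not in metric units; recovering centimetres would require a per-scene scale and shift. MiDaS_small is the fastest variant; the same hub repository also offers larger DPT models (e.g. DPT_Large) that trade speed for accuracy.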

The Keras code is longer, so it is not reproduced here; see the full tutorial in reference (2).

REFERENCES

(1) https://github.com/isl-org/MiDaS

(2) https://keras.io/examples/vision/depth_estimation/

--

Elven Kim

I am a researcher in the fields of Robotics, Computer Vision, and Artificial Intelligence.