Getting Started with Depth Estimation using MiDaS and Python

Nabeel Khan · Published in Artificialis · Jun 12, 2023

Fig — 01 Image by Author Inspired by Sambeetarts on Pixabay

Measuring the distance of an object from a camera poses a significant challenge in the computer vision domain, due to the lack of inherent depth information in 2D images, perspective distortion, variable object sizes, camera calibration requirements, and occlusion in complex scenes. Distance estimation via perspective projection, for instance, relies on variables such as the sensor size, focal length, and the actual height of the object. Computing these unknown variables adds to the complexity of the task.

Fig — 02 (Formula for Distance to Object)
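For reference, the commonly cited perspective-projection relation (likely what Fig — 02 depicts) is:

distance to object (mm) = (focal length (mm) × real object height (mm) × image height (px)) / (object height (px) × sensor height (mm))

Every quantity on the right-hand side has to be known or estimated, which is exactly where the complexity mentioned above comes from.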

A series of traditional and deep learning based approaches have been in service for quite some time to provide an effective solution to distance estimation. Solutions involving stereo vision have proven to be effective and accurate in depth calculation; however, there is always a search for a more efficient and inexpensive alternative. Deep learning has shown itself to thrive under such constraints, pushing the boundaries of what is possible and bringing monocular depth estimation models to reality, one of which we'll be exploring in this article.

Following my recent tutorial on distance measurement using MediaPipe, I intended to build upon the notion of monocular depth estimation by covering all feasible approaches, to provide you with the best options for your fun projects.

In this article, I'll be estimating the distance of an object using a hybrid of the MediaPipe pose estimation module and the MiDaS depth estimation model. But before that, let's take a quick overview of what we'll be covering in this article.

  1. MiDaS overview.
  2. Distance measurement using MediaPipe landmarks and the MiDaS depth map.

MiDaS

Fig — 03 (MiDaS Depth Estimation) Image by Author Inspired by Mikes-Photography on Pixabay

MiDaS (Multiple Depth Estimation Accuracy with Single Network) is a deep learning based residual model, built atop ResNet, for monocular depth estimation. MiDaS has shown promising results in depth estimation from single images. Below is a generic overview of the MiDaS architecture:

  1. Encoder-Decoder Architecture

MiDaS is based on an encoder-decoder architecture, where the encoder is responsible for high-level feature extraction and the decoder generates the depth map from these features via up-sampling (a toy sketch of this structure appears after this overview).

2. Backbone

MiDaS typically uses a residual network (ResNet-50 or ResNet-101) for feature extraction because it is robust to vanishing gradients, allowing MiDaS to extract multi-channel feature maps from input images and capture hierarchical information at varying scales.

3. Multi-Scale Feature Fusion

Skip connections and feature fusion are incorporated within MiDaS to allow accurate depth estimation. Feature maps from earlier layers are connected to later layers via skip connections so that low-level details remain accessible during up-sampling. With feature fusion, the multi-scale feature maps are combined to ensure that both local and global information is exploited effectively for depth estimation.

4. Up-Sampling and Refinement

The final depth map is generated using up-sampling. The techniques generally used for up-sampling are bilinear interpolation or transposed convolutions, which increase the spatial resolution of the feature maps. Feature fusion combines the up-sampled maps with the corresponding skip connections in order to refine the depth estimate.
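To make these components concrete, below is a minimal, illustrative PyTorch sketch of an encoder-decoder with skip connections and bilinear up-sampling. This is not the actual MiDaS implementation (that lives in the intel-isl/MiDaS repository); the module names and channel sizes here are invented purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder with skip connections; NOT the real MiDaS network."""
    def __init__(self):
        super().__init__()
        # Encoder: progressively downsample and extract features
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: fuse multi-scale features and upsample back to input resolution
        self.dec2 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.dec1 = nn.Conv2d(64 + 32, 32, 3, padding=1)
        self.head = nn.Conv2d(32, 1, 3, padding=1)  # single-channel depth map

    def forward(self, x):
        f1 = self.enc1(x)   # 1/2 resolution
        f2 = self.enc2(f1)  # 1/4 resolution
        f3 = self.enc3(f2)  # 1/8 resolution
        # Upsample deep features and fuse with skip connections from earlier layers
        u2 = F.interpolate(f3, size=f2.shape[2:], mode='bilinear', align_corners=False)
        d2 = F.relu(self.dec2(torch.cat([u2, f2], dim=1)))
        u1 = F.interpolate(d2, size=f1.shape[2:], mode='bilinear', align_corners=False)
        d1 = F.relu(self.dec1(torch.cat([u1, f1], dim=1)))
        depth = self.head(F.interpolate(d1, size=x.shape[2:], mode='bilinear', align_corners=False))
        return depth.squeeze(1)

# Quick sanity check on a dummy image
# print(TinyDepthNet()(torch.randn(1, 3, 224, 224)).shape)  # -> torch.Size([1, 224, 224])

The encoder halves the resolution at each stage, while the decoder up-samples and concatenates the matching encoder features; this is the same general pattern MiDaS follows, only at a much larger scale.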

Python Code for distance measurement

import cv2
import torch
import mediapipe as mp
import numpy as np
from scipy.interpolate import RectBivariateSpline
mp_pose = mp.solutions.pose
pose = mp_pose.Pose(static_image_mode=False)

Importing the required packages and initializing the MediaPipe pose estimation class mp_pose.Pose.

#Downloading the model from TorchHub.
midas = torch.hub.load('intel-isl/MiDaS','MiDaS_small')
midas.to('cpu')
midas.eval()

Downloading the MiDaS_small model from Torch Hub. Alternatively, you can clone the MiDaS repository from GitHub once and run it locally. There are three variants of MiDaS available on Torch Hub, and they can be downloaded by substituting DPT_Large or DPT_Hybrid for MiDaS_small. The general performance of the three variants is given below:

  1. Small variant: lowest accuracy, highest inference speed.
  2. Hybrid variant: medium accuracy, medium inference speed.
  3. Large variant: highest accuracy, lowest inference speed.

If you have a CUDA-compatible GPU, you can replace midas.to('cpu') with midas.to('cuda') to maximize the inference speed.
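A minimal sketch of how you might make that device selection automatic (assuming the rest of the code then uses this device variable instead of the hard-coded 'cpu'):

# Pick the GPU automatically when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
midas.to(device)

# Later, the input batch would be moved to the same device:
# imgbatch = transform(img).to(device)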

#Performing preprocessing on input for the small model
transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
transform = transforms.small_transform

#Converting depth to distance
def depth_to_distance(depth_value, depth_scale):
    return -1.0 / (depth_value * depth_scale)

Applying the requisite preprocessing to the input image/video frame for the MiDaS small model, followed by a function depth_to_distance that converts the calculated depth values into the corresponding distance values.

cap = cv2.VideoCapture('')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    cv2.imshow('Walking', img)

    if cv2.waitKey(2) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Reading the video input and applying a color space transformation using the cv2.cvtColor function. As cv2 reads images in BGR format, we need to convert them to RGB for standard visuals. Let's run the code to check if it's working right so far.

Fig — 04 (A person walking in the woods) Gif by Author Inspired by Matthias Groeneveld on Pixabay

Next, we will extract the landmarks from the video frame using MediaPipe with the code below.

# Detect the body landmarks in the frame
results = pose.process(img)

# Check if landmarks are detected
if results.pose_landmarks is not None:
    # Draw landmarks
    mp_drawing = mp.solutions.drawing_utils
    mp_drawing.draw_landmarks(img, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
Fig — 05 (Keypoints Detected)

After successfully drawing the detected landmarks, we'll now extract the waist landmarks as a reference for the distance estimation. Let's write the code to extract the waist landmarks.

# Extract landmark coordinates
landmarks = []
for landmark in results.pose_landmarks.landmark:
    landmarks.append((landmark.x, landmark.y, landmark.z))

# Extract left and right waist (hip) landmarks
waist_landmarks = [results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_HIP],
                   results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_HIP]]

# Finding the midpoint of the waist
mid_point = ((waist_landmarks[0].x + waist_landmarks[1].x) / 2,
             (waist_landmarks[0].y + waist_landmarks[1].y) / 2)
mid_x, mid_y = mid_point

Extracting the x and y coordinate values for both landmarks and calculating the midpoint. Note that MediaPipe returns normalized coordinates in the range 0 to 1, so they'll need to be scaled to pixel coordinates before sampling the depth map. You can pick any landmark of your choice, depending on the use case, from the pose landmark list on MediaPipe.
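For instance, a hypothetical substitution that tracks the head instead of the waist would simply read a different landmark:

# Illustrative alternative: use the nose landmark instead of the hip midpoint
nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
mid_x, mid_y = nose.x, nose.y   # still normalized to the range 0 to 1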

Moving forward, we'll pass our video through the MiDaS depth estimation model to get the depth map.

imgbatch = transform(img).to('cpu')

# Making a prediction
with torch.no_grad():
    prediction = midas(imgbatch)
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode='bicubic',
        align_corners=False
    ).squeeze()

output = prediction.cpu().numpy()

# Normalizing the output predictions for cv2 to read
output_norm = cv2.normalize(output, None, 0, 1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)

cv2.imshow('Walking', output_norm)
Fig — 06 (MiDaS Depth Map)

Above is the depth map extracted through MiDaS. Also, change the waitKey value to 1 to decrease the frame delay. You can convert the depth map to a colored map using the code below, but we'll be using the standard black-and-white map in this project.

#Colored Depth map
output_norm = (output_norm*255).astype(np.uint8)
output_norm = cv2.applyColorMap(output_norm, cv2.COLORMAP_MAGMA)

We'll then use the waist landmarks extracted earlier to calculate the depth value from the MiDaS depth map.

#Creating a spline array of non-integer grid
h , w = output_norm.shape
x_grid = np.arange(w)
y_grid = np.arange(h)

# Create a spline object using the output_norm array
spline = RectBivariateSpline(y_grid, x_grid, output_norm)

The spline created in the above code snippet gives a smooth, continuous representation of the output_norm array, so it can be evaluated at non-integer (sub-pixel) coordinates. This matters here because the landmark midpoint will rarely fall exactly on a pixel grid point, and interpolating the depth map gives more accurate and versatile computations or visualizations than rounding to the nearest pixel.

#Passing the x and y coordinates to the distance function to calculate distance.
#Tweak the depth scale to see what suits you!
depth_scale = 1
# The landmark coordinates are normalized, so scale them to pixel coordinates first
depth_mid_filt = spline(mid_y * h, mid_x * w)
depth_midas = depth_to_distance(depth_mid_filt, depth_scale)

#Displaying the distance.
cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(depth_mid_filt[0][0], precision=3)),
            (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)
Fig — 07 (Depth Calculation)

The depth values fluctuate a bit. To stabilize them, we'll apply an exponential moving average (EMA) filter to the depth values and then see the improvement.

#Adjust the alpha value to suit your needs
alpha = 0.2
previous_depth = 0.0

#Applying exponential moving average filter
def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth
Fig — 08 (Applying EMA Filter)

Here we can see a considerable reduction in fluctuation after applying the exponential moving average filter to the distance value.
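For completeness, a minimal sketch of how the filter is wired into the per-frame loop: the converted distance is passed through the filter before being drawn on the frame (the division by 10 is just the cosmetic scaling used in the full listing below).

# Smooth the per-frame distance before displaying it
depth_smoothed = (apply_ema_filter(depth_midas) / 10)[0][0]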

Let's take a look at the entire code snippet.

import cv2
import torch
import matplotlib.pyplot as plt
import mediapipe as mp
import numpy as np
import shutil
from scipy.interpolate import RectBivariateSpline

# To clear the model cache
# shutil.rmtree(torch.hub.get_dir(), ignore_errors=True)

# Initializing the body landmarks detection module
mp_pose = mp.solutions.pose
pose = mp_pose.Pose(static_image_mode=False)

# Download the model
midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')
midas.to('cpu')
midas.eval()

# Process image
transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
transform = transforms.small_transform

alpha = 0.2
previous_depth = 0.0
depth_scale = 1.0

# Applying exponential moving average filter
def apply_ema_filter(current_depth):
    global previous_depth
    filtered_depth = alpha * current_depth + (1 - alpha) * previous_depth
    previous_depth = filtered_depth  # Update the previous depth value
    return filtered_depth


# Define depth to distance
def depth_to_distance(depth_value, depth_scale):
    return 1.0 / (depth_value * depth_scale)

def depth_to_distance1(depth_value, depth_scale):
    return -1.0 / (depth_value * depth_scale)

cap = cv2.VideoCapture('distance1.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Detect the body landmarks in the frame
    results = pose.process(img)

    # Check if landmarks are detected
    if results.pose_landmarks is not None:
        # Draw landmarks
        # mp_drawing = mp.solutions.drawing_utils
        # mp_drawing.draw_landmarks(img, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)

        # Extract landmark coordinates
        landmarks = []
        for landmark in results.pose_landmarks.landmark:
            landmarks.append((landmark.x, landmark.y, landmark.z))

        waist_landmarks = [results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_HIP],
                           results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_HIP]]

        mid_point = ((waist_landmarks[0].x + waist_landmarks[1].x) / 2,
                     (waist_landmarks[0].y + waist_landmarks[1].y) / 2,
                     (waist_landmarks[0].z + waist_landmarks[1].z) / 2)
        mid_x, mid_y, _ = mid_point

        imgbatch = transform(img).to('cpu')

        # Making a prediction
        with torch.no_grad():
            prediction = midas(imgbatch)
            prediction = torch.nn.functional.interpolate(
                prediction.unsqueeze(1),
                size=img.shape[:2],
                mode='bicubic',
                align_corners=False
            ).squeeze()

        output = prediction.cpu().numpy()
        output_norm = cv2.normalize(output, None, 0, 1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)

        # Creating a grid for sub-pixel sampling of the depth map
        h, w = output_norm.shape
        x_grid = np.arange(w)
        y_grid = np.arange(h)

        # Create a spline object using the output_norm array
        spline = RectBivariateSpline(y_grid, x_grid, output_norm)
        # Landmark coordinates are normalized, so scale them to pixel coordinates
        depth_mid_filt = spline(mid_y * h, mid_x * w)
        depth_midas = depth_to_distance(depth_mid_filt, depth_scale)
        depth_mid_filt = (apply_ema_filter(depth_midas) / 10)[0][0]

        cv2.putText(img, "Depth in unit: " + str(np.format_float_positional(depth_mid_filt, precision=3)),
                    (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 3)

    cv2.imshow('Walking', img)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Conclusion

In this article we calculated the distance of an object from the camera using the MiDaS depth estimation model while leveraging a reference landmark extracted with MediaPipe. The same approach can be used to estimate the distance of multiple objects/people and can be integrated into mini projects based on proximity.

You can download the source code using the GitHub link.

References

[1] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler and Vladlen Koltun, Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer (2020), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

[2] René Ranftl, Alexey Bochkovskiy and Vladlen Koltun, Vision Transformers for Dense Prediction (2021), arXiv preprint.

[3] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg and Matthias Grundmann, MediaPipe: A Framework for Building Perception Pipelines (2019), ArXiv preprint.
