
A Comparative Study on Monocular Depth Estimation

AtharvMalusare
Nov 17, 2023


In the realm of computer vision, understanding the depth of a scene from a single image or a video frame is a pivotal challenge. Monocular Depth Estimation (MDE) addresses this task by inferring the three-dimensional (3D) spatial information, particularly the depth, from a two-dimensional (2D) image captured by a single camera.

Depth estimation can generally be achieved in two ways:
1. Stereo: multiple views of the scene (typically captured with two cameras).
2. Monocular: a single view of the scene (one camera is sufficient).
This article focuses on monocular depth estimation.

The Significance of Depth Perception

Depth perception, an innate capability of human vision, allows us to perceive the world in three dimensions. This perception helps us navigate our environment, estimate distances, and understand spatial relationships between objects. Achieving this level of depth perception in artificial systems has profound implications across various fields like:

  • Autonomous Systems: For self-driving cars, drones, or robots, understanding depth is crucial for navigation and obstacle avoidance.
Fig: Autonomous Driving System
  • Augmented Reality (AR) and Virtual Reality (VR): Providing a realistic sense of depth enhances user experiences in AR and VR applications.
Fig: AR/VR Interaction
  • Image and Video Processing: Depth information enriches image and video understanding, enabling tasks like object segmentation, scene reconstruction, and more.
Fig: Scene Reconstruction

We discussed the importance of monocular depth estimation in various technological domains. However, to truly appreciate the challenges and advancements in this field, we need to delve into the fundamentals of how humans perceive depth.

Human vision serves as the inspiration and benchmark for many technological advancements in computer vision. Understanding how our brains interpret depth cues provides invaluable insights into the complexities of inferring depth from a single image — a challenge that computer vision aims to replicate.

Exploring human depth perception offers a foundational understanding of the cues and complexities involved in perceiving the 3D world from 2D images. This knowledge serves as a launching point for the innovative techniques and algorithms employed in monocular depth estimation.

By bridging the gap between human depth perception and the technological pursuit of monocular depth estimation, we gain a deeper appreciation for the intricacies involved in replicating such capabilities in artificial systems. This understanding sets the stage for exploring the methodologies and advancements in the field of depth estimation from a single image perspective.

Depth Perception in Human Vision

Binocular Disparity:

  • The brain interprets the differences between the images captured by the two eyes to perceive depth; the offset between them lets it triangulate the distance to objects.

Motion Parallax:

  • The brain also gauges depth by observing how objects move relative to each other as we move: nearby objects appear to move faster than distant ones, aiding depth perception.

Perspective Cues:

  • Cues such as relative size, occlusion, texture gradients, and the convergence of parallel lines help us estimate depth in scenes.

Challenges in Monocular Depth Estimation

Estimating depth from a single image poses several challenges:

  1. Ambiguity: An image lacks explicit depth cues, making it challenging to discern distances accurately, especially in scenes with similar textures or poor lighting conditions.
  2. Limited Information: Unlike stereo or multi-view setups, a single image offers limited information for depth estimation.
  3. Scale Ambiguity: Determining the absolute scale of the scene from a single image is a complex problem; a common workaround is sketched below.
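
Because of this scale ambiguity, most monocular models (MiDaS included) predict relative rather than metric depth. When metric depth is needed, a common workaround is to align the relative prediction to a handful of known ground-truth values with a least-squares scale and shift; the NumPy sketch below illustrates the idea (the array names are hypothetical):

import numpy as np

def align_scale_shift(pred, gt, mask):
    """Fit s, t minimizing ||s * pred + t - gt||^2 over valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t  # relative depth mapped onto the metric scale of gt

Here the scale s and shift t are found in closed form; the sparse ground truth gt could come from a few LiDAR points or any other depth sensor.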

Evolution of Depth Estimation Techniques

Traditionally, depth estimation relied on handcrafted features and heuristics. However, the advent of deep learning, particularly Convolutional Neural Networks (CNNs), revolutionized monocular depth estimation. These networks leverage vast amounts of data to learn complex hierarchical representations, enabling more accurate depth predictions.

Methods and Techniques

Fig: Simple Understanding of the Model

Traditional Approaches vs. Modern Advancements

Transitioning from the understanding of human depth perception to computational depth estimation involves a shift from traditional methods to cutting-edge techniques:

Traditional Methods:

  • Early methods relied on handcrafted features, such as edge detection, texture analysis, and stereo matching algorithms.
  • These approaches struggled to infer depth accurately from a single image because they lacked a complex, contextual understanding of the scene.

Modern Deep Learning Techniques:

  • Deep learning, especially Convolutional Neural Networks (CNNs), brought a paradigm shift that revolutionized monocular depth estimation.
  • CNNs learn hierarchical representations from data, enabling them to capture intricate depth cues and make more accurate predictions, as the toy sketch below illustrates.
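
As a toy illustration (not MiDaS or any published architecture), the sketch below shows a minimal PyTorch encoder-decoder that regresses a dense depth map from an RGB image: the encoder's strided convolutions build increasingly abstract features, and the decoder upsamples them back to one depth value per pixel.

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: downsample to learn context, upsample to a dense depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 1/2 resolution
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 1/4 resolution
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(), # 1/8 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),     # one depth value per pixel
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

depth = TinyDepthNet()(torch.randn(1, 3, 256, 256))  # -> shape (1, 1, 256, 256)

Real models such as MiDaS use far deeper backbones, skip connections, and multi-scale fusion, but the basic in-image/out-depth-map structure is the same.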

MiDaS, as a prominent representative of deep learning models, has achieved significant recognition. Its innovative approach combines architecture design, pre-training on diverse datasets, and self-supervised learning. This results in a model that offers state-of-the-art performance in terms of depth estimation accuracy. The recent release of MiDaS 3.1 underscores its continual evolution, introducing advancements in real-time depth estimation, high-resolution depth maps, and domain adaptation.

The adoption of MiDaS models has led to breakthroughs in multiple domains. In autonomous driving, MiDaS aids in obstacle detection, improving the safety of self-driving vehicles. In augmented reality, it enhances object placement and interaction with the real world. Moreover, MiDaS models have demonstrated their utility in medical imaging, facilitating 3D reconstruction and improving the accuracy of surgical navigation.

These developments highlight the promising trajectory of monocular depth estimation, particularly when coupled with MiDaS models. Ongoing research focuses on improving the robustness of these models to varied environmental conditions and their adaptability to a broader range of applications. Furthermore, ethical considerations regarding privacy and the responsible use of depth estimation technology are emerging areas of concern and research.

But the real question you might be asking is…

What is MiDaS?

MiDaS (Multiple Depth Estimation Accuracy with Single Network) is a pioneering model designed for monocular depth estimation, renowned for its ability to estimate depth accurately and in real time from a single image.

Real-time Depth Estimation:

  • Swift Inference: MiDaS excels in providing depth estimations swiftly, making it suitable for real-time applications, such as augmented reality, robotics, and autonomous vehicles.

Utilization of Stereo Supervision:

  • Self-supervised Learning: During training, MiDaS leverages stereo image pairs, where depth information is available, to learn depth estimation. However, during inference, it operates on monocular images, showcasing its adaptability to real-world scenarios.

Architectural Innovations:

  • Pyramid Features: MiDaS employs a multi-scale architecture that integrates information from multiple levels of detail, allowing it to capture both fine-grained and coarse depth information effectively.
  • Depth Regression: Unlike traditional methods that focus on disparity maps, MiDaS directly predicts depth values, simplifying the estimation process; a minimal regression sketch follows below.
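
To make "direct depth regression" concrete, here is a generic training-step sketch that treats depth estimation as dense regression with a plain L1 loss. It is not MiDaS's actual training procedure (which mixes datasets and uses scale- and shift-invariant losses); the tiny stand-in network and the tensors are hypothetical.

import torch
import torch.nn as nn

model = nn.Sequential(                        # stand-in for a real depth network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(4, 3, 128, 128)          # hypothetical RGB batch
gt_depth = torch.rand(4, 1, 128, 128)         # hypothetical ground-truth depth maps

optimizer.zero_grad()
pred = model(images)                          # network outputs one depth value per pixel
loss = nn.functional.l1_loss(pred, gt_depth)  # regress depth directly, no disparity map
loss.backward()
optimizer.step()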

Evolution and Versions:

  • Evolutionary Versions: MiDaS has evolved through iterations such as MiDaS v2.0 or 3.0, with improvements in accuracy, robustness, and generalization capabilities.
  • Larger Datasets: Advanced versions are often fine-tuned on larger and more diverse datasets, enabling them to handle a wider range of scenes and scenarios.

Implementation

# Import dependencies
import cv2
import torch
import matplotlib.pyplot as plt

# Choose the model type: 'MiDaS_small' (MiDaS 2.0) or 'DPT_Large' (MiDaS 3.0)
# model_type = 'MiDaS_small'
model_type = 'DPT_Large'

# Download the MiDaS model from Torch Hub
midas = torch.hub.load('intel-isl/MiDaS', model_type)

# Use GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
midas.to(device)
midas.eval()

# Use transforms to resize and normalize the image
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
if model_type in ("DPT_Large", "DPT_Hybrid"):
    transform = midas_transforms.dpt_transform
else:
    transform = midas_transforms.small_transform

# Hook into OpenCV
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Transform input for MiDaS
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    imgbatch = transform(img).to(device)

    # Make a prediction and resize it back to the original frame size
    with torch.no_grad():
        prediction = midas(imgbatch)
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=img.shape[:2],
            mode='bicubic',
            align_corners=False
        ).squeeze()

    output = prediction.cpu().numpy()

    print(output)
    plt.imshow(output)
    cv2.imshow('CV2Frame', frame)
    plt.pause(0.00001)

    if cv2.waitKey(10) & 0xFF == ord('q'):
        cap.release()
        cv2.destroyAllWindows()
        break

plt.show()

To use MiDaS 2.0, just uncomment the model_type = 'MiDaS_small' line and comment out model_type = 'DPT_Large'.
MiDaS_small signifies MiDaS 2.0 and DPT_Large signifies MiDaS 3.0.
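
If you want to run MiDaS on a single image file instead of a webcam stream, the same pieces apply. A minimal sketch under that assumption (the file name input.jpg is a placeholder):

import cv2
import torch
import matplotlib.pyplot as plt

model_type = 'MiDaS_small'                    # or 'DPT_Large' / 'DPT_Hybrid'
midas = torch.hub.load('intel-isl/MiDaS', model_type)
midas.eval()

midas_transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')
transform = midas_transforms.small_transform  # use dpt_transform for the DPT models

img = cv2.cvtColor(cv2.imread('input.jpg'), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode='bicubic', align_corners=False
    ).squeeze().numpy()

plt.imshow(depth)                             # higher values = closer (relative inverse depth)
plt.show()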

Now you might be wondering: where is the model going to be saved?

The torch.hub.load() function downloads the selected MiDaS model and saves it in the default Torch Hub cache directory, which is ~/.cache/torch/hub/checkpoints/ by default.

When you execute the code, the model weights are downloaded into a directory like the one above on your system. This is where Torch Hub stores pre-trained models downloaded from repositories.

If you want to explicitly specify the directory where the model will be saved, you can use the torch.hub.set_dir() function before loading the model:

torch.hub.set_dir('your/desired/directory')
midas = torch.hub.load('intel-isl/MiDaS', 'MiDaS_small')

Recent Work on Monocular Depth Estimation

Recently, MiDaS 3.1 was released. It is a state-of-the-art monocular depth estimation model developed by Intel Labs: a deep learning model that can estimate the depth of each pixel in an input image from a single camera. MiDaS 3.1 is significantly more accurate than previous versions of MiDaS, and it is also more efficient, which makes it a versatile tool for a wide range of applications, including robotics, augmented reality (AR), virtual reality (VR), and computer vision.

  • High accuracy: MiDaS 3.1 is one of the most accurate monocular depth estimation models available. It is up to 28% more accurate than MiDaS 3.0.
  • Efficiency: MiDaS 3.1 is more efficient than previous versions of MiDaS. It can run at up to 30 frames per second (FPS) on a standard GPU.
  • Supports a wide range of image sizes: MiDaS 3.1 can handle images of any size. This makes it a versatile tool for a wide range of applications.
  • Improved handling of occlusions: MiDaS 3.1 is better at handling occlusions than previous versions of MiDaS. This means that it can more accurately estimate the depth of objects that are partially obscured by other objects.
  • Reduced sensitivity to noise: MiDaS 3.1 is less sensitive to noise than previous versions of MiDaS. This makes it more robust to real-world conditions.

For a detailed analysis of the MiDaS 3.1 model along with working code, refer to the official repository: https://github.com/isl-org/MiDaS
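
Loading a MiDaS 3.1 backbone follows the same Torch Hub pattern as the code above; only the entry-point name changes. Treat the identifier below as an example: the authoritative list of model names and their matching transforms lives in that repository's hubconf.

import torch

# 'DPT_BEiT_L_512' is one of the MiDaS 3.1 entry points listed in the repo's hubconf;
# check https://github.com/isl-org/MiDaS for the current names and the matching transform.
midas_31 = torch.hub.load('intel-isl/MiDaS', 'DPT_BEiT_L_512')
midas_31.eval()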

Now we will compare MiDaS 2.0, MiDaS 3.0, and MiDaS 3.1.

The datasets on which each of these models was trained:

MiDaS 2.0:

  • NYU Depth V2: This dataset contains 1449 training images and 400 testing images of indoor scenes with ground-truth depth maps.
  • KITTI: This dataset contains 41,600 training images and 2,900 testing images of outdoor scenes with ground-truth depth maps.

MiDaS 3.0:

  1. ReDWeb: This dataset contains 5,886 training images and 600 testing images of indoor scenes with ground-truth depth maps.
  2. DIML: This dataset contains 102,400 training images and 5,000 testing images of indoor scenes with ground-truth depth maps.
  3. Movies: This dataset contains 139,000 training images of outdoor scenes from various movies.
  4. MegaDepth: This dataset contains 387,085 training images and 10,000 testing images of outdoor scenes with ground-truth depth maps.
  5. WSVD: This dataset contains 1,253 training images and 100 testing images of outdoor scenes with ground-truth depth maps.

MiDaS 3.1:

The 12 datasets used to train MiDaS 3.1 are:

  1. ReDWeb: This dataset, consisting of 5,886 training and 600 testing images, captures indoor scenes with ground-truth depth maps. It provides a diverse range of indoor environments, including living rooms, bedrooms, kitchens, and offices.
  2. DIML: This extensive dataset comprises 102,400 training and 5,000 testing images of indoor scenes with ground-truth depth maps. It encompasses a variety of indoor environments, such as offices, classrooms, and hallways.
  3. Movies: This dataset features 139,000 training images extracted from various movies, showcasing outdoor scenes. It covers a broad spectrum of environments, including streets, parks, forests, and mountains.
  4. MegaDepth: This dataset, encompassing 387,085 training and 10,000 testing images, captures outdoor scenes with ground-truth depth maps. It provides a comprehensive view of diverse outdoor environments, including urban areas, rural areas, and highways.
  5. WSVD: This dataset, composed of 1,253 training and 100 testing images, features outdoor scenes with ground-truth depth maps. It captures a variety of outdoor environments, including urban areas, rural areas, and highways.
  6. TartanAir: This dataset, comprising 22,000 training images, captures aerial scenes with ground-truth depth maps. It provides a bird’s-eye view of the environment, offering a unique perspective for depth estimation.
  7. HRWSI: This dataset, consisting of 454 training and 60 testing images, features high-resolution stereo images. It provides high-quality depth information for both indoor and outdoor scenes.
  8. ApolloScape: This extensive dataset, encompassing 200,000 training and 10,000 testing images, captures urban driving scenes with ground-truth depth maps. It provides a realistic representation of urban environments, including streets, intersections, and highways.
  9. BlendedMVS: This dataset, composed of 120,000 training images, features mixed real and synthetic scenes with ground-truth depth maps. It challenges the model’s ability to handle real-world artifacts and occlusions, enhancing its robustness.
  10. IRS: This dataset, encompassing 4,480 training images, captures indoor scenes using a calibrated IR stereo camera. It provides high-quality depth information for indoor environments, particularly in low-light conditions.
  11. KITTI: This dataset, consisting of 41,600 training and 2,900 testing images, features outdoor scenes with ground-truth depth maps. It captures a variety of outdoor environments, including streets, highways, and forests.
  12. NYU Depth V2: This dataset, composed of 1449 training and 400 testing images, captures indoor scenes with ground-truth depth maps. It provides a diverse range of indoor environments, including living rooms, bedrooms, kitchens, and offices.
Fig: Input Image (left) and MiDaS 2.0 Output (right)
Fig: MiDaS 3.0 Output (left) and MiDaS 3.1 Output (right)
Fig: Comparison of the MiDaS Models

As the comparison shows, MiDaS 3.1 is both the most accurate and the most efficient of the three, MiDaS 3.0 is a compromise between accuracy and efficiency, and MiDaS 2.0 trails on both counts.

  • Accuracy: MiDaS 3.1 is up to 28% more accurate than MiDaS 3.0, and up to 50% more accurate than MiDaS 2.0. This means that MiDaS 3.1 is better at estimating the depth of objects in images.
  • Efficiency: MiDaS 3.1 is up to 2x faster than MiDaS 3.0, and up to 4x faster than MiDaS 2.0. This means that MiDaS 3.1 can run at a higher frame rate, which is important for real-time applications.
  • Performance: MiDaS 3.1 is the best-performing model overall, as it is both accurate and efficient. MiDaS 3.0 is a good compromise between accuracy and efficiency, while MiDaS 2.0 is the least accurate and least efficient model.
Fig: MiDaS 2.0 CPU and GPU Utilization
Fig: MiDaS 3.0 CPU and GPU Utilization
Fig: MiDaS 3.1 CPU and GPU Utilization
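
To get a rough feel for these speed differences on your own hardware, the sketch below times a few forward passes per model variant. The input resolutions are approximate defaults for each variant, and the absolute numbers will depend on your CPU/GPU.

import time
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
variants = {'MiDaS_small': 256, 'DPT_Large': 384}   # model name -> rough input size

for name, size in variants.items():
    model = torch.hub.load('intel-isl/MiDaS', name).to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    with torch.no_grad():
        for _ in range(3):                          # warm-up iterations
            model(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(20):
            model(x)
        if device.type == 'cuda':
            torch.cuda.synchronize()
    print(f'{name}: {(time.time() - start) / 20 * 1000:.1f} ms per frame')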

System Architecture

Fig: System Architecture

  • Input Image: the original image used for computer vision or image processing.
  • Layout Segmentation: dividing the image into regions based on the scene layout.
  • Object Segmentation: identifying and isolating individual objects in the image.
  • Reliability of Layout: the accuracy of the layout segmentation.
  • Reliability of Segmentation: the accuracy of the object isolation.
  • Distance Transform: computes, for each pixel, its distance from an object or feature.
  • Output: the final result of the image processing or computer vision pipeline.
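
As an illustration of the distance-transform step, the generic OpenCV sketch below (on a hypothetical binary mask, not the actual pipeline code) assigns each foreground pixel its distance to the nearest background pixel:

import cv2
import numpy as np

# Hypothetical binary segmentation mask: 1 = object, 0 = background
mask = np.zeros((200, 200), dtype=np.uint8)
cv2.circle(mask, (100, 100), 60, 1, thickness=-1)

# Each pixel gets its Euclidean distance to the nearest zero (background) pixel
dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
dist_norm = cv2.normalize(dist, None, 0, 1.0, cv2.NORM_MINMAX)  # for visualization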

Application and Use Cases

After understanding the concept of monocular depth estimation, it is important to understand where such models can be used:

Autonomous Driving: Monocular depth estimation is crucial for autonomous vehicles to understand the depth and distance of objects in the environment. It helps in obstacle detection, path planning, and safe navigation.

Augmented Reality (AR) and Virtual Reality (VR): Monocular depth estimation enhances the immersive experience in AR and VR applications by providing depth information, enabling realistic object placement and interaction with the real world.

Fig: AR/VR Interaction

3D Scene Reconstruction: Monocular depth estimation can be used to create 3D models of scenes or objects from a single image, which is valuable in applications like architecture, gaming, and virtual tours.

Fig: 3D Scene Reconstruction

Object Detection and Tracking: Monocular depth information aids in accurate object detection and tracking in computer vision applications, such as surveillance, robotics, and security systems.
Image and Video Enhancement: Monocular depth estimation can be used to improve image and video quality by applying depth-based post-processing techniques, like depth-aware filters.
Medical Imaging: In the medical field, it can assist in applications like organ segmentation, tumor detection, and 3D reconstruction for surgery planning.

Fig: Use of MDE in the Medical Field

Drones and Aerial Photography: Monocular depth estimation can help drones and aerial vehicles to navigate, avoid obstacles, and capture aerial images with depth information.

Fig: Drone Surveying using MDE

Gesture and Pose Recognition: It’s used in gesture recognition systems to understand the relative positions of body parts and in pose estimation applications.
Artificial Intelligence and Robotics: Monocular depth information aids robots in object manipulation, navigation, and interaction with the environment.
Virtual Try-On and Fashion Retail: It enables virtual try-on of clothing and accessories by estimating the depth of the person and the items being tried.

Fig: Virtual Try-On using MDE

Video Games: Monocular depth estimation enhances the realism of video games by providing depth cues for rendering scenes and objects.

Fig: Video Games commonly use MDE

Future Work

  • Improved accuracy and robustness: Researchers are continuing to develop new algorithms and techniques to improve the accuracy and robustness of monocular depth estimation models. This includes exploring new ways to incorporate additional data sources, such as semantic segmentation and motion information, as well as developing more efficient and scalable algorithms for training and running these models.
  • Multi-task learning: Monocular depth estimation can be combined with other tasks, such as semantic segmentation and object detection, to create more powerful and versatile models. This is known as multi-task learning. Researchers are exploring new ways to combine these tasks in order to improve the performance of each individual task.
  • Domain adaptation: Monocular depth estimation models are often trained on specific datasets of images. This can make it difficult to apply these models to new domains, such as underwater environments or medical imaging. Researchers are working on developing methods for domain adaptation that can allow monocular depth estimation models to generalize to new domains without the need for retraining.

Conclusion

In this project, we explored the fascinating realm of monocular depth estimation, focusing on the MiDaS model. The project's objectives included developing accurate depth estimation software, enabling real-time processing, and providing a user-friendly interface.

The project successfully implemented the MiDaS model, allowing accurate monocular depth estimation from single input images. Real-time processing was achieved, making the software suitable for applications that require immediate depth information, and the user-friendly interface simplifies the use of the depth estimation capabilities. The project's outcomes provide valuable insights into monocular depth estimation, benefiting applications such as autonomous driving, robotics, augmented reality, and 3D reconstruction. The implementation and integration of the MiDaS model demonstrate its capabilities and potential across a range of domains.

Some brilliant articles and research papers I found on this topic, which will help you expand your knowledge of MDE:

1. R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*. [Online]. Available: https://arxiv.org/pdf/1907.01341v3

2. R. Birkl, D. Wofk, and M. Muller, “MiDaS v3.1 — A Model Zoo for Robust Monocular Relative Depth Estimation,” Intel Labs. [Online]. Available: https://arxiv.org/pdf/2307.14460

3. R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer,” GitHub repository. [Online]. Available: https://github.com/isl-org/MiDaS

4. M. Poggi, F. Tosi, F. Aleotti, and S. Mattoccia, “Real-Time Self-Supervised Monocular Depth Estimation Without GPU,” *IEEE*. [Online]. Available: https://ieeexplore.ieee.org/document/9733979

5. X. Ruan, W. Yan, J. Huang, P. Guo, and W. Guo, “Monocular Depth Estimation Based on Deep Learning: A Survey,” *IEEE*. [Online]. Available: https://ieeexplore.ieee.org/document/9327548
