
MambaDepth: Enhancing Long-range Dependency for Self-Supervised Fine-Structured Monocular Depth Estimation

Ionut GRIGORE, Călin-Adrian POPA
Abstract

In the field of self-supervised depth estimation, Convolutional Neural Networks (CNNs) and Transformers have traditionally been dominant. However, both architectures struggle to handle long-range dependencies efficiently, due to their local focus or computational demands. To overcome this limitation, we present MambaDepth, a versatile network tailored for self-supervised depth estimation. Drawing on the strengths of the Mamba architecture, renowned for its handling of long sequences and its efficient capture of global context through a State Space Model (SSM), MambaDepth combines the U-Net's effectiveness in self-supervised depth estimation with the advanced capabilities of Mamba. MambaDepth is structured around a purely Mamba-based encoder-decoder framework, incorporating skip connections to maintain spatial information at various levels of the network. This configuration promotes an extensive feature learning process, enabling the capture of fine details and broader contexts within depth maps. Furthermore, we have developed a novel integration technique within the Mamba blocks to facilitate uninterrupted connectivity and information flow between the encoder and decoder components, thereby improving depth accuracy. Comprehensive testing on the established KITTI dataset demonstrates MambaDepth's superiority over leading CNN- and Transformer-based models in the self-supervised depth estimation task, achieving state-of-the-art performance. Moreover, MambaDepth exhibits superior generalization on other datasets such as Make3D and Cityscapes. MambaDepth's performance heralds a new era in effective long-range dependency modeling for self-supervised depth estimation. Code is available at https://github.com/ionut-grigore99/MambaDepth.

I Introduction

Accurate depth estimation from a single image is an active research field that helps computers reconstruct and understand real scenes. It also has a wide range of applications in diverse fields such as autonomous vehicles, robotics, and augmented reality. While supervised monocular depth estimation has been successful, it suffers from the high cost of acquiring ground truth. Furthermore, supervised depth estimators often face optimization challenges under sparse supervision and exhibit limited adaptability to new, unencountered scenarios.

Recently, self-supervised approaches have become increasingly prominent. Current strategies primarily focus on utilizing self-distillation techniques [45], incorporating depth hints [59], and employing multi-frame inference [60, 11]. Despite these advancements, a common shortfall is their inability to capture detailed scene intricacies, as illustrated in Figure 1. The challenge lies in effectively and efficiently learning these fine-grained structural details within a self-supervised framework.

CNNs [36] and Transformers [56] are two significant architectures in the realm of self-supervised depth estimation. CNNs, like Monodepth2 [17] and MiDaS [49], excel at hierarchical feature extraction and are more parameter-efficient than traditional fully connected networks. Their weight-sharing structure is key to identifying translational invariances and local patterns. Conversely, Transformers, initially developed for natural language processing, have adapted well to image processing tasks. Examples include Vision Transformer (ViT) [8] in image recognition and SwinTransformer [40] as a versatile vision task backbone. Unlike CNNs, Transformers process images not as spatial hierarchies but as sequences of patches, enhancing their ability to capture global information. This distinction has led to the emergence of hybrid architectures combining CNNs and Transformers, like Depthformer [38], TransDepth [61], and DPT [48].

Figure 1: Our method’s typical predictions on images from the KITTI dataset exhibit superior performance when compared to the classical Monodepth2 [17] and the contemporary attempts to use Transformers [71] or self-attention mechanism [72] in self-supervised monocular depth estimation. Notably, our approach excels in recovering intricate scene details.

Transformers, while adept at managing long-range dependencies, pose a significant computational load due to the self-attention mechanism's quadratic scaling with input size. This is particularly challenging for high-resolution images such as those used for depth estimation. To address this, state space sequence models (SSMs) [21, 25], especially structured state space sequence models (S4) [24], have shown promise. They offer an efficient and effective approach to deep network construction, as seen in Mamba [22], which optimizes S4 with a selective mechanism and hardware-aware design. These models have shown potential in language, genomics, and also in vision tasks such as image [43] and video classification [29]. Given that image patches and features can be treated as sequences [40, 8], this encourages the exploration of SSMs, particularly Mamba blocks, to enhance the U-Net's long-range modeling capabilities.

Our main contributions are as follows:

  • We propose MambaDepth, a versatile network designed for self-supervised depth estimation, built on a novel SSM-based structure that effectively captures both localized details and extensive dependencies within images, and thus recovers fine-grained scene geometry from a single image. To the best of our knowledge, this is the first time SSMs are used for self-supervised depth estimation. It distinguishes itself from conventional Transformer-based models by offering linear scaling with feature size, avoiding the Transformers' usual quadratic complexity. Additionally, we have developed a novel integration technique for Mamba blocks that ensures seamless connectivity and information flow between the encoder and decoder components, thereby enhancing depth accuracy.

  • Our extensive evaluations across the KITTI dataset show MambaDepth’s exceptional performance, significantly outperforming Transformer-based networks and existing self-supervised alternatives in accuracy and efficiency. Moreover, MambaDepth’s enhanced generalization is showcased through its application of a KITTI pre-trained model to diverse datasets, including successful zero-shot transfer to Make3D and Cityscapes.

  • This breakthrough sets the stage for future network designs that efficiently and effectively handle long-range dependencies in self-supervised depth estimation.

II Related work

II-A Supervised Depth Estimation

Eigen and colleagues [10] were pioneers in adopting a learning-based strategy, employing a multiscale convolutional neural network combined with a scale-invariant loss function to estimate depth from a single image. This groundbreaking approach has since inspired a plethora of subsequent methodologies. Broadly, these techniques fall into two main categories: one views depth estimation as a problem of pixel-wise regression, as seen in works like [10], [28], [48], and [66]. The other approach treats it as a pixel-wise classification challenge, as demonstrated in studies [12] and [7]. While regression-based methods are capable of predicting continuous depth values, they often present optimization challenges. On the other hand, classification-based methods, though simpler to optimize, are limited to predicting discrete depth values. In an innovative attempt to harness the advantages of both regression and classification, certain studies, notably those referenced as [1, 31], have redefined depth estimation as a dual task involving both classification and regression at the pixel level. This method involves initially regressing a series of depth bins followed by a pixel-wise classification, where each pixel is assigned to its respective bin. The ultimate depth value is then derived as a linear amalgamation of the centers of these bins, with weights given by their respective probabilities. This hybrid technique has shown significant enhancements in terms of accuracy.

Figure 2: Overview of our self-supervised framework. Our proposed MambaDepth adopts a U-Net architecture, leveraging MambaDepth blocks in the encoder to obtain low-resolution feature maps of the current frame $I_t$. Subsequently, the low-resolution feature maps traverse successive MambaDepth blocks in the decoder, together with skip connections, in order to obtain disparities after applying a final Sigmoid layer. The predicted disparities are then upsampled at various scales to match the original input resolutions. Additionally, a standard pose network takes the temporally adjacent frames $I_t$ and $I_{t-1}$ as input, yielding the relative pose $T_{t-1\rightarrow t}$ as output. The camera pose is required solely during training for conducting differentiable warping. In line with numerous prior studies, we employ pixels from frame $I_{t-1}$ to reconstruct frame $I_t$ using the depth map $D_t$ and relative pose $T_{t-1\rightarrow t}$ through a differentiable warping process [30]. The loss function is formulated based on the differences between the warped image $I_{t-1\rightarrow t}$ and the source image $I_t$.

II-B Self-supervised Depth Estimation

In scenarios lacking direct ground truth data, self-supervised models are often developed by leveraging either the temporal consistency found in monocular video sequences, as explored in studies [17, 67], or the left-right consistency observed in stereo image pairs, a concept investigated in references [46, 14, 16].

Monocular Training Approach: This method derives supervision from the congruence between a synthesized scene view from a reference frame and the actual view from a source frame. A notable example, the SfMLearner [46], synchronizes the training of a DepthNet and a separate PoseNet using a photometric loss function. Building on this foundational approach, numerous enhancements have been proposed. These include robust image-level reconstruction losses [53, 20], feature-level reconstruction losses [53, 64], incorporation of auxiliary information during training [34, 59], strategies to address the dynamic objects that disrupt static scene assumptions [17, 20, 50, 57, 5, 2, 3, 33, 37, 63], and additional constraints [62, 63, 50, 4, 18, 26, 69].

Stereo Training Method: Here, synchronized stereo image pairs are used, focusing on predicting a disparity map [52], which is effectively the inverse of a depth map. With known relative camera poses, the model’s task is simplified to disparity map prediction. Garg et al. [14] pioneered this with a self-supervised monocular depth estimator, applying a photometric consistency loss between stereo pairs. Subsequent improvements include implementing left-right consistency [16] and temporal consistency in videos [64]. Garg et al. [13] further refined this by enabling the prediction of continuous disparity values. Stereo-based methods have evolved to include semi-supervised data [35, 41], auxiliary information usage [59], exponential probability volumes [19], and self-distillation techniques [47, 27, 45]. Stereo views offer an ideal reference for supervision and can also be instrumental in deriving absolute depth scales.

However, existing self-supervised methods still struggle with producing high-fidelity depth maps. Current techniques primarily rely on immediate visual features or utilize Transformer [8] enhanced high-level visual representations, often neglecting the critical role of pixel-level geometric cues that could significantly enhance model performance and generalization abilities.

II-C State Space Models

State Space Sequence Models (SSMs) [21] represent a category of systems that transform a one-dimensional function or sequence $u(t)$ into $y(t)$. They are described by the following linear Ordinary Differential Equation (ODE):

$$x'(t) = Ax(t) + Bu(t), \qquad (1)$$
$$y(t) = Cx(t) + Du(t). \qquad (2)$$

In these equations, $A \in \mathbb{R}^{N\times N}$ is the state matrix, $B, C \in \mathbb{R}^{N}$ are the input and output parameters, $D \in \mathbb{R}$ is a skip connection, and $x(t) \in \mathbb{R}^{N}$ is the implicit latent state. SSMs possess several advantageous properties, such as linear complexity per time step and the ability to be computed in parallel, which aid efficient training. Yet, standard SSMs generally require more memory than equivalent CNNs and face issues like vanishing gradients during training, limiting their widespread use in sequence modeling.
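For intuition, the continuous system in (1)-(2) is typically discretized (for example with a zero-order hold) and unrolled as a linear recurrence. The snippet below is a minimal NumPy sketch of that recurrence with illustrative variable names; it is not the optimized S4/Mamba implementation.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, D, u):
    """Run a discretized SSM: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k + D u_k.

    A_bar: (N, N) discretized state matrix
    B_bar: (N,)   discretized input matrix
    C:     (N,)   output matrix
    D:     scalar skip connection
    u:     (L,)   one-dimensional input sequence
    """
    N = A_bar.shape[0]
    x = np.zeros(N)
    y = np.empty_like(u)
    for k, u_k in enumerate(u):          # linear in the sequence length L
        x = A_bar @ x + B_bar * u_k      # state update
        y[k] = C @ x + D * u_k           # readout with skip connection
    return y

# toy usage
rng = np.random.default_rng(0)
N, L = 16, 100
y = ssm_scan(0.9 * np.eye(N), rng.normal(size=N),
             rng.normal(size=N), 1.0, rng.normal(size=L))
```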

The evolution of SSMs led to the creation of Structured State Space Sequence Models (S4) [24], which notably enhance the basic SSM framework. S4 achieves this by applying structured designs to the state matrix $A$ and incorporating an efficient algorithm. The state matrix in S4 is specifically developed and initialized using the High-Order Polynomial Projection Operator (HIPPO) [23], facilitating the construction of deep sequence models that are both rich in capability and adept at long-range reasoning. Remarkably, S4 has outperformed Transformers [56] on the demanding Long Range Arena benchmark [54].

Mamba [22] represents a further advancement in SSMs, especially in discrete data modeling, such as text and genomic sequences. Mamba introduces two significant enhancements. Firstly, it incorporates an input-specific selection mechanism, differing from traditional, invariant SSMs. This mechanism filters information efficiently by customizing SSM parameters based on input data. Secondly, Mamba employs a hardware-optimized algorithm, which scales linearly with sequence length and uses a scanning process for recurrent computation, enhancing speed on contemporary hardware. Mamba’s architecture, which combines SSM blocks with linear layers, is notably more streamlined. It has achieved top-tier results in various long-sequence fields, including language and genomics, demonstrating considerable computational efficiency in both training and inference phases.
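The selection mechanism can be pictured as computing some SSM parameters per token from the input itself. The sketch below is our own toy illustration of this idea, with hypothetical layer names and sizes; it does not reproduce Mamba's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMParams(nn.Module):
    """Toy illustration: derive per-token SSM parameters from the input itself."""
    def __init__(self, d_model, d_state):
        super().__init__()
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, 1)

    def forward(self, u):                       # u: (batch, length, d_model)
        B = self.to_B(u)                        # input-dependent input matrix
        C = self.to_C(u)                        # input-dependent output matrix
        delta = F.softplus(self.to_delta(u))    # positive, per-token step size
        return B, C, delta
```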

III Method

Figure 3: Overview of MambaDepth architecture. The MambaDepth structure includes an encoder, bottleneck, a decoder, and skip connections. Each of these components – the encoder, bottleneck, and decoder – is built using the MD block.

III-A Self-supervised framework

In this section, we describe the framework of our model and how we provide the supervisory signal during training. Fundamentally, our method is a form of Structure from Motion (SfM), where a monocular camera moves within a rigid environment to provide multiple views of the scene. Our framework is built upon Monodepth2 [17].

Let $I_t \in \mathbb{R}^{H\times W\times 3}$, $t \in \{-1, 0, 1\}$, be a frame in a monocular video sequence captured by a moving camera, where $t$ is the frame time index. Similarly, let $D_t \in \mathbb{R}^{H\times W}$ denote the depth map corresponding to image $I_t$. The camera pose change from time $0$ to time $t$, $t \in \{-1, 1\}$, is encoded by the $3\times 3$ rotation matrix $R_t$ and the translation vector $t_t$. We thus obtain the $4\times 4$ camera transformation matrix:

$$M_t = \begin{bmatrix} R_t & t_t \\ 0 & 1 \end{bmatrix}. \qquad (3)$$
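Assembling the $4\times 4$ transform of (3) from a predicted rotation and translation is straightforward; the following sketch (illustrative only, using batched PyTorch tensors) shows one way to do it.

```python
import torch

def make_transformation(R, t):
    """Build a batch of 4x4 camera transforms from rotations R (B,3,3) and translations t (B,3)."""
    B = R.shape[0]
    M = torch.eye(4, device=R.device).unsqueeze(0).repeat(B, 1, 1)
    M[:, :3, :3] = R            # rotation block
    M[:, :3, 3] = t             # translation column
    return M                    # last row stays [0, 0, 0, 1]
```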

Our aim is to train two networks to simultaneously estimate the pose of the camera and the structure of the scene, respectively:

$$M = \theta_{pose}(I_t), \qquad (4)$$
$$D = \theta_{depth}(I_t). \qquad (5)$$

Self-supervised depth prediction reformulates the learning task as a novel view-synthesis problem. Specifically, during training, we let the coupled networks synthesize a photo-consistent appearance of the target frame from another viewpoint, namely the source frame. The depth map is treated as an intermediate variable that constrains the network to complete the image synthesis task.

Let $(u, v) \in \mathbb{R}^2$ be the calibrated coordinates of a pixel in image $I_0$, where the origin $(0, 0)$ is the top-left corner of the image. In the imaging process, a 3D point $(X, Y, Z) \in \mathbb{R}^3$ projects onto $(u, v)$ through a perspective projection operator.

Suppose that the transformation matrix $M_t$ encodes the pose change of the camera from time $0$ to time $t$ and that (6) is the perspective projection operator:

$$\pi(X, Y, Z) = \left(f_x\frac{X}{Z} + c_x,\; f_y\frac{Y}{Z} + c_y\right) = (u, v), \qquad (6)$$

where $(f_x, f_y, c_x, c_y)$ are the camera intrinsic parameters. Therefore, given a depth map $D(u, v)$, a 2D image point $(u, v)$ backprojects to a 3D point $(X, Y, Z)$ through the backprojection operator:

$$\pi^{-1}(u, v, D(u, v)) = D(u, v)\left(\frac{u - c_x}{f_x},\; \frac{v - c_y}{f_y},\; 1\right) = (X, Y, Z). \qquad (7)$$

Then the corresponding pixels in image $I_t$ can be computed as:

$$(u', v') = \pi(M_t\,\pi^{-1}(u, v, D(u, v))) = g(u, v \mid D(u, v), M_t). \qquad (8)$$

We project the pixels of an image to form a novel synthetic view using (8). However, the projected coordinates $(u', v')$ are continuous values. To obtain $I^s(u, v)$ we include a differentiable bilinear sampling mechanism, as proposed in spatial transformer networks [30]. We can now linearly interpolate the values of the 4-pixel neighbors (top-left, top-right, bottom-left, bottom-right) of $I(u', v')$ to give the RGB intensities as follows:

$$I^s(u, v) = \sum_{u}\sum_{v} w^{uv}\, I(u', v'), \qquad (9)$$

where $w^{uv}$ is linearly proportional to the spatial proximity between $(u, v)$ and $(u', v')$, and $\sum_{u,v} w^{uv} = 1$.
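A condensed PyTorch sketch of the view-synthesis step described by (6)-(9) is given below: it backprojects pixels with the predicted depth, applies the relative pose, reprojects, and samples the source image with differentiable bilinear interpolation (F.grid_sample plays the role of (9)). Tensor shapes, helper names, and the pose convention (target-to-source) are our assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_src, D_tgt, M, K):
    """Synthesize the target view from a source image.

    I_src: (B, 3, H, W) source image (e.g. I_{t-1})
    D_tgt: (B, 1, H, W) depth predicted for the target frame
    M:     (B, 4, 4)    relative pose mapping target camera coordinates to the source frame
    K:     (B, 3, 3)    camera intrinsics
    """
    B, _, H, W = I_src.shape
    device = I_src.device

    # pixel grid in homogeneous coordinates: (B, 3, H*W)
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # backprojection (7): X = D * K^{-1} [u, v, 1]^T
    cam = torch.linalg.inv(K) @ pix * D_tgt.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)

    # rigid transform followed by projection, as in (6) and (8)
    cam_src = (M @ cam_h)[:, :3]
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)

    # normalize to [-1, 1] and bilinearly sample the source image, as in (9)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_src, grid, padding_mode="border", align_corners=True)
```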

Figure 4: The scan expanding and scan merging operations in SS2D. In the SS2D method, input patches follow four distinct scanning paths. Each sequence is then independently processed by separate S6 blocks. Finally, the results are combined to create a 2D feature map, which serves as the final output.

III-B MambaDepth

In this section, we elaborate on the design details of the core components of MambaDepth.

The Mamba framework has shown remarkable efficacy in processing various kinds of discrete data. However, its application to image data, especially in the field of self-supervised depth estimation, has not been fully explored. [42] proposes U-Mamba, a hybrid CNN-SSM architecture, to handle long-range dependencies in biomedical image segmentation, and [68] builds a pure SSM-based model that can be adopted as a generic vision backbone, but their efficiency is not yet fully understood at large scale. Images, fundamentally discrete samples of continuous signals, can be transformed into extended sequences. While image processing with Transformers, like ViT and SwinTransformer, has seen success, their application is limited by the significant computational demands for large images caused by the self-attention mechanism's quadratic complexity. This challenge presents an opportunity to exploit Mamba's linear scaling to bolster the U-Net's capacity for long-range dependency modeling.

In self-supervised depth estimation, Monodepth2 [17] and its derivatives, recognized for their symmetric encoder-decoder structure, are predominant. This structure is adept at extracting multi-level image features through convolutional methods. However, the design is constrained in its ability to capture long-range dependencies in images, as the convolutional kernels focus on local areas. Each convolutional layer only processes features within its limited receptive field. Although skip connections aid in merging detailed and abstract features, they primarily enhance local feature combination, not the modeling of extensive range dependencies.

MambaDepth, a novel design, integrates the strengths of Monodepth2 [17] and Mamba [22] to comprehensively understand global contexts in self-supervised depth estimation. Figure 3 showcases the structure of MambaDepth, which is distinct in its composition, featuring an embedding layer, encoder, decoder, disparity heads, and straightforward skip connections, marking a departure from the classical designs often found in prior work.

In the initial stage, the embedding layer segments the input image $x \in \mathbb{R}^{H\times W\times 3}$ into discrete $4\times 4$ patches. These patches are then projected to a predefined dimension $C$ (typically set to 96), resulting in a reshaped representation $x' \in \mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C}$. $x'$ is normalized using Layer Normalization before entering the encoder for feature extraction. The encoder itself is structured into four stages, with our newly introduced feature fusion step at the end of the first three stages to compact the spatial dimensions and increase the channel depth, employing a configuration of $[2, 2, 2, 2]$ MD blocks and channel dimensions scaling as $[C, 2C, 4C, 8C]$ across the stages.

Conversely, the decoder reverses this process over its four stages, using our introduced feature decomposition technique at the start of the final three stages to enlarge the spatial dimensions and reduce the channel count. Here, the arrangement of MD blocks is again $[2, 2, 2, 2]$, with channel dimensions inversely scaling as $[8C, 4C, 2C, C]$. The decoder culminates in 4 disparity heads that upscale the feature dimensions by a factor of four through feature decomposition, followed by a projection layer that adjusts the channel count to the target of self-supervised depth estimation; the result is processed through a convolutional layer and a Sigmoid layer to generate the final depth map.

Skip connections within this architecture are implemented via a simple addition operation, purposely designed to avoid the incorporation of extra parameters, thus maintaining the model’s efficiency and simplicity.
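The stage layout described above can be summarized as a small configuration. The sketch below is an illustrative pseudo-configuration under our reading of the text (names are ours), showing how the encoder and decoder stages and the additive skip connections line up.

```python
# Illustrative stage configuration for the MambaDepth U-Net (C = 96 by default).
C = 96
encoder_stages = [
    {"blocks": 2, "channels": 1 * C},   # H/4  x W/4
    {"blocks": 2, "channels": 2 * C},   # H/8  x W/8   (after feature fusion)
    {"blocks": 2, "channels": 4 * C},   # H/16 x W/16
    {"blocks": 2, "channels": 8 * C},   # H/32 x W/32
]
decoder_stages = [
    {"blocks": 2, "channels": 8 * C},
    {"blocks": 2, "channels": 4 * C},   # feature decomposition upsamples before this stage
    {"blocks": 2, "channels": 2 * C},
    {"blocks": 2, "channels": 1 * C},
]

# Skip connections are a plain addition, so encoder and decoder features at the
# same stage must share spatial size and channel count:
#   dec_feat = dec_feat + enc_feat
```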

III-C MD block

Figure 5: The detailed structure of the MD (MambaDepth) Block.
| Method | Train | Params | H×W | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
| MonoDepth2 [17] | M | 34M | 192×640 | 0.110 | 0.831 | 4.642 | 0.187 | 0.883 | 0.962 | 0.982 |
| CADepth-Net [72] | M | 58M | 192×640 | 0.105 | 0.769 | 4.535 | 0.181 | 0.892 | 0.964 | 0.983 |
| PackNet-SfM [76] | M | 128M | 192×640 | 0.107 | 0.785 | 4.612 | 0.185 | 0.887 | 0.962 | 0.982 |
| HR-Depth [70] | M | 14M | 192×640 | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983 |
| Lite-Mono [71] | M | 8M | 192×640 | 0.107 | 0.765 | 4.561 | 0.183 | 0.886 | 0.963 | 0.983 |
| DynamicDepth [11] | M | - | 192×640 | 0.096 | 0.720 | 4.458 | 0.175 | 0.897 | 0.964 | 0.984 |
| MonoViT [65] | M | 27M | 192×640 | 0.099 | 0.708 | 4.372 | 0.175 | 0.900 | 0.967 | 0.984 |
| MambaDepth (Ours) | M | 30M | 192×640 | 0.097 | 0.706 | 4.370 | 0.172 | 0.907 | 0.970 | 0.986 |
| MonoDepth2 [17] | MS | 34M | 320×1024 | 0.106 | 0.806 | 4.630 | 0.193 | 0.876 | 0.958 | 0.980 |
| MonoDepth2 [17] | M | 34M | 320×1024 | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 |
| HR-Depth [70] | MS | 14M | 320×1024 | 0.101 | 0.716 | 4.395 | 0.179 | 0.899 | 0.966 | 0.983 |
| HR-Depth [70] | M | 14M | 320×1024 | 0.106 | 0.755 | 4.472 | 0.181 | 0.892 | 0.966 | 0.984 |
| EPCDepth [45] | S | - | 320×1024 | 0.091 | 0.646 | 4.207 | 0.176 | 0.901 | 0.966 | 0.983 |
| Depth Hints [59] | MS | - | 320×1024 | 0.098 | 0.702 | 4.398 | 0.183 | 0.887 | 0.963 | 0.983 |
| Depth Hints [59] | S | - | 320×1024 | 0.096 | 0.710 | 4.393 | 0.185 | 0.890 | 0.962 | 0.981 |
| CADepth-Net [72] | MS | 58M | 320×1024 | 0.096 | 0.964 | 4.264 | 0.173 | 0.908 | 0.968 | 0.984 |
| CADepth-Net [72] | M | 58M | 320×1024 | 0.102 | 0.734 | 4.407 | 0.178 | 0.898 | 0.966 | 0.984 |
| MT-SfMLearner [77] | M | - | 320×1024 | 0.104 | 0.799 | 4.547 | 0.181 | 0.893 | 0.963 | 0.982 |
| Lite-Mono [71] | M | 8M | 320×1024 | 0.097 | 0.710 | 4.309 | 0.174 | 0.905 | 0.967 | 0.984 |
| MonoViT [65] | M | 27M | 320×1024 | 0.096 | 0.714 | 4.292 | 0.172 | 0.908 | 0.968 | 0.984 |
| MambaDepth (Ours) | M | 30M | 320×1024 | 0.095 | 0.634 | 3.402 | 0.169 | 0.914 | 0.968 | 0.985 |
Table I: Performance comparison on the KITTI [15] Eigen benchmark. Best results are marked in bold.
| Method | Train | Params | H×W | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
| MonoDepth2 [17] | M | 34M | 192×640 | 0.090 | 0.545 | 3.942 | 0.137 | 0.914 | 0.983 | 0.995 |
| HR-Depth [70] | M | 14M | 192×640 | 0.085 | 0.471 | 3.769 | 0.130 | 0.919 | 0.985 | 0.996 |
| PackNet-SfM [76] | M | 128M | 192×640 | 0.078 | 0.420 | 3.485 | 0.121 | 0.931 | 0.986 | 0.996 |
| CADepth-Net [72] | M | 58M | 192×640 | 0.080 | 0.450 | 3.649 | 0.124 | 0.927 | 0.986 | 0.996 |
| DIFFNet [75] | M | - | 192×640 | 0.076 | 0.412 | 3.494 | 0.119 | 0.935 | 0.988 | 0.996 |
| MonoViT [65] | M | 27M | 192×640 | 0.075 | 0.389 | 3.419 | 0.115 | 0.938 | 0.989 | 0.997 |
| MambaDepth (Ours) | M | 30M | 192×640 | 0.073 | 0.386 | 3.415 | 0.112 | 0.941 | 0.991 | 0.997 |
| MonoDepth2 [17] | MS | 34M | 320×1024 | 0.091 | 0.531 | 3.742 | 0.135 | 0.916 | 0.984 | 0.995 |
| MonoDepth2 [17] | S | 34M | 320×1024 | 0.077 | 4.455 | 3.489 | 0.119 | 0.933 | 0.988 | 0.996 |
| DepthHints [59] | S | - | 320×1024 | 0.074 | 0.364 | 3.202 | 0.114 | 0.936 | 0.989 | 0.997 |
| MambaDepth (Ours) | M | 30M | 320×1024 | 0.072 | 0.410 | 3.490 | 0.114 | 0.940 | 0.989 | 0.996 |
Table II: Performance comparison using the KITTI improved ground truth [55]. Best results are marked in bold.

At the heart of MambaDepth lies the MD (Mamba Depth) module, which is an adaptation of VMamba [39], as illustrated in Figure 5. This module begins with Layer Normalization of the input, which then bifurcates into two distinct pathways. The first pathway channels the input through a linear transformation and an activation, whereas the second pathway subjects the input to a sequence involving a linear transformation, depthwise separable convolution, and activation, before directing it to the 2D-Selective-Scan (SS2D) component for advanced feature extraction. Following this, the extracted features undergo Layer Normalization, are then merged with the output of the first pathway through element-wise multiplication, and are finally integrated using a linear transformation. This process is augmented with a residual connection, culminating in the output of the MD block. For activation, the SiLU function is used consistently throughout this study.
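A simplified PyTorch rendering of this block is given below. It is a sketch under our reading of Figure 5: the `ss2d` argument stands in for the 2D-Selective-Scan module described next, and exact dimensions, expansion ratios, and normalization placement may differ from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDBlock(nn.Module):
    """Sketch of the MD block: two pathways gated together, plus a residual connection."""
    def __init__(self, dim, ss2d):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)            # pathway 1: linear + activation
        self.in_proj = nn.Linear(dim, dim)              # pathway 2: linear ...
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # ... depthwise conv
        self.ss2d = ss2d                                # 2D-Selective-Scan module (stand-in)
        self.norm_out = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, H, W, C)
        residual = x
        x = self.norm_in(x)
        gate = F.silu(self.gate_proj(x))                # pathway 1
        y = F.silu(self.dwconv(self.in_proj(x).permute(0, 3, 1, 2)).permute(0, 2, 3, 1))
        y = self.norm_out(self.ss2d(y))                 # pathway 2 through SS2D
        return self.out_proj(gate * y) + residual       # gating, projection, residual
```

For a quick shape check, `MDBlock(96, nn.Identity())` can be applied to a `(B, H, W, 96)` tensor.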

The SS2D mechanism comprises three essential stages: an operation to expand the scan, an S6 block for processing, and a merging operation for the scans. As shown in Figure 4, the expansion phase unfolds the input image in four orientations (diagonally and anti-diagonally) into sequences. These sequences are then refined by the S6 block, a procedure that meticulously scans information from all directions to extract a comprehensive range of features. Following this, the sequences are recombined through a merging operation, ensuring the output image is resized back to the original dimensions. The innovative S6 block, evolving from Mamba and building upon the S4 structure, introduces a selective filter that dynamically adjusts to the input by fine-tuning the parameters of the State Space Model (SSM). This adjustment allows the system to selectively focus on and preserve relevant information, while discarding what is unnecessary.
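The scan-expand and scan-merge steps can be pictured as flattening the feature map along four traversal orders, running each sequence through its own S6 block, and summing the re-folded results. The sketch below is our simplification: we use row-wise and column-wise orders and their reverses as a stand-in for the orientations described above, and `s6_blocks` is a list of hypothetical sequence operators.

```python
import torch

def ss2d_sketch(x, s6_blocks):
    """Sketch of SS2D: expand into four scan orders, process each with an S6 block, merge.

    x:         (B, C, H, W) feature map
    s6_blocks: list of 4 callables mapping (B, C, L) -> (B, C, L)
    """
    B, C, H, W = x.shape
    seqs = [
        x.flatten(2),                                    # row-wise order
        x.transpose(2, 3).flatten(2),                    # column-wise order
        x.flatten(2).flip(-1),                           # reversed row-wise
        x.transpose(2, 3).flatten(2).flip(-1),           # reversed column-wise
    ]
    outs = [s6(s) for s6, s in zip(s6_blocks, seqs)]

    # undo each scan order and merge by summation back into a 2D map
    merged = outs[0].view(B, C, H, W)
    merged = merged + outs[1].view(B, C, W, H).transpose(2, 3)
    merged = merged + outs[2].flip(-1).view(B, C, H, W)
    merged = merged + outs[3].flip(-1).view(B, C, W, H).transpose(2, 3)
    return merged
```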

Figure 6: Qualitative results on the KITTI Eigen benchmark.

III-D Loss function

Objective functions. In line with the methodologies described in [16, 17], we adopt the conventional photometric loss $pe$, which is a combination of $L_1$ and SSIM losses:

$$pe(I_a, I_b) = \frac{\alpha}{2}\left(1 - SSIM(I_a, I_b)\right) + (1 - \alpha)\left\|I_a - I_b\right\|_1. \qquad (10)$$
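For reference, a minimal PyTorch version of (10) is given below; `ssim` is assumed to be an SSIM implementation returning a per-pixel similarity map, as in Monodepth2-style codebases, and is not defined here.

```python
import torch

def photometric_error(I_a, I_b, ssim, alpha=0.85):
    """Per-pixel photometric error of (10): weighted SSIM and L1 terms.

    I_a, I_b: (B, 3, H, W) images; ssim: callable returning a per-pixel SSIM map.
    """
    l1 = (I_a - I_b).abs().mean(dim=1, keepdim=True)              # L1 term
    ssim_term = (1.0 - ssim(I_a, I_b)).mean(dim=1, keepdim=True)  # SSIM dissimilarity
    return alpha / 2.0 * ssim_term + (1.0 - alpha) * l1
```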

To ensure proper depth regularization in areas lacking texture, we employ an edge-aware smooth loss, applied in the following manner:

$$L_s = \left|\partial_x d_t^{*}\right| e^{-\left|\partial_x I_t\right|} + \left|\partial_y d_t^{*}\right| e^{-\left|\partial_y I_t\right|}. \qquad (11)$$
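A corresponding sketch of the edge-aware smoothness term (11), computing the gradients with simple finite differences; `d` denotes the mean-normalized disparity written as $d_t^*$ above.

```python
import torch

def edge_aware_smoothness(d, I):
    """Edge-aware smoothness loss of (11).

    d: (B, 1, H, W) mean-normalized disparity; I: (B, 3, H, W) input image.
    """
    dd_x = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dd_y = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    dI_x = (I[:, :, :, :-1] - I[:, :, :, 1:]).abs().mean(dim=1, keepdim=True)
    dI_y = (I[:, :, :-1, :] - I[:, :, 1:, :]).abs().mean(dim=1, keepdim=True)
    return (dd_x * torch.exp(-dI_x)).mean() + (dd_y * torch.exp(-dI_y)).mean()
```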
| Method | Train | Params | H×W | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ |
| Struct2Depth 2 [74] | M | - | 416×128 | 0.145 | 1.737 | 7.280 | 0.205 | 0.813 | 0.942 | 0.976 |
| MonoDepth2 [17] | MS | 34M | 416×128 | 0.129 | 1.569 | 6.876 | 0.187 | 0.849 | 0.957 | 0.983 |
| ManyDepth [60] | MS | 37M | 416×128 | 0.114 | 1.193 | 6.223 | 0.170 | 0.875 | 0.967 | 0.989 |
| MambaDepth (Ours) | M | 30M | 416×128 | 0.112 | 1.186 | 6.226 | 0.167 | 0.876 | 0.968 | 0.990 |
Table III: Cityscapes results. Best results are marked in bold.

Masking Strategy. In real-world settings, scenarios featuring stationary cameras and moving objects can disrupt the usual assumptions of a moving camera and static environment, negatively impacting the performance of self-supervised depth estimators. Previous studies have attempted to enhance depth prediction accuracy by incorporating a motion mask, which addresses moving objects using scene-specific instance segmentation models. However, this approach limits their applicability to new, unencountered scenarios. To maintain scalability, our method eschews the use of a motion mask for handling moving objects. Instead, we adopt the auto-masking strategy outlined in [17], which filters out static pixels and areas of low texture that appear unchanged between two consecutive frames in a sequence. This binary mask $\mu$ is calculated as per (12), employing the Iverson bracket notation:

$$\mu = \Bigl[\,\min_{t'} pe(I_t, I_{t'\rightarrow t}) < \min_{t'} pe(I_t, I_{t'})\,\Bigr]. \qquad (12)$$

Final Training Loss. We formulate the final loss by combining our per-pixel smooth loss with the masked photometric losses:

$$L = \mu L_p + \lambda L_s. \qquad (13)$$
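Putting (12) and (13) together, the masked objective can be sketched as follows. `pe_maps` holds per-pixel photometric errors for the warped source frames and `pe_identity` the errors of the unwarped frames, mirroring the auto-masking of [17]; the names and the exact reduction are illustrative.

```python
import torch

def total_loss(pe_maps, pe_identity, smooth_loss, lam=1e-3):
    """Sketch of (12)-(13): auto-masked photometric loss plus weighted smoothness.

    pe_maps:     list of (B, 1, H, W) errors pe(I_t, I_{t'->t}) for each source frame t'
    pe_identity: list of (B, 1, H, W) errors pe(I_t, I_{t'}) for the unwarped frames
    """
    reproj = torch.stack(pe_maps, dim=0).min(dim=0).values        # min over t'
    identity = torch.stack(pe_identity, dim=0).min(dim=0).values
    mu = (reproj < identity).float()                              # Iverson bracket of (12)
    L_p = (mu * reproj).sum() / mu.sum().clamp(min=1.0)           # masked photometric loss
    return L_p + lam * smooth_loss
```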

IV Experiments

The effectiveness of MambaDepth is assessed on the public KITTI dataset. We measure the model's performance using several established metrics from [9].

IV-A Datasets and Experimental Protocol

The KITTI [15] dataset, known for its stereo image sequences, is widely utilized in self-supervised monocular depth estimation. We employ the Eigen split [9], comprising about 26,000 images for training and 697 for testing. We train MambaDepth from scratch on KITTI under minimal conditions: it operates solely with auto-masking [17], without additional stereo pairs or auxiliary data. For testing, we maintain a challenging scenario by using only a single frame as input, in contrast to other methods that may use multiple frames to enhance accuracy.

Cityscapes [6] dataset, noted for its complexity and abundance of moving objects, serves as a testing ground to assess the adaptability of MambaDepth. To this end, we conduct a zero-shot evaluation on Cityscapes, utilizing a model pre-trained on KITTI. It is crucial to highlight that, unlike many competing approaches, we do not employ a motion mask in our evaluation. For data preparation, we follow the same preprocessing procedures outlined in [67], which are also adopted by other baselines, converting the image sequences into triplets.

Make3D [51]. To assess the capability of MambaDepth to generalize to new, previously unseen images, the model, initially trained on the KITTI dataset, was subjected to a zero-shot evaluation using the Make3D dataset. Furthermore, supplementary visualizations of depth maps are provided.

IV-B Implementation Details

We developed our model using the PyTorch framework [44]. It was trained on eight NVIDIA Tesla V100-SXM2 GPUs with a batch size of 8. We pre-trained the architecture on the ImageNet-1k dataset, subsequently using these pre-trained weights to initialize both the encoder and decoder components of the model. In line with the approach in [17], we applied color and flip augmentations to the images during training. DepthNet and PoseNet were trained jointly using the Adam optimizer [32], with $\beta_1$ set to 0.9 and $\beta_2$ to 0.999. The learning rate starts at $10^{-4}$ and is reduced to $10^{-5}$ after 15 epochs. We set the SSIM weight to $\alpha = 0.85$ and the weight of the smooth loss term to $\lambda = 10^{-3}$.
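The optimization setup can be summarized as in the sketch below, with placeholder modules standing in for DepthNet and PoseNet; the 15-epoch learning-rate drop is implemented here with a step scheduler.

```python
import itertools
import torch
import torch.nn as nn

# placeholders standing in for the MambaDepth DepthNet and the PoseNet
depth_net, pose_net = nn.Conv2d(3, 1, 3, padding=1), nn.Conv2d(6, 6, 1)

params = itertools.chain(depth_net.parameters(), pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))

# drop the learning rate from 1e-4 to 1e-5 after 15 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(20):
    # ... one training epoch over KITTI image triplets would go here ...
    scheduler.step()
```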

Figure 7: Model parameters and performance with respect to the absolute relative difference (Abs Rel) on the KITTI Eigen test set. Our model is more efficient and accurate.

IV-C KITTI Results

We evaluate our MambaDepth using the standard KITTI Eigen split [9], which consists of 697 images paired with raw LiDAR scans. Improved ground truth labels [55] are available for 652 of these images. To address the monocular scale ambiguity of depth models trained on video sequences, we apply per-image median ground-truth scaling to the estimated depth [67].
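A compact sketch of per-image median scaling as it is commonly implemented (the prediction is rescaled by the ratio of ground-truth to predicted medians over valid pixels before the metrics are computed); the depth caps are the usual KITTI evaluation values and are assumptions here.

```python
import numpy as np

def median_scale(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Per-image median scaling before metric computation (standard practice, e.g. [67])."""
    mask = (gt > min_depth) & (gt < max_depth)           # valid LiDAR points only
    ratio = np.median(gt[mask]) / np.median(pred[mask])  # align the monocular scale
    scaled = np.clip(pred * ratio, min_depth, max_depth)
    return scaled, ratio
```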

Table I presents the results achieved by state-of-the-art self-supervised frameworks processing either low-resolution (640×192) or high-resolution (1024×320) images. MambaDepth significantly outperforms existing state-of-the-art methods across all training resolutions on all metrics, even though some of them are trained on stereo pairs or binocular videos ([59, 45]) while MambaDepth is trained entirely on monocular videos. Notably, MambaDepth substantially surpasses MonoViT [65], Lite-Mono [71], and MT-SfMLearner [77], the best contemporary attempts to use Transformers or self-attention mechanisms in self-supervised monocular depth estimation.

Table II shows the same metrics computed using the improved ground truth labels for images processed at 640×192 and 1024×320 resolution. Again, MambaDepth consistently demonstrates higher accuracy.

Figure 6 compares MambaDepth with some of its competitors, illustrating that our model achieves significantly lower RMSE. This comparison highlights MambaDepth’s superior capability in modeling long-range relationships between objects compared to existing models.

The model is also more efficient and accurate in terms of computational cost, as depicted in Figure 8, where we compare the absolute relative difference against Giga Multiply-Add Calculations per Second (GMACs) on the KITTI Eigen test set for existing state-of-the-art methods. A comparison in terms of parameters is presented in Figure 7.

Figure 8: Comparison of computational costs. We compare AbsRel against Giga Multiply-Add Calculations per Second (GMACs) on the KITTI Eigen test set. Our model is more efficient and accurate.

IV-D Cityscapes Results

To evaluate the generalization of MambaDepth, we conducted a zero-shot evaluation. For this, we utilized the model pretrained on KITTI, with images processed at 416×128 resolution. The results, summarized in Table III, indicate that MambaDepth performs exceptionally well despite being trained entirely on monocular videos, unlike most baselines in Table III, which are trained with a combination of monocular videos and stereo pairs. Notably, MambaDepth achieves a 1.75% error reduction compared to the well-known ManyDepth [60], which utilizes two frames (the previous and the current one) as input. These findings underscore the superior generalization ability of MambaDepth.

IV-E Make3D Results

To further assess the generalization capability of MambaDepth, we conducted a zero-shot evaluation on the Make3D dataset [51] using the pretrained weights from KITTI. Adhering to the evaluation protocol described in [16], we tested on a center crop with a 2:1 aspect ratio. As illustrated in Table IV and Figure 9, MambaDepth outperformed the baselines, delivering sharp depth maps with more precise scene details. These results highlight the exceptional zero-shot generalization ability of our model.

| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ |
| Zhou [67] | 0.383 | 5.321 | 10.470 | 0.478 |
| DDVO [73] | 0.387 | 4.720 | 8.090 | 0.204 |
| MonoDepth2 [17] | 0.322 | 3.589 | 7.417 | 0.163 |
| CADepth-Net [72] | 0.312 | 3.086 | 7.066 | 0.159 |
| MambaDepth (Ours) | 0.307 | 2.405 | 6.858 | 0.153 |
Table IV: Make3D results. Best results are marked in bold.
Figure 9: Qualitative Make3D results (zero-shot). Predictions by CADepth-Net [72], MonoDepth2 [17], and our MambaDepth.

IV-F Ablation study

Here, we explore the impact of weight initialization for MambaDepth on the KITTI dataset. We initialize MambaDepth with and without weights pretrained on ImageNet. The results, displayed in Table V, suggest that stronger pretrained weights markedly improve MambaDepth's subsequent effectiveness, highlighting the significant role these initial weights play.

| Weights initialization | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSElog ↓ |
| Xavier | 0.114 | 0.881 | 4.700 | 0.191 |
| ImageNet pretraining | 0.095 | 0.634 | 3.402 | 0.169 |
Table V: Ablation study on weight initialization for MambaDepth on the KITTI Eigen benchmark. Best results are marked in bold.

V Conclusions

In our study, we re-examine the challenges of self-supervised monocular depth estimation and present a novel and efficient approach, which we have named MambaDepth. This method is designed to overcome the challenges in capturing long-range dependencies, a limitation observed in the localized nature of CNNs and the computational intensity of Transformers. MambaDepth is a purely Mamba block-based U-Net-style network for self-supervised monocular depth estimation. It achieves outstanding state-of-the-art performance on the KITTI dataset. Additionally, we showcase the enhanced generalizability of our model in various settings. Our findings position MambaDepth as a leading contender for future advanced self-supervised depth estimation networks.

References

  • [1] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.
  • [2] Jiawang Bian, Zhichao Li, Naiyan Wang, Huangying Zhan, Chunhua Shen, Ming-Ming Cheng, and Ian Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in neural information processing systems, 32, 2019.
  • [3] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Unsupervised monocular depth and ego-motion learning with structure and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [4] Po-Yi Chen, Alexander H Liu, Yen-Cheng Liu, and Yu-Chiang Frank Wang. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2624–2632, 2019.
  • [5] Yuhua Chen, Cordelia Schmid, and Cristian Sminchisescu. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7063–7072, 2019.
  • [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [7] Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4738–4747, 2019.
  • [8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [9] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pages 2650–2658, 2015.
  • [10] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
  • [11] Ziyue Feng, Liang Yang, Longlong Jing, Haiyan Wang, YingLi Tian, and Bing Li. Disentangling object motion and occlusion for unsupervised multi-frame monocular depth. In European Conference on Computer Vision, pages 228–244. Springer, 2022.
  • [12] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018.
  • [13] Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Wasserstein distances for stereo disparity estimation. Advances in Neural Information Processing Systems, 33:22517–22529, 2020.
  • [14] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pages 740–756. Springer, 2016.
  • [15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  • [16] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279, 2017.
  • [17] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838, 2019.
  • [18] Juan Luis Gonzalez and Munchurl Kim. Plade-net: Towards pixel-level accuracy for self-supervised single-view depth estimation with neural positional encoding and distilled matting loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6851–6860, 2021.
  • [19] Juan Luis Gonzalez Bello and Munchurl Kim. Forget about the lidar: Self-supervised depth estimators with med probability volumes. Advances in Neural Information Processing Systems, 33:12626–12637, 2020.
  • [20] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8977–8986, 2019.
  • [21] Albert Gu. Modeling Sequences with Structured State Spaces. PhD thesis, Stanford University, 2023.
  • [22] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • [23] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
  • [24] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • [25] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
  • [26] Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, and Adrien Gaidon. Semantically-guided representation learning for self-supervised monocular depth. arXiv preprint arXiv:2002.12319, 2020.
  • [27] Xiaoyang Guo, Hongsheng Li, Shuai Yi, Jimmy Ren, and Xiaogang Wang. Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European conference on computer vision (ECCV), pages 484–500, 2018.
  • [28] Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, and Janne Heikkilä. Guiding monocular depth estimation using depth-attention volume. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, pages 581–597. Springer, 2020.
  • [29] Md Mohaiminul Islam and Gedas Bertasius. Long movie clip classification with state-space video models. In European Conference on Computer Vision, pages 87–104. Springer, 2022.
  • [30] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in neural information processing systems, 28, 2015.
  • [31] Adrian Johnston and Gustavo Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 4756–4765, 2020.
  • [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [33] Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk, and Tim Fingscheidt. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pages 582–600. Springer, 2020.
  • [34] Maria Klodt and Andrea Vedaldi. Supervising the new with the old: learning sfm from sfm. In Proceedings of the European conference on computer vision (ECCV), pages 698–713, 2018.
  • [35] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6647–6655, 2017.
  • [36] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
  • [37] Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, and Anelia Angelova. Unsupervised monocular depth learning in dynamic scenes. In Conference on Robot Learning, pages 1908–1917. PMLR, 2021.
  • [38] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. arXiv preprint arXiv:2203.14211, 2022.
  • [39] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
  • [40] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [41] Yue Luo, Jimmy Ren, Mude Lin, Jiahao Pang, Wenxiu Sun, Hongsheng Li, and Liang Lin. Single view stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 155–163, 2018.
  • [42] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
  • [43] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022.
  • [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [45] Rui Peng, Ronggang Wang, Yawen Lai, Luyang Tang, and Yangang Cai. Excavating the potential capacity of self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15560–15569, 2021.
  • [46] Sudeep Pillai, Rareş Ambruş, and Adrien Gaidon. Superdepth: Self-supervised, super-resolved monocular depth estimation. In 2019 International Conference on Robotics and Automation (ICRA), pages 9250–9256. IEEE, 2019.
  • [47] Andrea Pilzer, Stephane Lathuiliere, Nicu Sebe, and Elisa Ricci. Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9768–9777, 2019.
  • [48] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  • [49] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  • [50] Anurag Ranjan, Varun Jampani, Lukas Balles, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12240–12249, 2019.
  • [51] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Learning 3-d scene structure from a single still image. In 2007 IEEE 11th international conference on computer vision, pages 1–8. IEEE, 2007.
  • [52] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47:7–42, 2002.
  • [53] Chang Shu, Kun Yu, Zhixiang Duan, and Kuiyuan Yang. Feature-metric loss for self-supervised learning of depth and egomotion. In European Conference on Computer Vision, pages 572–588. Springer, 2020.
  • [54] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020.
  • [55] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In 2017 international conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
  • [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [57] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. Sfm-net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
  • [58] Youhong Wang, Yunji Liang, Hao Xu, Shaohui Jiao, and Hongkai Yu. Sqldepth: Generalizable self-supervised fine-structured monocular depth estimation. arXiv preprint arXiv:2309.00526, 2023.
  • [59] Jamie Watson, Michael Firman, Gabriel J Brostow, and Daniyar Turmukhambetov. Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2162–2171, 2019.
  • [60] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1164–1174, 2021.
  • [61] Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, and Elisa Ricci. Transformer-based attention networks for continuous pixel-wise prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16269–16279, 2021.
  • [62] Zhenheng Yang, Peng Wang, Wei Xu, Liang Zhao, and Ramakant Nevatia. Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [63] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1983–1992, 2018.
  • [64] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 340–349, 2018.
  • [65] Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo, Zheng Zhu, Guan Huang, Yang Tang, and Stefano Mattoccia. Monovit: Self-supervised monocular depth estimation with a vision transformer. In 2022 International Conference on 3D Vision (3DV), pages 668–678. IEEE, 2022.
  • [66] Jiawei Zhao, Ke Yan, Yifan Zhao, Xiaowei Guo, Feiyue Huang, and Jia Li. Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 163–172, 2021.
  • [67] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1851–1858, 2017.
  • [68] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
  • [69] Shengjie Zhu, Garrick Brazil, and Xiaoming Liu. The edge of depth: Explicit constraints between segmentation and depth. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13116–13125, 2020.
  • [70] Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, and Yi Yuan. Hr-depth: High resolution self-supervised monocular depth estimation. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 2294–2301, 2021.
  • [71] Ning Zhang, Francesco Nex, George Vosselman, and Norman Kerle. Lite-mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18537–18546, 2023.
  • [72] Jiaxing Yan, Hong Zhao, Penghui Bu, and YuSheng Jin. Channel-wise attention-based network for self-supervised monocular depth estimation. In 2021 International Conference on 3D Vision (3DV), pages 464–473. IEEE, 2021.
  • [73] Chaoyang Wang, José Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2022–2030, 2018.
  • [74] Vincent Casser, Soeren Pirk, Reza Mahjourian, and Anelia Angelova. Unsupervised monocular depth and ego-motion learning with structure and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [75] Hang Zhou, David Greenwood, and Sarah Taylor. Self-supervised monocular depth estimation with internal feature fusion. arXiv preprint arXiv:2110.09482, 2021.
  • [76] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, and Adrien Gaidon. Packnet-sfm: 3d packing for self-supervised monocular depth estimation. arXiv preprint arXiv:1905.02693, 5(1), 2019.
  • [77] Arnav Varma, Hemang Chawla, Bahram Zonooz, and Elahe Arani. Transformers in self-supervised monocular depth estimation with unknown camera intrinsics. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, pages 758–769. INSTICC, SciTePress, 2022.