
Learning Temporally Consistent Video Depth from Video Diffusion Priors

Jiahao Shao1∗  Yuanbo Yang1∗  Hongyu Zhou1  Youmin Zhang2,4
Yujun Shen3   Matteo Poggi2   Yiyi Liao1†
1Zhejiang University  2University of Bologna   3Ant Group   4Rock Universe
Abstract

This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy — first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen — yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/.
∗ denotes equal contribution. † corresponding author.

1 Introduction

Monocular video depth estimation stands as a foundational challenge within the realm of computer vision, boasting diverse applications across robotics, autonomous driving, animation, virtual reality, and beyond. While recent strides have notably enhanced the spatial precision of depth prediction through single-image depth models [55, 77, 17, 79, 54, 73, 36, 20, 23, 72], achieving temporal consistency over a video remains an ongoing pursuit. Temporal consistency, vital for eliminating flickering artifacts between successive frames caused by single-frame scale ambiguity, remains a core objective in the domain of video depth estimation. Existing methodologies [47, 40, 82] predominantly employ test-time training (TTT) paradigms, wherein a single-image depth model undergoes fine-tuning on the testing video, leveraging geometry constraints and pose estimation for inference. However, this approach heavily hinges on precise camera poses and lacks generalizability. Despite concerted efforts to enforce temporal consistency through learning-based models, prior methods [70, 45, 80, 68, 12] have often faltered, exhibiting inferior performance metrics. While certain techniques [69] demonstrate promising temporal coherence, their spatial accuracy remains suboptimal, failing to meet desired standards. Thus, the quest persists for a unified model capable of seamlessly marrying spatial accuracy with temporal consistency, a pursuit that promises to advance the frontiers of monocular video depth estimation.

Video generative models have witnessed significant growth recently, sharing similar targets of achieving high spatial quality with temporal consistency. Recent advancements in video generation [9, 8] have yielded models capable of producing visually realistic and temporally consistent content. These achievements often stem from the integration of temporal layers, which incorporate self-attention mechanisms and convolutional operations between frames within conventional image generation frameworks. Through this streamlined approach, these models effectively synthesize video content that exhibits both realism and temporal coherence, marking a notable progression in video generation technology.

In this work, inspired by the recent attempts leveraging the priors of image generative models for single-frame perceptual tasks [33, 16, 58, 84, 15, 36, 20, 23, 72], we investigate the potential of pre-trained video generative models to serve as foundational models for video depth estimation. Considering that fine-tuning a video diffusion model for discriminative depth estimation has not been explored in the literature, we delve into optimizing the fine-tuning protocols and inference strategies to achieve both spatial accuracy and temporal consistency. For training, we conduct thorough comparisons between various protocols, empirically finding an effective training strategy that fully exploits image and video depth datasets. In particular, we fine-tune SVD by combining image and video depth datasets, using randomly sampled clip lengths from the latter. Instead of fine-tuning the full network, we observe that sequential training of the spatial and temporal layers is more effective, where the spatial layers are first trained and kept frozen during the training of the temporal layers. For inference, we use a sliding-window strategy, using previously predicted depth frames to guide the prediction of subsequent frames, yielding consistent depth estimation over videos of unlimited length. We observe that using only a single-frame overlap between two consecutive clips yields good temporal consistency, effectively reaching a balance between efficiency and performance. Through training on synthetic image and video datasets, we conduct quantitative and qualitative evaluations on real-world benchmarks. Experimental results show that our method achieves state-of-the-art temporal consistency in video depth estimation, with spatial accuracy comparable to state-of-the-art single-image depth estimation methods. We further demonstrate that our temporally consistent depth estimation provides better support than baselines for downstream applications, including depth-conditioned video generation and novel view synthesis.

2 Related Work

2.1 Monocular Depth Estimation

Estimating depth from a single image is a ubiquitous task in computer vision, addressed through either discriminative or generative methodologies.

Discriminative depth estimation. These approaches are trained end-to-end to regress depth, according to two alternative categories: metric versus relative depth estimation. Early attempts focused on the former category [18], yet were limited to single scenarios – i.e., training specialized models for driving environments [22] or indoor scenes [60]. Among these, some frameworks used ordinal regression [19], local planar guidance [41], adaptive bins [6] or fully-connected CRFs [78]. In pursuit of achieving generalizability across different environments [55], the community recently shifted towards training models estimating relative depth, through the use of affine-invariant loss functions [55, 54] to be robust against different scale and shift factors across diverse datasets. On this track, the following efforts focused on recovering the real 3D shape of scenes [77], exploiting surface normals as supervision [75], improving high-frequency details [48, 79] or exploiting procedural data generation [17]. To combine the best of the two worlds, new attempts to learn generalizable, metric depth estimation have emerged lately [7, 43], most of them by explicitly handling camera parameters through canonical transformations [76, 29] or as direct prompts to the model [25, 52].

Generative depth estimation. Eventually, some methodologies have embraced the utilization of pre-trained generative models for monocular depth estimation. Among them, some exploited Low-Rank Adaptation (LoRA) [15] or started from self-supervised pre-training [58], while others re-purposed Latent Diffusion Models (LDM) by fine-tuning the pre-trained UNet [36] to denoise depth maps. Further advances of this latter strategy jointly tackled depth estimation and surface normal prediction [20], or exploited Flow Matching [23] for higher efficiency. Despite achieving remarkable performance and capturing intricate details in complex scenarios, these methods often struggle to maintain geometric consistency when extrapolated to 3D space and exhibit subpar temporal consistency when estimating depth across sequences of frames. To address these challenges, we take a further step forward on this track by exploiting a pre-trained video diffusion model [8] as a robust visual prior in place of a single-image one, while designing an efficient fine-tuning protocol to deal with the high complexity it demands.

2.2 Video Depth Estimation

In addition to spatial accuracy, achieving temporal consistency is a fundamental goal when predicting depth on a video. This entails eliminating flickering effects between consecutive frames, possibly caused by scale inconsistencies. Some approaches hinge on estimating the poses of every frame in the video and using them to build cost volumes [70, 45, 24] or running test-time training [47, 40, 82], with both heavily depending on the accuracy of the poses and the latter lacking generalizability as it overfits a single video. Others deploy recurrent networks [80, 64], while most recent works exploit attention mechanisms [68, 12], yet with sub-optimal results compared to state-of-the-art single-image depth predictors. Finally, NVDS [69] introduces a stabilization network to temporally refine the single-frame results of an off-the-shelf depth predictor, which can harm spatial accuracy.

2.3 Diffusion Models

Image Diffusion Models (IDMs) by Sohl-Dickstein et al. [61] conquered the main stage for image generation tasks [14, 39] at the expense of GANs. Further developments aimed at improving both generation conditioning [1, 50] and computational efficiency [57], with Latent Diffusion Models (LDMs) [57] notably emerging as a solution for the latter. Among conditioning techniques, the use of cross-attention [3, 10, 21, 26, 35, 38, 50, 51, 53] and the encoding of segmentation masks into tokens [2, 21] stand as the most popular, with additional schemes being proposed to enable the generation of visual data conditioned by diverse factors such as text, images, semantic maps, sketches, and other representations [4, 5, 30, 49, 81, 66]. Lately, LDMs have been extended for video generation [28, 8, 13, 32, 83, 71], focusing on obtaining consistent content over time thanks to the integration of temporal layers, incorporating self-attention mechanisms and convolutions between frames within conventional image generation frameworks.

While recent efforts exploited image diffusion models for single-image depth estimation [15, 58, 36, 20, 23], ours is the first effort to tame cutting-edge video diffusion models for consistent depth estimation on monocular videos.

3 Method

In this section, we introduce ChronoDepth, a consistent video depth estimator derived from a video foundation model, specifically from Stable Video Diffusion (SVD) [8].

Given a consecutive monocular video $\mathbf{x}$, our goal is to generate spatially accurate and temporally consistent video depth $\mathbf{d}$. Firstly, we reformulate this problem as a conditional denoising diffusion generation task (see Sec. 3.1). Secondly, we perform a comprehensive empirical analysis of fine-tuning protocols, aiming to discover the best practices for taming a video foundation model into a consistent depth estimator (see Sec. 3.2). Lastly, we propose an efficient and effective inference approach and demonstrate that it outperforms other inference strategies (see Sec. 3.3). Fig. 1 contains an overview of ChronoDepth.

3.1 Diffusion Formulation

In order to align with the video foundation model, SVD [8], we reformulate monocular video depth estimation as a continuous-time denoising diffusion [62, 34] generation task conditioned on the RGB video. The diffusion model consists of a stochastic forward pass that injects $\sigma^{2}$-variance Gaussian noise into the input video depth and a reverse process that removes noise with a learnable denoiser $D_{\theta}$. Following SVD, our diffusion model is defined in a latent space of lower spatial dimension for better computational efficiency, where a variational autoencoder (VAE), consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$, is used to compress the original signals. In order to repurpose the VAE of SVD, which accepts a 3-channel (RGB) input, we replicate the depth map into three channels to mimic an RGB image. Subsequently, during inference, we decode and average these three channels to obtain our predicted depth map, following [36].
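
To make the channel handling concrete, below is a minimal sketch of how a single-channel depth map could be pushed through an RGB VAE, assuming a diffusers-style AutoencoderKL interface; the latent scale factor and exact method names are assumptions, not values taken from the paper.

```python
import torch

def encode_depth(vae, depth, scale_factor=0.18215):
    """Encode a single-channel depth map with an RGB VAE (sketch).

    depth: (B, 1, H, W) tensor, already normalized to the VAE input range.
    scale_factor is the usual Stable-Diffusion-style latent scaling and is
    an assumption here.
    """
    depth_rgb = depth.repeat(1, 3, 1, 1)                 # replicate to 3 channels
    latent = vae.encode(depth_rgb).latent_dist.sample()  # diffusers-style API
    return latent * scale_factor

def decode_depth(vae, latent, scale_factor=0.18215):
    """Decode a depth latent and average the three channels into one map."""
    decoded = vae.decode(latent / scale_factor).sample   # (B, 3, H, W)
    return decoded.mean(dim=1, keepdim=True)             # (B, 1, H, W)
```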

Training. During training, the RGB video clip $\mathbf{x}\in\mathbb{R}^{T\times W\times H\times 3}$ and the corresponding depth $\mathbf{d}\in\mathbb{R}^{T\times W\times H}$ are first encoded into the latent space with the VAE encoder: $\mathbf{z}^{(\mathbf{x})}=\mathcal{E}(\mathbf{x})$, $\mathbf{z}^{(\mathbf{d})}=\mathcal{E}(\mathbf{d})$. For each training step, we sample a Gaussian noise $\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^{2}_{M}I)$ and add it to the depth latent $\mathbf{z}^{(\mathbf{d})}$ to obtain $\mathbf{z}^{(\mathbf{d})}_{M}$ as

$$\mathbf{z}^{(\mathbf{d})}_{M}=\mathbf{z}^{(\mathbf{d})}+\mathbf{n},\quad\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^{2}_{M}I) \quad (1)$$

where $\log\sigma_{M}\sim\mathcal{N}(P_{mean},P_{std}^{2})$ with $P_{mean}=0.7$ and $P_{std}=1.6$ [34]. For simplicity, we denote $\sigma_{M}$ as $\sigma$ throughout the remainder of this training part. In the reverse process, the diffusion model denoises $\mathbf{z}^{(\mathbf{d})}_{M}$ towards the clean $\mathbf{z}^{(\mathbf{d})}$ with a learnable denoiser $D_{\theta}$, which is trained via denoising score matching (DSM) [65]

$$\mathcal{L}=\mathbb{E}_{(\mathbf{z}^{(\mathbf{d})},\mathbf{z}^{(\mathbf{x})})\sim p_{data}(\mathbf{z}^{(\mathbf{d})},\mathbf{z}^{(\mathbf{x})}),\,(\sigma,\mathbf{n})\sim p(\sigma,\mathbf{n})}\left[\lambda_{\sigma}\left\|D_{\theta}(\mathbf{z}^{(\mathbf{d})}_{M};\sigma,\mathbf{z}^{(\mathbf{x})})-\mathbf{z}^{(\mathbf{d})}\right\|_{2}^{2}\right], \quad (2)$$

where the weighting function is $\lambda_{\sigma}=(1+\sigma^{2})\sigma^{-2}$. In this work, we follow the EDM preconditioning framework [34], parameterizing the learnable denoiser $D_{\theta}$ as

$$D_{\theta}(\mathbf{z}^{(\mathbf{d})}_{M};\sigma,\mathbf{z}^{(\mathbf{x})})=\frac{1}{\sigma^{2}+1}\,\mathbf{z}^{(\mathbf{d})}_{M}-\frac{\sigma}{\sqrt{\sigma^{2}+1}}\,F_{\theta}\!\left(\frac{\mathbf{z}^{(\mathbf{d})}_{M}}{\sqrt{\sigma^{2}+1}};\,0.25\log\sigma,\,\mathbf{z}^{(\mathbf{x})}\right), \quad (3)$$

where $F_{\theta}$ is the UNet to be trained in our case. The condition, the RGB video latent $\mathbf{z}^{(\mathbf{x})}$, is introduced via concatenation with the depth video latent along the feature dimension: $\mathbf{z}_{M}=\mathrm{concat}(\mathbf{z}^{(\mathbf{d})}_{M}/\sqrt{\sigma^{2}+1},\,\mathbf{z}^{(\mathbf{x})})$.
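
As a rough illustration of the training objective (Eqs. 1-3), the sketch below combines the log-normal noise-level sampling, the EDM preconditioning, the channel-wise concatenation of the RGB condition, and the weighted DSM loss. The UNet call signature `unet(latents, c_noise)` is a simplifying assumption; the real SVD UNet takes additional conditioning inputs.

```python
import torch

P_MEAN, P_STD = 0.7, 1.6  # noise-level distribution parameters from the paper

def edm_training_step(unet, z_d, z_x):
    """One DSM training step with EDM preconditioning (sketch).

    z_d: clean depth latents, shape (B, T, C, H, W)
    z_x: RGB video latents of the same shape (the condition)
    """
    b = z_d.shape[0]
    # sample log(sigma) ~ N(P_mean, P_std^2), one value per sample
    sigma = torch.exp(P_MEAN + P_STD * torch.randn(b, device=z_d.device))
    sigma = sigma.view(b, 1, 1, 1, 1)

    # forward process: z_M = z^(d) + n, n ~ N(0, sigma^2 I)          (Eq. 1)
    z_m = z_d + sigma * torch.randn_like(z_d)

    # EDM preconditioning and concatenation of the RGB condition     (Eq. 3)
    c_in = 1.0 / torch.sqrt(sigma**2 + 1.0)
    c_skip = 1.0 / (sigma**2 + 1.0)
    c_out = -sigma / torch.sqrt(sigma**2 + 1.0)
    c_noise = 0.25 * torch.log(sigma).flatten()

    unet_in = torch.cat([z_m * c_in, z_x], dim=2)   # concat on channel dim
    f_out = unet(unet_in, c_noise)                  # F_theta (assumed signature)
    denoised = c_skip * z_m + c_out * f_out         # D_theta(z_M; sigma, z^(x))

    # DSM loss with lambda_sigma = (1 + sigma^2) / sigma^2           (Eq. 2)
    weight = (sigma**2 + 1.0) / sigma**2
    return (weight * (denoised - z_d) ** 2).mean()
```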

Inference. At inference time, $\mathbf{z}^{(\mathbf{d})}_{0}$ is restored from randomly-sampled Gaussian noise $\mathbf{z}^{(\mathbf{d})}_{N}\sim\mathcal{N}(\mathbf{0},\sigma^{2}_{N}I)$, conditioned on the given RGB video, by iteratively applying the denoising process with the trained denoiser $D_{\theta}$

$$\mathbf{z}^{(\mathbf{d})}_{i}=\mathbf{z}^{(\mathbf{d})}_{i+1}+\frac{\mathbf{z}^{(\mathbf{d})}_{i+1}-D_{\theta}(\mathbf{z}^{(\mathbf{d})}_{i+1};\sigma_{i+1},\mathbf{z}^{(\mathbf{x})})}{\sigma_{i+1}}\,(\sigma_{i}-\sigma_{i+1}),\quad i<N \quad (4)$$

where $\{\sigma_{0},\dots,\sigma_{N}\}$ is the fixed variance schedule of a denoising process with $N$ steps. The final depth $\mathbf{d}_{0}$ is obtained with the decoder: $\mathbf{d}_{0}=\mathcal{D}(\mathbf{z}^{(\mathbf{d})}_{0})$.
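
Eq. 4 corresponds to a plain Euler step of the EDM sampler. A minimal sketch of the resulting sampling loop is given below, assuming a `denoiser(z, sigma, z_x)` callable that wraps $D_{\theta}$ and a precomputed decreasing schedule `sigmas` ending at 0.

```python
import torch

@torch.no_grad()
def sample_depth_latent(denoiser, z_x, sigmas):
    """Euler sampler implementing the reverse process of Eq. (4) (sketch).

    z_x:    RGB video latents used as the condition
    sigmas: decreasing noise schedule [sigma_N, ..., sigma_0] with sigma_0 = 0
    """
    z = sigmas[0] * torch.randn_like(z_x)        # z_N ~ N(0, sigma_N^2 I)
    for sigma_next, sigma_cur in zip(sigmas[1:], sigmas[:-1]):
        d = (z - denoiser(z, sigma_cur, z_x)) / sigma_cur   # update direction
        z = z + d * (sigma_next - sigma_cur)                # Euler step
    return z                                                # clean depth latent
```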

Figure 1: Training pipeline. We add an RGB video conditioning branch to a pre-trained video diffusion model, Stable Video Diffusion, and fine-tune it for consistent depth estimation. Both the RGB and depth videos are projected to a lower-dimensional latent space using a pre-trained encoder. The video depth estimator is trained via denoising score matching (DSM). Our training involves two stages: first, we train the spatial layers with single-frame depths; then, we freeze the spatial layers and train the temporal layers using clips of randomly sampled lengths. This sequential spatial-temporal fine-tuning approach yields better performance than training the full network.

3.2 Fine-Tuning Protocol

In contrast to image generative models, video generative models incorporate both spatial and temporal dimensions. The central question thereby becomes how to effectively fine-tune these pre-trained video generative models to achieve a satisfying level of spatial accuracy and temporal coherence in geometry. In response to this, we have conducted comprehensive analyses to ascertain the best practices for taming our foundational video model into a consistent depth estimator. Note that the VAE for compressing the original pixel space remains frozen.

Without vs. With Single-Frame Datasets. We first investigate the impact of training data for taming SVD into a video depth estimator. In particular, we investigate whether it is beneficial to use single-frame datasets for supervision, in addition to video depth datasets. Single-frame depth datasets are easier to obtain and are usually more diverse, and our experimental results suggest that jointly using single-frame and multi-frame depth datasets is crucial to achieving good spatial and temporal accuracy. Therefore, we keep the single-frame dataset as a part of our supervision throughout the full training process, in contrast to SVD [8], where single-view image datasets [59] are used only in the first pre-training stage. Note that the total number of training images in the video depth datasets is larger than that of the image depth dataset, yet the number of scenes is smaller, indicating that scene diversity plays an important role in depth estimation.
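
A possible way to keep single-frame supervision in the loop throughout training is to mix samples from the two sources at the batch level, e.g. as in the sketch below; the 50/50 mixing ratio and the dataset interfaces are assumptions, not values reported in the paper.

```python
import random

def sample_training_example(image_dataset, video_dataset, p_image=0.5):
    """Draw either a single-frame or a video-clip training sample (sketch).

    image_dataset: indexable collection of (rgb, depth) pairs
    video_dataset: indexable collection of (rgb_clip, depth_clip) tensors
    p_image: probability of drawing an image sample (assumed, not reported)
    """
    if random.random() < p_image:
        rgb, depth = random.choice(image_dataset)
        return rgb[None], depth[None]     # treat the image as a 1-frame clip
    return random.choice(video_dataset)
```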

Fixed vs. Randomly Sampled Clip Length. Next, we ablate whether the length of the video clips $T$ should be fixed or randomly sampled during the training process. Both options are naturally supported by design, since the spatial layers interpret the video as a batch of independent images and the temporal layers apply both convolutions and temporal attention along the time axis. This means the video clip length $T$ can either be fixed or vary during training. As we will introduce later in Section 3.3, it is a natural choice to use a fixed-length video clip for inference. This leads to the intuitive assumption that training with the same fixed length $T$ leads to better performance. Surprisingly, our experimental results demonstrate that using a uniformly sampled clip length from $[1, T_{\text{max}}]$ yields better performance. We use this random clip-length sampling strategy by default.
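
The random clip-length sampling itself is straightforward; a minimal sketch, with a hypothetical dataset interface, could look as follows.

```python
import random

T_MAX = 5  # maximum clip length used during temporal fine-tuning (from the paper)

def sample_training_clip(video_frames):
    """Sample a clip with a uniformly random length T in [1, T_MAX] (sketch).

    video_frames: list of frames from one training video (assumed interface).
    """
    t = random.randint(1, T_MAX)                     # random clip length
    start = random.randint(0, len(video_frames) - t) # random temporal offset
    return video_frames[start:start + t]
```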

Joint vs. Sequential Spatial-Temporal Fine-Tuning. The original SVD fine-tunes the full UNet $D_{\theta}$ on video datasets, meaning that both the spatial layers and the temporal layers are trained jointly. We investigate another training protocol – sequential spatial-temporal training. Specifically, we first train the spatial layers using single-frame supervision. After convergence, we keep the spatial layers frozen and fine-tune the temporal layers using clips of randomly sampled length as supervision. Surprisingly, this sequential training of spatial and temporal layers achieves good spatial accuracy while improving temporal consistency, yielding better performance than training the full network.
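
In practice, the two training stages only differ in which parameters receive gradients. The sketch below toggles trainability by filtering parameter names; the "temporal"/"time_" name filter is an assumption about how the SVD UNet labels its temporal blocks and would need to be adapted to the actual implementation.

```python
def set_trainable_layers(unet, stage):
    """Select which UNet parameters are optimized in each stage (sketch)."""
    for name, param in unet.named_parameters():
        # assumption: temporal blocks carry "temporal" or "time_" in their names
        is_temporal = "temporal" in name or "time_" in name
        if stage == "spatial":       # stage 1: train spatial layers only
            param.requires_grad = not is_temporal
        elif stage == "temporal":    # stage 2: freeze spatial, train temporal
            param.requires_grad = is_temporal

# stage 1 (single-frame supervision):       set_trainable_layers(unet, "spatial")
# stage 2 (random-length video clips):      set_trainable_layers(unet, "temporal")
```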

Figure 2: Inference pipeline. We explore two inference strategies. Option 1: Separate Inference involves dividing the videos into non-overlapping clips and predicting their depths individually. Option 2: Temporal Inpaint Inference denotes inpainting later frames $\mathbf{z}^{(\mathbf{d}_{[W:T]})}$ in a clip based on the previous frames' prediction $\mathbf{z}^{(\mathbf{d}_{[0:W]})}$. Our proposed temporal inpainting enhances temporal consistency.

3.3 Inference Strategy

Having trained our model using the aforementioned strategy, we are capable of generating depth videos with a high degree of consistency in a single inference. However, due to resource constraints, it is not possible to infer a long-horizon video in a single forward pass. With a single 48GB A6000 GPU and a resolution of 576×768, the maximum number of frames we can process is 38. As such, we have to segment the long video into several clips and infer each separately. However, when no information is shared across clips, the model may still suffer from inconsistent predictions across different clips. Therefore, it becomes highly important to minimize the gap between the inference results of different clips. We evaluate two different inference protocols to determine the best practice.

Separate Inference. Given a video sequence with a total frame number $T_{\text{seq}}$, we set the clip frame number to $T$ and cut the video into $K=\lceil T_{\text{seq}}/T\rceil$ clips (the last clip is padded to $T$ frames). We then carry out $K$ inferences and concatenate all of the results. Unlike Marigold [36], we do not run ensemble inference, as we found it ineffective when combined with EDM.
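
For separate inference, the video is simply chopped into $K=\lceil T_{\text{seq}}/T\rceil$ chunks, e.g. as in the sketch below; padding the last clip by repeating its final frame is an assumption, since the paper does not specify the padding scheme.

```python
import math

def split_into_clips(frames, clip_len):
    """Split a video into K = ceil(T_seq / clip_len) clips (sketch).

    frames: list of frames; the last clip is padded by repeating its final
    frame (assumed padding scheme).
    """
    k = math.ceil(len(frames) / clip_len)
    clips = []
    for i in range(k):
        clip = list(frames[i * clip_len:(i + 1) * clip_len])
        clip += [clip[-1]] * (clip_len - len(clip))   # pad to clip_len
        clips.append(clip)
    return clips
```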

Temporal Inpaint Inference. To exchange temporal information across clips, we introduce a novel temporal inpainting inference strategy, inspired by spatial inpainting methods [46]. We add a $W$-frame overlap between adjacent $T$-frame clips ($W<T$). For the first clip, we denoise from randomly-sampled Gaussian noise using Eq. 4. Then, for the remaining clips, we retain the last $W$ depth maps predicted from the previous inference, i.e., those overlapping with the current clip, and include them as input to the current inference. Specifically, at any step $i$ we add Gaussian noise to these $W$ depth latents to obtain $\mathbf{z}^{(\mathbf{d}_{[0:W]})}_{i}$ using Eq. 1 for the first $W$ frames of the current clip. We then use Eq. 4 for the remaining $(T-W)$ frames $\mathbf{z}^{(\mathbf{d}_{[W:T]})}_{i}$. We obtain the final $\mathbf{z}^{(\mathbf{d})}_{i}$ by concatenating these two parts along the time axis. Thus, one reverse step in our approach reads

$$\mathbf{z}^{(\mathbf{d}_{[0:W]})}_{i}=\mathbf{z}^{(\mathbf{d}_{[0:W]})}+\mathbf{n}_{i},\quad\mathbf{n}_{i}\sim\mathcal{N}(\mathbf{0},\sigma^{2}_{i}I) \quad (5)$$
$$\mathbf{z}^{(\mathbf{d}_{[W:T]})}_{i}=\left[\mathbf{z}^{(\mathbf{d})}_{i+1}+\frac{\mathbf{z}^{(\mathbf{d})}_{i+1}-D_{\theta}(\mathbf{z}^{(\mathbf{d})}_{i+1};\sigma_{i+1},\mathbf{z}^{(\mathbf{x})})}{\sigma_{i+1}}\,(\sigma_{i}-\sigma_{i+1})\right]^{[W:T]},\quad i<N \quad (6)$$
$$\mathbf{z}^{(\mathbf{d})}_{i}=\mathrm{concat}\!\left(\mathbf{z}^{(\mathbf{d}_{[0:W]})}_{i},\,\mathbf{z}^{(\mathbf{d}_{[W:T]})}_{i}\right). \quad (7)$$

Our inference strategy is depicted in Figure 2. Intuitively, there is an efficiency-consistency trade-off for this inpainting strategy: a larger overlap $W$ yields more consistent predictions but slower inference. We observe that a single-frame overlap ($W=1$) already improves consistency over separate inference without much additional computational burden.
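
The sketch below illustrates Eqs. 5-7 for a single clip: at every denoising step, the first $W$ frames are reset to the (re-noised) depth latents predicted by the previous clip, while the remaining frames follow the usual Euler update. The `denoiser(z, sigma, z_x)` callable is the same assumption as before.

```python
import torch

@torch.no_grad()
def temporal_inpaint_sample(denoiser, z_x_clip, sigmas, z_d_overlap=None, w=1):
    """Temporal-inpaint inference for one clip (Eqs. 5-7, sketch).

    z_x_clip:    RGB latents of the current T-frame clip, shape (T, C, H, W)
    z_d_overlap: clean depth latents of the first w frames, predicted by the
                 previous clip (None for the very first clip)
    """
    z = sigmas[0] * torch.randn_like(z_x_clip)
    for sigma_next, sigma_cur in zip(sigmas[1:], sigmas[:-1]):
        if z_d_overlap is not None:
            # Eq. 5: replace the overlap frames with the known depth latents,
            # re-noised to the current noise level
            z[:w] = z_d_overlap + sigma_cur * torch.randn_like(z_d_overlap)
        # Eq. 6: one Euler step over the full clip
        d = (z - denoiser(z, sigma_cur, z_x_clip)) / sigma_cur
        z = z + d * (sigma_next - sigma_cur)
    if z_d_overlap is not None:
        z[:w] = z_d_overlap          # Eq. 7 at sigma_0 = 0: keep known frames
    return z
```

For a full video, this function would be called clip by clip, passing the last $W$ predicted depth latents of each clip as `z_d_overlap` for the next one.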

4 Experiment

4.1 Implementation Details and Datasets

Implementation Details. We implement ChronoDepth using diffusers and fine-tune our video foundation model SVD, specifically the image-to-video variant, following the strategy discussed in Sec. 3.2. We disable the cross-attention conditioning of the original SVD. We use the standard EDM noise schedule and network preconditioning [34]. We use an image and video resolution of 576×768. We first pre-train our model with single-frame image inputs for 20k steps with a batch size of 8, and then fine-tune it with five-frame video clips for 18k steps with a batch size of 1. The maximum video clip length $T_{\text{max}}$ is 5 for training. For the temporal inpainting inference, we set the length of each clip $T$ to 10 and the number of overlapping frames between two adjacent clips $W$ to 1. The entire training takes about 1.5 days on a cluster of 8 Nvidia Tesla A100-80GB GPUs. We use the Adam optimizer with a learning rate of 3e-5.
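
For reference, the standard EDM schedule from Karras et al. [34] can be generated as below; the $(\sigma_{\min}, \sigma_{\max}, \rho)$ defaults are the EDM paper's values and are only an assumption here, since ChronoDepth just states that it uses the standard schedule.

```python
import torch

def edm_sigma_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Standard EDM noise schedule [34] (sketch); defaults are assumptions.

    Returns sigmas in decreasing order, with sigma_0 = 0 appended at the end.
    """
    steps = torch.arange(n_steps) / max(n_steps - 1, 1)
    sigmas = (sigma_max ** (1 / rho)
              + steps * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return torch.cat([sigmas, torch.zeros(1)])
```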

Training Datasets. We utilize four synthetic datasets, including both single-frame and multi-frame datasets. 1) Single-frame: Hypersim [56] is a photorealistic synthetic dataset with 461 indoor scenes. We use the official split with around 54K samples from 365 scenes for training; after filtering out incomplete samples, 39K samples remain. 2) Multi-frame: TartanAir [67] is a dataset collected in simulated environments for robot navigation tasks, comprising 18 scenes. We select 4 indoor scenes with 40K samples in total. Virtual KITTI 2 [11] is a synthetic urban dataset providing 5 scenes with various weather conditions and modified camera configurations. We use 4 scenes with 20K samples for training. MVS-Synth [31] is a synthetic urban dataset captured in the video game GTA. We use all 120 sequences with a total of 12K samples for training.

4.2 Evaluation

Evaluation Datasets. We perform a range of experiments to assess the performance of our model. We select several zero-shot video clips from KITTI-360 [44], MatrixCity [42] and ScanNet++ [74], each with 200 to 240 frames, enabling a thorough evaluation of both spatial accuracy and temporal consistency in video depth estimation.

Metrics. We consider metrics for both spatial accuracy and temporal consistency. For spatial accuracy, we apply two widely recognized metrics [55] for depth estimation, the Absolute Mean Relative Error (AbsRel) and the $\delta 1$ accuracy with a threshold of 1.25. For temporal consistency, we introduce the multi-frame similarity (Sim.). Consider two depth maps $D^{m}, D^{n}\in\mathbb{R}^{W\times H}$ at frames $m$ and $n$ of the video sequence: we unproject $D^{m}$ into a point cloud. Next, using the ground-truth world-to-camera poses $P_{m}, P_{n}\in\mathbb{R}^{3\times 4}$ of the two frames, we transform the point cloud from frame $m$'s camera space to frame $n$'s camera space, and project it onto frame $n$'s image plane to yield $D^{m\to n}$. We measure the temporal consistency as the average L1 distance between $D^{m\to n}$ and $D^{n}$. In practice, we only calculate the multi-frame similarity on adjacent frames. Note that the ground-truth camera poses are used for evaluation only.
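
A simplified sketch of the warping step behind the Sim. metric is given below; it assumes shared, known intrinsics $K$ for both frames and omits details such as scale alignment of relative depth and occlusion/validity masking.

```python
import numpy as np

def warp_depth(depth_m, K, P_m, P_n):
    """Project frame m's depth into frame n's view (sketch).

    depth_m: (H, W) depth map of frame m
    K:       (3, 3) camera intrinsics (assumed shared by both frames)
    P_m,P_n: (3, 4) world-to-camera poses [R | t] of frames m and n
    Returns the warped depth values and their pixel coordinates in frame n.
    """
    h, w = depth_m.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # (N, 3)

    # unproject to frame m's camera space, then to world coordinates
    pts_m = (np.linalg.inv(K) @ pix.T) * depth_m.reshape(1, -1)       # (3, N)
    R_m, t_m = P_m[:, :3], P_m[:, 3:]
    pts_w = R_m.T @ (pts_m - t_m)

    # transform into frame n's camera space and project onto its image plane
    R_n, t_n = P_n[:, :3], P_n[:, 3:]
    pts_n = R_n @ pts_w + t_n
    proj = K @ pts_n
    uv_n = proj[:2] / proj[2:]                 # pixel coordinates in frame n
    return pts_n[2], uv_n                      # warped depth D^{m->n}, coords

# Sim. is then the mean |D^{m->n} - D^n| over valid adjacent-frame pixels,
# after sampling D^n at the warped coordinates uv_n.
```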

4.3 Comparison Experiments

Baselines. We compare ChronoDepth with four baselines covering both monocular and video depth estimation, aiming to demonstrate its effectiveness across spatial accuracy and temporal consistency. For monocular depth estimation, we select state-of-the-art methods that have demonstrated generalizability across various benchmarks and achieve satisfying spatial accuracy: DPT [54], Depth Anything [73], and Marigold [36]. For video depth estimation, we choose NVDS [69], a state-of-the-art video depth estimator based on DPT.

Quantitative and Qualitative Analysis. In Table 1, we compare ChronoDepth with both discriminative (top) and generative (bottom) existing methods. It is worth mentioning that discriminative models are trained on millions of images annotated with ground-truth depth, about one order of magnitude more than generative ones. Nonetheless, ChronoDepth consistently outperforms prior art in temporal consistency, while retaining comparable spatial accuracy, often higher than the Marigold baseline – conversely to our main video counterpart, NVDS. Notably, Marigold employs an ensemble strategy for optimal performance, which involves running the denoising process ten times and averaging the predictions. However, this significantly hampers the speed, stretching the inference time to 8.75 seconds per frame. In contrast, ChronoDepth's inference is markedly more efficient, requiring just 0.84 seconds per frame. Fig. 3 shows a qualitative comparison between the baselines and ChronoDepth on the three datasets involved in our evaluation. We can appreciate how our video depth estimation framework predicts sharper, finer, and more accurate details – as highlighted by the arrows. Further, we observe that ChronoDepth improves performance in far regions of driving scenarios thanks to the multi-frame context. Note that this is not reflected in the quantitative results due to the incomplete GT.

Method                   | KITTI-360             | MatrixCity            | ScanNet++
                         | AbsRel↓  δ1↑   Sim.↓  | AbsRel↓  δ1↑   Sim.↓  | AbsRel↓  δ1↑   Sim.↓
DPT [54]                 | 17.7     75.4   2.25  | 11.9     86.2   10.5  | 13.9     81.8   0.17
NVDS [69]                | 31.7     62.1   2.92  | 13.4     85.8   11.3  | 17.3     72.0   0.14
DepthAnything [73]       | 16.8     77.7   1.89  |  8.3     90.7   11.7  |  8.2     92.7   0.21
-------------------------|-----------------------|-----------------------|---------------------
Marigold [36]            | 19.7     73.1   1.12  | 13.9     79.7    9.0  | 11.3     86.9   0.16
Marigold (w/o ensemble)  | 21.4     69.0   1.83  | 14.6     76.8    9.3  | 12.0     85.1   0.18
Ours                     | 19.6     72.4   0.91  | 13.4     82.3    8.8  | 12.6     87.1   0.10
Table 1: Quantitative Comparison on zero-shot depth benchmarks. Top: Discriminative methods. Bottom: Generative methods.
Figure 3: Qualitative Comparison on KITTI-360 (rows 1-2), ScanNet++ (3-4) and MatrixCity (5-6). The first column displays RGB images, followed by depth maps predicted by NVDS, Depth Anything, Marigold, and ChronoDepth. Ground-truth depth is shown on the right-most column.

4.4 Ablation Studies

We then conduct ablation studies on the KITTI-360 dataset.

                  | AbsRel↓  δ1↑   Sim.↓
w/o ImgDepth      | 24.2     59.8   0.94
w/o RandomClip    | 20.7     69.3   0.93
w/o S-T Finetune  | 21.0     68.7   0.92
w/o Inpainting    | 20.2     71.0   1.03
Full              | 19.6     72.4   0.91
Table 2: Ablation Study. We report accuracy and consistency metrics of our method on KITTI-360 without single-frame image pretraining (w/o ImgDepth), without random clip lengths (w/o RandomClip), without sequential spatial-temporal fine-tuning (w/o S-T Finetune), and without inpainting during inference (w/o Inpainting).
Figure 4: Ablation Study. We report accuracy and consistency metrics of our method on KITTI-360 with different overlaps between clips. Overlap 0 refers to separate inference.

Single-Frame Datasets. We investigate the impact of single-frame data during the training phase. We train our model in two different settings, one with only multi-frame data and one that additionally includes single-frame data. As shown in Table 2, training with additional single-frame data significantly improves video depth prediction in both spatial accuracy and temporal consistency over using only multi-frame data (w/o ImgDepth). In particular, adding single-frame data yields better estimates in regions with high-frequency depth variations.

Random Clip Length Sampling. Next, we ablate the effectiveness of random video clip length sampling during training. Table 2 shows that removing this sampling (w/o RandomClip) leads to performance degradation in spatial accuracy. This indicates that the random clip length sampling strategy acts as an effective form of data augmentation, mitigating the risk of model overfitting.

Sequential Spatial-Temporal Fine-Tune. We evaluate the effect of the sequential spatial-temporal training protocol in Table 2 by jointly training the full network (w/o S-T Finetune). The sequential training strategy leads to better spatial accuracy and temporal coherence, suggesting that disentangling the spatial and temporal layers is a better way to tame the video foundation model into a depth estimator.

Temporal Inpaint Inference. We compare the results of temporal inpaint inference and separate inference obtained from the same checkpoint. For temporal inpaint inference, we set the overlap between adjacent clips to 1. As shown in Table 2, temporal inpaint inference achieves significantly improved temporal consistency with only an inconsequential compromise to spatial accuracy. Fig. 4 illustrates the trade-off between the performance of our method and the efficiency of inference, specifically regarding the number of overlapping frames during inpainting.

4.5 Applications

Depth Conditioned Video Generation. We perform a depth-conditioned video generation experiment, where video depth maps generated by our approach and by other approaches serve as inputs to a video ControlNet [81]. As shown in Fig. 5, we manage to synthesize high-quality RGB videos (depth maps are generated on the KITTI-360 validation set, with the corresponding first-frame RGB used as the reference image). Furthermore, we quantitatively assess the visual fidelity (FID [27]) and temporal consistency (FVD [63]) of the videos generated using depth maps from the various depth estimation methods, as reported in Table 3.

Method              | KITTI-360
                    | FID↓    FVD↓
DPT [54]            | 79.7    310.3
NVDS [69]           | 87.2    420.5
DepthAnything [73]  | 59.0    322.3
Marigold [36]       | 122.5   683.5
Ours                | 63.1    292.4
Table 3: Quality of Video Generation using different depth estimators. Our temporally consistent depth yields videos of superior FVD compared to baselines.
Marigold [36] Ours

Figure 5: Depth Conditioned Video Generation. Compared to baseline methods such as Marigold, the depth maps generated by our method can drive ControlNet to produce more consistent and realistic videos.

Novel View Synthesis. We then use unprojected point clouds generated from ChronoDepth's depth maps to initialize 3DGS [37]. As demonstrated in Fig. 6, 3DGS initialized with ChronoDepth converges at a notably faster rate than with our baseline. For a visual evaluation, we show a qualitative comparison in Fig. 7. Using ChronoDepth as the initialization for NVS delivers superior reconstruction quality, particularly in distant regions.

Figure 6: PSNR Convergence Curve. We report PSNR of 3DGS over training iterations using different initializations.

Marigold [36]


Ours


Figure 7: Novel View Synthesis Comparison using different initializations.

5 Conclusion

This paper introduces ChronoDepth, a video depth estimator that prioritizes temporal consistency by leveraging video generation priors. Our exploration of various training protocols and evaluation methodologies has led us to identify the most effective approach, resulting in superior performance. Specifically, ChronoDepth outperforms existing methods in terms of temporal consistency, surpassing both image and video depth estimation techniques, while maintaining comparable spatial accuracy. Our assessments on downstream applications, namely depth-conditioned video generation and novel view synthesis, underscore the advantages of our temporally consistent depth estimation approach. We contend that our empirical insights into harnessing video generation models for depth estimation lay the groundwork for future investigations in this domain.

References

  • [1] Alembics. Disco diffusion, 2022.
  • [2] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18370–18380, 2023.
  • [3] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  • [4] Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023.
  • [5] Dina Bashkirova, José Lezama, Kihyuk Sohn, Kate Saenko, and Irfan Essa. Masksketch: Unpaired structure-guided masked image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1879–1889, 2023.
  • [6] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009–4018, 2021.
  • [7] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  • [8] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv.org, 2311.15127, 2023.
  • [9] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • [10] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  • [11] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020.
  • [12] Yuanzhouhan Cao, Yidong Li, Haokui Zhang, Chao Ren, and Yifan Liu. Learning structure affinity for video depth estimation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 190–198, 2021.
  • [13] Wenhao Chai, Xun Guo, Gaoang Wang, and Yan Lu. Stablevideo: Text-driven consistency-aware diffusion video editing. In ICCV, 2023.
  • [14] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [15] Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, and Anand Bhattad. Generative models: What do they know? do they know things? let’s find out! arXiv.org, 2023.
  • [16] Yiqun Duan, Xianda Guo, and Zheng Zhu. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. arXiv preprint arXiv:2303.05021, 2023.
  • [17] Ainaz Eftekhar, Alexander Sax, Jitendra Malik, and Amir Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
  • [18] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
  • [19] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep Ordinal Regression Network for Monocular Depth Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [20] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv.org, 2024.
  • [21] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
  • [22] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [23] Ming Gui, Johannes S. Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Depthfm: Fast monocular depth estimation with flow matching. arXiv.org, 2403.13788, 2024.
  • [24] Vitor Guizilini, Rareș Ambruș, Dian Chen, Sergey Zakharov, and Adrien Gaidon. Multi-frame self-supervised depth with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 160–170, 2022.
  • [25] Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rares Ambrus, and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023.
  • [26] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • [27] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [28] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022.
  • [29] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
  • [30] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  • [31] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018.
  • [32] Yash Jain, Anshul Nasery, Vibhav Vineet, and Harkirat Behl. Peekaboo: Interactive video generation via masked-diffusion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [33] Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, and Ping Luo. Ddp: Diffusion model for dense visual prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21741–21752, 2023.
  • [34] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • [35] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
  • [36] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [37] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. on Graphics, 2023.
  • [38] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  • [39] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • [40] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021.
  • [41] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.
  • [42] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [43] Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [44] Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022.
  • [45] Xiaoxiao Long, Lingjie Liu, Wei Li, Christian Theobalt, and Wenping Wang. Multi-view depth estimation using epipolar spatio-temporal networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8258–8267, 2021.
  • [46] Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [47] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020.
  • [48] S. Mahdi H. Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, and Yağız Aksoy. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [49] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • [50] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [51] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  • [52] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [53] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • [54] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  • [55] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  • [56] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021.
  • [57] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [58] Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, and David J. Fleet. The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. In Advances in Neural Information Processing Systems (NIPS), 2023.
  • [59] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [60] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGB-D images. In Proc. of the European Conf. on Computer Vision (ECCV), 2012.
  • [61] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [62] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • [63] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019.
  • [64] Vaishakh Patil, Wouter Van Gansbeke, Dengxin Dai, and Luc Van Gool. Don’t forget the past: Recurrent depth estimation from monocular video. IEEE Robotics and Automation Letters, 5(4):6813–6820, 2020.
  • [65] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • [66] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  • [67] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
  • [68] Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, and Jianming Zhang. Less is more: Consistent video depth estimation with masked frames modeling. In Proceedings of the 30th ACM International Conference on Multimedia, pages 6347–6358, 2022.
  • [69] Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, and Guosheng Lin. Neural video depth stabilizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [70] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1164–1174, 2021.
  • [71] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
  • [72] Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. Diffusion models trained with large data are transferable visual models. arXiv preprint arXiv:2403.06090, 2024.
  • [73] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [74] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the International Conference on Computer Vision (ICCV), 2023.
  • [75] Wei Yin, Yifan Liu, and Chunhua Shen. Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.
  • [76] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [77] Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 204–213, 2021.
  • [78] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Newcrfs: Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • [79] Chi Zhang, Wei Yin, Billzb Wang, Gang Yu, Bin Fu, and Chunhua Shen. Hierarchical normalization for robust monocular depth estimation. Advances in Neural Information Processing Systems, 35:14128–14139, 2022.
  • [80] Haokui Zhang, Chunhua Shen, Ying Li, Yuanzhouhan Cao, Yu Liu, and Youliang Yan. Exploiting temporal consistency for real-time video depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1725–1734, 2019.
  • [81] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • [82] Zhoutong Zhang, Forrester Cole, Richard Tucker, William T Freeman, and Tali Dekel. Consistent depth of moving objects in video. ACM Transactions on Graphics (TOG), 40(4):1–12, 2021.
  • [83] Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, and Tao Mei. Trip: Temporal residual learning with image noise prior for image-to-video diffusion models. In CVPR, 2024.
  • [84] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5729–5739, 2023.