VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment

Cong, Wenyan; Zhu, Hanqing; Wang, Kevin; Lei, Jiahui; Stearns, Colton; Cai, Yuanhao; Wang, Dilin; Ranjan, Rakesh; Feiszli, Matt; Guibas, Leonidas; Wang, Zhangyang; Wang, Weiyao; Fan, Zhiwen

arXiv:2501.01949 (cs)

[Submitted on 3 Jan 2025 (v1), last revised 10 Mar 2025 (this version, v2)]

Abstract: Efficiently reconstructing 3D scenes from monocular video remains a core challenge in computer vision, with applications in virtual reality, robotics, and scene understanding. Recent methods that work without camera poses commonly reconstruct the scene progressively, frame by frame, which incurs high computational overhead and compounds errors as videos grow longer. To overcome these issues, we introduce VideoLifter, a novel video-to-3D pipeline that applies a local-to-global strategy on a fragment basis, achieving both high efficiency and state-of-the-art (SOTA) quality. Locally, VideoLifter uses learnable 3D priors to register fragments, extracting the information needed for subsequent 3D Gaussian initialization while enforcing inter-fragment consistency and keeping the process efficient. Globally, it employs a tree-based hierarchical merging method, with key-frame guidance for inter-fragment alignment, pairwise merging with Gaussian point pruning, and subsequent joint optimization, to ensure global consistency while efficiently mitigating cumulative errors. This approach reduces training time by over 82% while delivering better visual quality than current SOTA methods.
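The local-to-global idea in the abstract (split the video into fragments, then merge neighbouring fragments pairwise in a tree) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_fragments` and `hierarchical_merge`, the fragment size, and the `merge_pair` callback are all hypothetical, and list concatenation stands in for the actual alignment, Gaussian pruning, and joint optimization steps. The sketch only shows why a tree needs roughly log2(n) merge levels for n fragments rather than n - 1 sequential steps.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_fragments(frames: Sequence[int], fragment_size: int) -> List[List[int]]:
    """Split a frame sequence into consecutive, non-overlapping fragments.

    Hypothetical helper: the fragment size and the lack of overlap are
    assumptions made for illustration only.
    """
    if fragment_size < 1:
        raise ValueError("fragment_size must be positive")
    return [list(frames[i:i + fragment_size])
            for i in range(0, len(frames), fragment_size)]


def hierarchical_merge(items: List[T],
                       merge_pair: Callable[[T, T], T]) -> Tuple[T, int]:
    """Merge neighbouring items pairwise, level by level.

    Returns the single merged result and the number of tree levels used.
    `merge_pair` is a placeholder for whatever pairwise alignment and
    merging a real pipeline would perform.
    """
    if not items:
        raise ValueError("nothing to merge")
    levels = 0
    while len(items) > 1:
        # Merge items (0,1), (2,3), ... at this level of the tree.
        merged = [merge_pair(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2 == 1:
            # An unpaired trailing item is carried up to the next level.
            merged.append(items[-1])
        items = merged
        levels += 1
    return items[0], levels


# Usage: 32 frame indices, fragments of 4 frames, concatenation as the stand-in merge.
fragments = split_into_fragments(list(range(32)), fragment_size=4)
scene, levels = hierarchical_merge(fragments, lambda a, b: a + b)
```

With 8 fragments the loop runs for 3 levels, whereas a purely sequential scheme would need 7 dependent merge steps; errors introduced at one merge therefore pass through fewer subsequent merges.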

Comments:	project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.01949 [cs.CV]
	(or arXiv:2501.01949v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.01949

Submission history

From: Wenyan Cong
[v1] Fri, 3 Jan 2025 18:52:36 UTC (1,351 KB)
[v2] Mon, 10 Mar 2025 17:19:37 UTC (1,500 KB)
