Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation that we term Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model toward accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR depth at multiple scales within the depth decoder. To address the training challenge posed by the scarcity of datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation on synthetic data and pseudo GT depth generation on real data. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.
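To make the multi-scale prompt fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of how a low-resolution LiDAR depth prompt could be injected into one decoder stage: the prompt is resized to the feature resolution, lifted to the feature dimension by a small convolutional projection, and added to the decoder features. The module name `PromptFusionBlock`, the channel counts, and the projection design are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFusionBlock(nn.Module):
    """Illustrative sketch: inject a low-res LiDAR depth prompt into one decoder stage."""

    def __init__(self, feat_channels: int):
        super().__init__()
        # Small conv stack that lifts the 1-channel depth prompt
        # to the decoder's feature dimension at this scale (assumed design).
        self.depth_proj = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, lidar_depth: torch.Tensor) -> torch.Tensor:
        # Resize the low-res LiDAR depth to the feature resolution,
        # project it, and fuse it additively with the decoder features.
        prompt = F.interpolate(
            lidar_depth, size=feat.shape[-2:], mode="bilinear", align_corners=False
        )
        return feat + self.depth_proj(prompt)


# Toy usage: fuse a 24x24 LiDAR depth map at three hypothetical decoder scales.
if __name__ == "__main__":
    lidar = torch.rand(1, 1, 24, 24)  # low-res metric depth prompt
    feats = [torch.rand(1, c, s, s) for c, s in [(256, 48), (128, 96), (64, 192)]]
    blocks = [PromptFusionBlock(c) for c in (256, 128, 64)]
    fused = [blk(f, lidar) for blk, f in zip(blocks, feats)]
    print([f.shape for f in fused])
```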
ARKit LiDAR depth is a low-resolution depth map generated by the ARKit API from the iPhone's 24x24-point LiDAR and RGB images.
This demo is implemented with our ViT-Small model, which runs at over 94 FPS on a single RTX 4090 GPU. The video is sped up 2x for demonstration.
We thank Prof. Weinan Zhang for generous support of the robot experiments, including the space, objects, and the Unitree H1 robot. We also thank Zhengbang Zhu, Jiahang Cao, Xinyao Li, and Wentao Dong for their help in setting up the robot platform and collecting robot data.
@article{lin2024promptda,
  title={Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation},
  author={Lin, Haotong and Peng, Sida and Chen, Jingxiao and Peng, Songyou and Sun, Jiaming and Liu, Minghuan and Bao, Hujun and Feng, Jiashi and Zhou, Xiaowei and Kang, Bingyi},
  journal={arXiv},
  year={2024}
}