Abstract
In this paper, we introduce a novel training method that lets any monocular depth network learn absolute scale and estimate metric road-scene depth from regular training data alone, i.e., driving videos. We refer to this training framework as FUMET. The key idea is to leverage cars found on the road as sources of scale supervision and to incorporate them robustly into network training. FUMET detects and estimates the sizes of cars in a frame and aggregates the scale information extracted from them into an estimate of the camera height, whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any otherwise scale-oblivious monocular depth network so that it becomes not only scale-aware but also metric-accurate, without the need for auxiliary sensors or extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of FUMET, which achieves state-of-the-art accuracy. We also show that FUMET enables training on mixed datasets with different camera heights, which leads to larger-scale training and better generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and FUMET democratizes its deployment by establishing the means to convert any model into a metric depth estimator.
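The abstract describes the mechanism only at a high level. As a rough illustration of the camera-height-consistency idea, the following PyTorch sketch shows one way such a loss could be expressed. It is an assumption-laden sketch, not the authors' implementation: the names (camera_height_consistency_loss, h_hat, car_scales, H_running) are hypothetical, and the specific formulation (per-frame metric scale from detected cars multiplied by an up-to-scale camera height, pulled toward a sequence-level height estimate) is our reading of the abstract.

import torch

def camera_height_consistency_loss(
    h_hat: torch.Tensor,       # (B,) up-to-scale camera heights implied by predicted depth
    car_scales: torch.Tensor,  # (B,) per-frame metric scale estimates from detected cars
    H_running: torch.Tensor,   # ()   running estimate of the true camera height (meters)
) -> torch.Tensor:
    """Penalize disagreement between the metric camera height implied by
    each frame (car_scales * h_hat) and the sequence-level estimate.
    Hypothetical formulation; only illustrates the consistency idea."""
    h_metric = car_scales * h_hat              # metric camera height implied per frame
    return torch.mean((h_metric - H_running) ** 2)

# Toy usage: three frames whose implied camera heights should agree with ~1.6 m.
h_hat = torch.tensor([0.80, 0.82, 0.79], requires_grad=True)
car_scales = torch.tensor([2.00, 1.95, 2.05])
H_running = torch.tensor(1.6)

loss = camera_height_consistency_loss(h_hat, car_scales, H_running)
loss.backward()  # gradients flow into the depth network that produced h_hat
print(float(loss))

Because the loss depends only on quantities derived from the predicted depth and from detected cars, minimizing it would push an otherwise scale-ambiguous network toward a single metric scale, consistent with the abstract's claim that no auxiliary sensors or extra supervision are needed.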
Acknowledgement
This work was supported in part by JSPS 20H05951 and 21H04893, JST JPMJCR20G7 and JPMJAP2305, and RIKEN GRP.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kinoshita, G., Nishino, K. (2025). Camera Height Doesn’t Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_4
DOI: https://doi.org/10.1007/978-3-031-73337-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73336-9
Online ISBN: 978-3-031-73337-6
eBook Packages: Computer Science, Computer Science (R0)