Abstract
In this paper, we introduce a novel training method that lets any monocular depth network learn absolute scale and estimate metric road-scene depth from regular training data alone, i.e., driving videos. We refer to this training framework as FUMET. The key idea is to leverage cars found on the road as sources of scale supervision and to incorporate them robustly into network training. FUMET detects and estimates the sizes of cars in a frame and aggregates the scale information extracted from them into an estimate of the camera height, whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any otherwise scale-oblivious monocular depth network so that it becomes not only scale-aware but also metric-accurate, without the need for auxiliary sensors or extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of FUMET, which achieves state-of-the-art accuracy. We also show that FUMET enables training on mixed datasets with different camera heights, which leads to larger-scale training and better generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and FUMET democratizes its deployment by establishing the means to convert any model into a metric depth estimator.
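The abstract describes the mechanism only at a high level. As a rough illustration of the camera-height-consistency idea, the following PyTorch sketch shows one way such a loss could be expressed. It is an assumption-laden sketch, not the authors' implementation: the names (camera_height_consistency_loss, h_hat, car_scales, H_running) are hypothetical, and the specific formulation (per-frame metric scale from detected cars multiplied by an up-to-scale camera height, pulled toward a sequence-level height estimate) is our reading of the abstract.

import torch

def camera_height_consistency_loss(
    h_hat: torch.Tensor,       # (B,) up-to-scale camera heights implied by predicted depth
    car_scales: torch.Tensor,  # (B,) per-frame metric scale estimates from detected cars
    H_running: torch.Tensor,   # ()   running estimate of the true camera height (meters)
) -> torch.Tensor:
    """Penalize disagreement between the metric camera height implied by
    each frame (car_scales * h_hat) and the sequence-level estimate.
    Hypothetical formulation; only illustrates the consistency idea."""
    h_metric = car_scales * h_hat              # metric camera height implied per frame
    return torch.mean((h_metric - H_running) ** 2)

# Toy usage: three frames whose implied camera heights should agree with ~1.6 m.
h_hat = torch.tensor([0.80, 0.82, 0.79], requires_grad=True)
car_scales = torch.tensor([2.00, 1.95, 2.05])
H_running = torch.tensor(1.6)

loss = camera_height_consistency_loss(h_hat, car_scales, H_running)
loss.backward()  # gradients flow into the depth network that produced h_hat
print(float(loss))

Because the loss depends only on quantities derived from the predicted depth and from detected cars, minimizing it would push an otherwise scale-ambiguous network toward a single metric scale, consistent with the abstract's claim that no auxiliary sensors or extra supervision are needed.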
Acknowledgement
This work was supported in part by JSPS 20H05951 and 21H04893, JST JPMJCR20G7 and JPMJAP2305, and RIKEN GRP.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kinoshita, G., Nishino, K. (2025). Camera Height Doesn’t Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_4
DOI: https://doi.org/10.1007/978-3-031-73337-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73336-9
Online ISBN: 978-3-031-73337-6
eBook Packages: Computer Science, Computer Science (R0)