Abstract
Based on the assumption of photometric consistency, self-supervised monocular depth estimation has been widely studied because it avoids costly annotations. However, it remains sensitive to noise, occlusions and photometric changes. To overcome these problems, we propose a multi-task model with a dual-attention-based cross-task feature fusion module (DCFFM). We simultaneously predict depth and semantic segmentation with a shared encoder and two separate decoders, aiming to improve depth estimation with the additional supervision provided by semantics. In DCFFM, we fuse the cross-task features with both pixel-wise and channel-wise attention, so that each task fully exploits the helpful information carried by the other. Both attentions are computed in a one-to-all manner, which captures global information while limiting the growth of computation. Furthermore, we propose a novel data augmentation method called data exchange & recovery (DE&R), which performs inter-batch data exchange in both the vertical and horizontal directions to increase the diversity of the input data. This encourages the network to explore more diversified cues for depth estimation and avoids overfitting. Crucially, the corresponding outputs are then recovered so that the geometric relationships are preserved and the photometric loss is computed correctly. Extensive experiments on the KITTI and NYU-Depth-v2 datasets demonstrate that our method is highly effective and achieves better performance than other state-of-the-art works.
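To make the DE&R idea concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation: the hypothetical exchange and recover helpers pair each sample with a neighbour in the batch, swap image halves along the chosen direction, and undo the swap on the network output so that the recovered predictions stay aligned with their original inputs when the photometric loss is computed. The pairing scheme and split points are assumptions for illustration only.

import torch

def exchange(images, direction="vertical"):
    # Pair each sample i with neighbour (i - 1) mod B by rolling the batch,
    # then take one half of the image from the partner (exchange step).
    partner = torch.roll(images, shifts=1, dims=0)
    mixed = images.clone()
    _, _, h, w = images.shape
    if direction == "vertical":
        mixed[:, :, : h // 2, :] = partner[:, :, : h // 2, :]
    else:  # horizontal
        mixed[:, :, :, : w // 2] = partner[:, :, :, : w // 2]
    return mixed

def recover(preds, direction="vertical"):
    # Undo the exchange on the predictions: the half taken from sample i now
    # lives in the partner's output, so roll in the opposite direction and copy
    # it back, restoring geometric alignment with the original inputs.
    partner = torch.roll(preds, shifts=-1, dims=0)
    restored = preds.clone()
    _, _, h, w = preds.shape
    if direction == "vertical":
        restored[:, :, : h // 2, :] = partner[:, :, : h // 2, :]
    else:
        restored[:, :, :, : w // 2] = partner[:, :, :, : w // 2]
    return restored

# Usage sketch, with depth_net standing in for any dense-prediction network:
# mixed = exchange(batch, "vertical")
# depth = recover(depth_net(mixed), "vertical")  # aligned with the original batch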
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the Department of Science and Technology of Guangdong Province (No. 2021B01420003).
Ethics declarations
Conflicts of interest
The authors declare that they do not have any conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, J., Ye, F. & Lai, Y. Dual-attention-based semantic-aware self-supervised monocular depth estimation. Multimed Tools Appl 83, 65579–65601 (2024). https://doi.org/10.1007/s11042-023-17976-1