Abstract
Based on the assumption of photometric consistency, self-supervised monocular depth estimation has been widely studied because it avoids costly annotations. However, it remains sensitive to noise, occlusions and photometric changes. To overcome these problems, we propose a multi-task model with a dual-attention-based cross-task feature fusion module (DCFFM). We simultaneously predict depth and semantic segmentation with a shared encoder and two separate decoders, aiming to improve depth estimation with the additional supervision provided by semantics. In DCFFM, we fuse the cross-task features with both pixel-wise and channel-wise attention, so that each task fully exploits the helpful information carried by the other. Both attentions are computed in a one-to-all manner, which captures global information while limiting the growth of computation. Furthermore, we propose a novel data augmentation method called data exchange & recovery (DE&R), which performs inter-batch data exchange in both the vertical and horizontal directions to increase the diversity of the input data. This encourages the network to explore more diversified cues for depth estimation and avoids overfitting. Crucially, the corresponding outputs are then recovered so that the geometric relationships are preserved and the photometric loss is computed correctly. Extensive experiments on the KITTI and NYU-Depth-v2 datasets demonstrate that our method is highly effective and achieves better performance than other state-of-the-art works.
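To make the DE&R idea concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation: the hypothetical exchange and recover helpers pair each sample with a neighbour in the batch, swap image halves along the chosen direction, and undo the swap on the network output so that the recovered predictions stay aligned with their original inputs when the photometric loss is computed. The pairing scheme and split points are assumptions for illustration only.

import torch

def exchange(images, direction="vertical"):
    # Pair each sample i with neighbour (i - 1) mod B by rolling the batch,
    # then take one half of the image from the partner (exchange step).
    partner = torch.roll(images, shifts=1, dims=0)
    mixed = images.clone()
    _, _, h, w = images.shape
    if direction == "vertical":
        mixed[:, :, : h // 2, :] = partner[:, :, : h // 2, :]
    else:  # horizontal
        mixed[:, :, :, : w // 2] = partner[:, :, :, : w // 2]
    return mixed

def recover(preds, direction="vertical"):
    # Undo the exchange on the predictions: the half taken from sample i now
    # lives in the partner's output, so roll in the opposite direction and copy
    # it back, restoring geometric alignment with the original inputs.
    partner = torch.roll(preds, shifts=-1, dims=0)
    restored = preds.clone()
    _, _, h, w = preds.shape
    if direction == "vertical":
        restored[:, :, : h // 2, :] = partner[:, :, : h // 2, :]
    else:
        restored[:, :, :, : w // 2] = partner[:, :, :, : w // 2]
    return restored

# Usage sketch, with depth_net standing in for any dense-prediction network:
# mixed = exchange(batch, "vertical")
# depth = recover(depth_net(mixed), "vertical")  # aligned with the original batch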
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the Department of Science and Technology of Guangdong Province (No. 2021B01420003).
Ethics declarations
Conflicts of interest
The authors declare that they do not have any conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, J., Ye, F. & Lai, Y. Dual-attention-based semantic-aware self-supervised monocular depth estimation. Multimed Tools Appl 83, 65579–65601 (2024). https://doi.org/10.1007/s11042-023-17976-1