Abstract
Monocular depth estimation, which aims to predict per-pixel depth from a single RGB image, is a challenging task in computer vision. Supervised learning methods require large amounts of measured depth data, which are time-consuming and expensive to obtain. Self-supervised methods show great promise: they exploit scene geometry to derive supervision signals through image warping. Moreover, several works leverage other visual tasks (e.g., stereo matching and semantic segmentation) to further advance self-supervised monocular depth estimation. In this paper, we propose a novel framework that uses monocular depth completion as an auxiliary task to assist monocular depth estimation. In particular, a knowledge transfer strategy enables monocular depth estimation to benefit from the effective feature representations learned by the monocular depth completion task, so that the correlation between the two tasks is exploited within a single framework. The proposed framework uses only unlabeled stereo images and is therefore trained in a self-supervised manner. Experimental results on a publicly available dataset show that the proposed approach outperforms state-of-the-art self-supervised methods and achieves performance comparable to supervised methods.
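To make the idea of supervision through image warping concrete, the following is a minimal sketch of a stereo photometric loss, not the loss used in the proposed framework. It assumes rectified grayscale stereo images stored as NumPy arrays, a left-view disparity map in pixels, and linear interpolation along the horizontal axis only; the function names `warp_right_to_left` and `photometric_l1` are hypothetical and chosen for illustration.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at column x - d(x, y).

    Assumes rectified images, so corresponding pixels lie on the same row.
    right:     (H, W) grayscale image
    disparity: (H, W) left-view disparity in pixels (non-negative)
    """
    h, w = disparity.shape
    # Column in the right image that each left pixel should be sampled from.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)  # clamp samples that fall outside the image
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation between the two nearest columns.
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


def photometric_l1(left, right, disparity):
    """Mean absolute difference between the left image and its reconstruction."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    right = rng.random((8, 32))
    # Synthetic left view: the right view shifted by a true disparity of 3 pixels.
    left = np.roll(right, 3, axis=1)
    good = photometric_l1(left, right, np.full((8, 32), 3.0))
    bad = photometric_l1(left, right, np.zeros((8, 32)))
    print(f"loss with true disparity: {good:.4f}, with zero disparity: {bad:.4f}")
```

In this toy setting the loss is lower for the true disparity than for a zero disparity, which is the signal a self-supervised network can minimize without ground-truth depth. Practical systems typically add further terms (for example structural similarity and smoothness) that this sketch omits.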
This work was supported in part by the Natural Science Foundation of Tianjin (No.18ZXZNGX00110).
Cite this article
Sun, L., Li, Y., Liu, B. et al. Transferring knowledge from monocular completion for self-supervised monocular depth estimation. Multimed Tools Appl 81, 42485–42495 (2022). https://doi.org/10.1007/s11042-021-11212-4