DOI: 10.5555/3600270.3601297

Hierarchical normalization for robust monocular depth estimation

Published: 03 April 2024
Abstract

    In this paper, we address monocular depth estimation with deep neural networks. To enable training of deep monocular depth estimation models on diverse data sources, state-of-the-art methods adopt image-level normalization strategies to generate affine-invariant depth representations. However, learning with image-level normalization mainly emphasizes the relations of pixel representations to the global statistics of the image, such as the structure of the scene, while fine-grained depth differences may be overlooked. We propose a novel multi-scale depth normalization method that hierarchically normalizes the depth representations based on spatial information and depth distributions. Compared with previous normalization strategies applied only at the holistic image level, the proposed hierarchical normalization effectively preserves fine-grained details and improves accuracy. We present two strategies that define the hierarchical normalization contexts in the depth domain and the spatial domain, respectively. Extensive experiments show that the proposed normalization strategy remarkably outperforms previous normalization methods, and we set a new state of the art on five zero-shot transfer benchmark datasets.
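    To make the contrast between image-level and hierarchical normalization concrete, the sketch below shows a common affine-invariant normalization (median shift, mean-absolute-deviation scale, as used in prior work such as MiDaS) applied once per image versus applied inside non-overlapping windows at several spatial scales. The window grids and the `levels` scale set are hypothetical illustrative choices, not the paper's exact normalization contexts.

    ```python
    import numpy as np

    def affine_invariant_norm(depth):
        """Image-level normalization: subtract the median and divide by the
        mean absolute deviation, removing any affine (scale/shift) ambiguity
        in the depth values."""
        t = np.median(depth)
        s = np.mean(np.abs(depth - t)) + 1e-8  # epsilon avoids division by zero
        return (depth - t) / s

    def hierarchical_norm(depth, levels=(1, 2, 4)):
        """Illustrative spatial-domain variant: apply the same affine-invariant
        normalization inside an n-by-n grid of windows at each scale, so that
        local depth differences are preserved rather than flattened by the
        global statistics. Returns one normalized map per scale."""
        h, w = depth.shape
        maps = []
        for n in levels:
            out = np.empty_like(depth, dtype=np.float64)
            row_blocks = np.array_split(np.arange(h), n)
            col_blocks = np.array_split(np.arange(w), n)
            for yb in row_blocks:
                for xb in col_blocks:
                    block = depth[np.ix_(yb, xb)]
                    out[np.ix_(yb, xb)] = affine_invariant_norm(block)
            maps.append(out)
        return maps
    ```

    In a training setup like the one described, a regression loss would be computed against the ground truth normalized the same way at each scale and averaged over scales, so the coarsest level captures scene structure while finer levels penalize errors in local depth ordering.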

    Supplementary Material

    Additional material: 3600270.3601297_supp.pdf (supplemental material)



        Published In

        NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, November 2022. 39114 pages.

        Publisher

        Curran Associates Inc., Red Hook, NY, United States


        Qualifiers

        • Research-article
        • Research
        • Refereed limited
