DOI: 10.5555/3600270.3601297

Hierarchical normalization for robust monocular depth estimation

Published: 03 April 2024
Abstract

    In this paper, we address monocular depth estimation with deep neural networks. To enable training of deep monocular depth estimation models on diverse data sources, state-of-the-art methods adopt image-level normalization strategies to generate affine-invariant depth representations. However, learning with image-level normalization mainly emphasizes the relations of pixel representations to the global statistics of the image, such as the structure of the scene, while fine-grained depth differences may be overlooked. We propose a novel multi-scale depth normalization method that hierarchically normalizes the depth representations based on spatial information and depth distributions. Compared with previous normalization strategies applied only at the holistic image level, the proposed hierarchical normalization effectively preserves fine-grained details and improves accuracy. We present two strategies that define the hierarchical normalization contexts in the depth domain and the spatial domain, respectively. Extensive experiments show that the proposed normalization strategy remarkably outperforms previous normalization methods, and we set a new state of the art on five zero-shot transfer benchmark datasets.
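    To make the contrast between image-level and hierarchical normalization concrete, the sketch below shows a common affine-invariant normalization (median shift, mean-absolute-deviation scale, as used in prior work such as MiDaS) applied once per image versus applied inside non-overlapping windows at several spatial scales. The window grids and the `levels` scale set are hypothetical illustrative choices, not the paper's exact normalization contexts.

    ```python
    import numpy as np

    def affine_invariant_norm(depth):
        """Image-level normalization: subtract the median and divide by the
        mean absolute deviation, removing any affine (scale/shift) ambiguity
        in the depth values."""
        t = np.median(depth)
        s = np.mean(np.abs(depth - t)) + 1e-8  # epsilon avoids division by zero
        return (depth - t) / s

    def hierarchical_norm(depth, levels=(1, 2, 4)):
        """Illustrative spatial-domain variant: apply the same affine-invariant
        normalization inside an n-by-n grid of windows at each scale, so that
        local depth differences are preserved rather than flattened by the
        global statistics. Returns one normalized map per scale."""
        h, w = depth.shape
        maps = []
        for n in levels:
            out = np.empty_like(depth, dtype=np.float64)
            row_blocks = np.array_split(np.arange(h), n)
            col_blocks = np.array_split(np.arange(w), n)
            for yb in row_blocks:
                for xb in col_blocks:
                    block = depth[np.ix_(yb, xb)]
                    out[np.ix_(yb, xb)] = affine_invariant_norm(block)
            maps.append(out)
        return maps
    ```

    In a training setup like the one described, a regression loss would be computed against the ground truth normalized the same way at each scale and averaged over scales, so the coarsest level captures scene structure while finer levels penalize errors in local depth ordering.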

    Supplementary Material

    Additional material: 3600270.3601297_supp.pdf (supplemental material)



        Published In

        NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, November 2022. 39114 pages.

        Publisher

        Curran Associates Inc., Red Hook, NY, United States


        Qualifiers

        • Research-article
        • Research
        • Refereed limited
