Self-Supervised Learning of Depth and Ego-Motion for 3D Perception in Human Computer Interaction

Published: 25 September 2023
Abstract

3D perception of depth and ego-motion is of vital importance to intelligent agents and Human Computer Interaction (HCI) tasks such as robotics and autonomous driving. Several kinds of sensors can obtain 3D depth information directly, but the commonly used Lidar sensor is expensive, and the effective range of RGB-D cameras is limited. In computer vision, researchers have studied 3D perception extensively: traditional geometric algorithms require many hand-crafted features for depth estimation, whereas Deep Learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method based on a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) architecture, referred to as ViT-Depth. Image reconstruction losses, computed from the estimated depth and the motion between adjacent frames, serve as the supervision signal in a self-supervised learning pipeline. This is an effective solution for tasks that need accurate and low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages both the ability of the CNN to extract deep features and that of the Transformer to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes training more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.


Cited By

• (2024) Monocular Depth and Ego-motion Estimation with Scale Based on Superpixel and Normal Constraints. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3674977. Online publication date: 1 July 2024.
• (2024) DCL-depth: Monocular depth estimation network based on IAM and depth consistency loss. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18877-7. Online publication date: 25 March 2024.
• (2024) Resolution-sensitive self-supervised monocular absolute depth estimation. Applied Intelligence 54, 6 (2024), 4781–4793. DOI: 10.1007/s10489-024-05414-0. Online publication date: 5 April 2024.


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
February 2024
548 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613570
Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 September 2023
      Online AM: 23 March 2023
      Accepted: 15 March 2023
      Revised: 07 December 2022
      Received: 10 November 2021
      Published in TOMM Volume 20, Issue 2


Author Tags

1. autonomous driving
2. 3D perception
3. monocular depth and motion estimation
4. self-supervised learning
5. visual SLAM

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Shanghai Local Capacity Enhancement
      • Science and Technology Innovation Action Plan
      • Shanghai Science and Technology Commission
      • Chenguang talented program of Shanghai

Article Metrics

• Downloads (last 12 months): 298
• Downloads (last 6 weeks): 18
      Reflects downloads up to 27 Jul 2024
