DOI: 10.1145/3442705.3442706
Research Article

Self-Supervised Visual Odometry with Ego-Motion Sampling

Published: 21 March 2021

Abstract

In recent years, deep learning-based methods for monocular visual odometry have made considerable progress and now demonstrate state-of-the-art results on the well-known KITTI benchmark. However, collecting ground-truth camera poses for training deep visual odometry models requires special equipment and can therefore be difficult and expensive. To overcome this limitation, a number of unsupervised methods that exploit the geometric relations between depth and motion have been proposed. Nevertheless, a large gap in accuracy remains between unsupervised and supervised methods. In this work, we propose a simple method for generating self-supervision for visual odometry. During training, it requires dense depth maps and an approximate motion distribution of the target platform (e.g., a car or a robot). For each input frame, we sample a camera motion from the given distribution and then use the depth map to compute the optical flow that corresponds to the sampled motion. This generated optical flow serves as the input to a visual odometry model, while the sampled camera motion serves as the ground-truth output.
Experiments on KITTI demonstrate that a deep visual odometry model trained in the proposed self-supervised manner outperforms unsupervised visual odometry methods, thus narrowing the gap between methods that require no supervision and fully supervised ones. The source code is available on GitHub.
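A minimal Python/NumPy sketch of the self-supervision generation step described in the abstract follows, assuming a pinhole camera model: a small SE(3) motion is sampled from an assumed Gaussian motion distribution, pixels are back-projected with the dense depth map, moved rigidly, and re-projected to obtain the induced optical flow. The function names, Gaussian parameters, KITTI-like intrinsics, and constant depth map are illustrative placeholders, not the authors' released implementation.

```python
import numpy as np

def sample_motion(trans_std=(0.5, 0.05, 0.05), rot_std=0.01):
    """Sample a small SE(3) camera motion from an assumed Gaussian distribution
    over translation (metres) and axis-angle rotation (radians)."""
    t = np.random.normal(0.0, trans_std)            # translation (x, y, z)
    r = np.random.normal(0.0, rot_std, size=3)      # axis-angle rotation vector
    theta = np.linalg.norm(r) + 1e-12
    k = r / theta
    # Rodrigues' formula: rotation matrix from axis-angle.
    K_hat = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    R = np.eye(3) + np.sin(theta) * K_hat + (1.0 - np.cos(theta)) * K_hat @ K_hat
    return R, t

def flow_from_motion(depth, K, R, t):
    """Compute the optical flow induced by camera motion (R, t) for every pixel,
    given a dense depth map and camera intrinsics K (3x3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    # Back-project pixels to 3D points using the depth map.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Apply the sampled rigid motion and re-project to the image plane.
    pts2 = R @ pts + np.asarray(t).reshape(3, 1)
    proj = K @ pts2
    uv2 = proj[:2] / proj[2:3]
    flow = (uv2 - pix[:2]).T.reshape(h, w, 2)
    return flow

# Training pair: the synthesised flow is the network input,
# the sampled motion (R, t) is the ground-truth target.
K = np.array([[718.856, 0.0, 607.193],
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]])              # KITTI-like intrinsics (assumed)
depth = np.full((376, 1241), 10.0)           # placeholder dense depth map
R, t = sample_motion()
flow = flow_from_motion(depth, K, R, t)
```

In the method as described, such (flow, motion) pairs would be generated on the fly during training, with the visual odometry network regressing the sampled motion from the synthesised flow.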


Cited By

  • (2023) Neural Network-Based Recent Research Developments in SLAM for Autonomous Ground Vehicles: A Review. IEEE Sensors Journal 23(13), 13829–13858. DOI: 10.1109/JSEN.2023.3273913. Online publication date: 1 July 2023.



Published In

VSIP '20: Proceedings of the 2020 2nd International Conference on Video, Signal and Image Processing
December 2020
108 pages
ISBN: 9781450388931
DOI: 10.1145/3442705

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. autonomous driving
  2. self-supervised learning
  3. visual odometry

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

VSIP '20
