DOI: 10.1145/3442705.3442706
Research Article

Self-Supervised Visual Odometry with Ego-Motion Sampling

Published: 21 March 2021

Abstract

In recent years, deep learning-based methods for monocular visual odometry have made considerable progress and now demonstrate state-of-the-art results on the well-known KITTI benchmark. However, collecting ground-truth camera poses for training deep visual odometry models requires special equipment and can therefore be difficult and expensive. To overcome this limitation, a number of unsupervised methods that exploit the geometric relations between depth and motion have been proposed. Nevertheless, a large gap in accuracy remains between unsupervised and supervised methods. In this work, we propose a simple method for generating self-supervision for visual odometry. During training, it requires dense depth maps and an approximate motion distribution of the target platform (e.g., a car or a robot). For each input frame, we sample a camera motion from the given distribution and then use the depth map to compute the optical flow that corresponds to the sampled motion. This generated optical flow serves as the input to a visual odometry model, while the sampled camera motion serves as the ground-truth output.
Experiments on KITTI demonstrate that a deep visual odometry model trained in the proposed self-supervised manner outperforms unsupervised visual odometry methods, thus narrowing the gap between methods that require no supervision and fully supervised ones. The source code is available on GitHub.
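A minimal Python/NumPy sketch of the self-supervision generation step described in the abstract follows, assuming a pinhole camera model: a small SE(3) motion is sampled from an assumed Gaussian motion distribution, pixels are back-projected with the dense depth map, moved rigidly, and re-projected to obtain the induced optical flow. The function names, Gaussian parameters, KITTI-like intrinsics, and constant depth map are illustrative placeholders, not the authors' released implementation.

```python
import numpy as np

def sample_motion(trans_std=(0.5, 0.05, 0.05), rot_std=0.01):
    """Sample a small SE(3) camera motion from an assumed Gaussian distribution
    over translation (metres) and axis-angle rotation (radians)."""
    t = np.random.normal(0.0, trans_std)            # translation (x, y, z)
    r = np.random.normal(0.0, rot_std, size=3)      # axis-angle rotation vector
    theta = np.linalg.norm(r) + 1e-12
    k = r / theta
    # Rodrigues' formula: rotation matrix from axis-angle.
    K_hat = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    R = np.eye(3) + np.sin(theta) * K_hat + (1.0 - np.cos(theta)) * K_hat @ K_hat
    return R, t

def flow_from_motion(depth, K, R, t):
    """Compute the optical flow induced by camera motion (R, t) for every pixel,
    given a dense depth map and camera intrinsics K (3x3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    # Back-project pixels to 3D points using the depth map.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    # Apply the sampled rigid motion and re-project to the image plane.
    pts2 = R @ pts + np.asarray(t).reshape(3, 1)
    proj = K @ pts2
    uv2 = proj[:2] / proj[2:3]
    flow = (uv2 - pix[:2]).T.reshape(h, w, 2)
    return flow

# Training pair: the synthesised flow is the network input,
# the sampled motion (R, t) is the ground-truth target.
K = np.array([[718.856, 0.0, 607.193],
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]])              # KITTI-like intrinsics (assumed)
depth = np.full((376, 1241), 10.0)           # placeholder dense depth map
R, t = sample_motion()
flow = flow_from_motion(depth, K, R, t)
```

In the method as described, such (flow, motion) pairs would be generated on the fly during training, with the visual odometry network regressing the sampled motion from the synthesised flow.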


Cited By

  • (2023) Neural Network-Based Recent Research Developments in SLAM for Autonomous Ground Vehicles: A Review. IEEE Sensors Journal 23(13), 13829–13858. DOI: 10.1109/JSEN.2023.3273913. Online publication date: 1 July 2023.



Published In

VSIP '20: Proceedings of the 2020 2nd International Conference on Video, Signal and Image Processing
December 2020
108 pages
ISBN: 9781450388931
DOI: 10.1145/3442705

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. autonomous driving
  2. self-supervised learning
  3. visual odometry

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

VSIP '20
