DOI: 10.1109/IROS51168.2021.9636330

Bootstrapped Self-Supervised Training with Monocular Video for Semantic Segmentation and Depth Estimation

Published: 27 September 2021

Abstract

For a robot deployed in the world, it is desirable to be able to learn autonomously and improve on its initial, pre-set knowledge. We formalize this as a bootstrapped self-supervised learning problem: a system is first bootstrapped with supervised training on a labeled dataset, and we then look for a self-supervised training method that can improve the system beyond the supervised baseline using only unlabeled data. In this work, we leverage temporal consistency between frames in monocular video to perform this bootstrapped self-supervised training. We show that a well-trained, state-of-the-art semantic segmentation network can be further improved through our method. In addition, we show that the bootstrapped self-supervised training framework can help a network learn depth estimation better than either pure supervised training or pure self-supervised training.
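As a rough illustration of the approach described in the abstract, the following is a minimal sketch (in PyTorch, not the authors' released code) of what the self-supervised stage could look like after a segmentation network, a depth network, and a pose network have been bootstrapped with supervised training. The warping routine, the specific consistency losses, and all names (seg_net, depth_net, pose_net, inverse_warp) are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def inverse_warp(src, depth_tgt, T_tgt_to_src, K):
    # Warp a source-frame tensor src (B,C,H,W) into the target frame using the
    # predicted target depth (B,1,H,W), a 4x4 relative pose (B,4,4), and camera
    # intrinsics K (B,3,3): the standard view-synthesis warp used in
    # self-supervised monocular depth learning.
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    # Back-project target pixels to 3D points, then move them into the source camera.
    cam = (torch.inverse(K) @ pix.expand(B, -1, -1)) * depth_tgt.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_src = (T_tgt_to_src @ cam_h)[:, :3]
    # Project into the source image plane and sample the source tensor there.
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src, grid, padding_mode="border", align_corners=True)

def self_supervised_step(seg_net, depth_net, pose_net, frame_t, frame_s, K, opt):
    # One step on an unlabeled video pair (target frame t, nearby source frame s),
    # run after the networks were bootstrapped with supervised training.
    seg_t = seg_net(frame_t)          # class logits for the target frame
    seg_s = seg_net(frame_s)          # class logits for the source frame
    depth_t = depth_net(frame_t)      # per-pixel depth for the target frame
    T = pose_net(frame_t, frame_s)    # relative camera motion, target -> source (4x4)

    # Temporal consistency: source-frame predictions and pixels, warped into the
    # target view, should agree with the target frame's predictions and pixels.
    seg_s_warped = inverse_warp(seg_s, depth_t, T, K)
    img_s_warped = inverse_warp(frame_s, depth_t, T, K)

    loss_seg = F.kl_div(F.log_softmax(seg_t, dim=1),
                        F.softmax(seg_s_warped, dim=1), reduction="batchmean")
    loss_photo = (img_s_warped - frame_t).abs().mean()

    loss = loss_seg + loss_photo
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

The core signal matches the abstract's description: predictions and pixels from one frame, warped into an adjacent frame using the estimated depth and camera motion, should be consistent with that frame's own predictions, yielding a training loss from unlabeled monocular video alone.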

        Information & Contributors

        Information

        Published In

        cover image Guide Proceedings
        2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
        Sep 2021
        7915 pages

        Publisher

        IEEE Press

        Publication History

        Published: 27 September 2021

        Qualifiers

        • Research-article

        Contributors
