DOI: 10.1109/IROS51168.2021.9636330

Bootstrapped Self-Supervised Training with Monocular Video for Semantic Segmentation and Depth Estimation

Published: 27 September 2021

Abstract

For a robot deployed in the world, it is desirable to be able to learn autonomously and improve on its initial, pre-set knowledge. We formalize this as a bootstrapped self-supervised learning problem: a system is first bootstrapped with supervised training on a labeled dataset, and we then look for a self-supervised training method that can improve the system beyond the supervised baseline using only unlabeled data. In this work, we leverage temporal consistency between frames in monocular video to perform this bootstrapped self-supervised training. We show that a well-trained, state-of-the-art semantic segmentation network can be further improved through our method. In addition, we show that the bootstrapped self-supervised training framework can help a network learn depth estimation better than either pure supervised training or pure self-supervised training.
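As a rough illustration of the approach described in the abstract, the following is a minimal sketch (in PyTorch, not the authors' released code) of what the self-supervised stage could look like after a segmentation network, a depth network, and a pose network have been bootstrapped with supervised training. The warping routine, the specific consistency losses, and all names (seg_net, depth_net, pose_net, inverse_warp) are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def inverse_warp(src, depth_tgt, T_tgt_to_src, K):
    # Warp a source-frame tensor src (B,C,H,W) into the target frame using the
    # predicted target depth (B,1,H,W), a 4x4 relative pose (B,4,4), and camera
    # intrinsics K (B,3,3): the standard view-synthesis warp used in
    # self-supervised monocular depth learning.
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    # Back-project target pixels to 3D points, then move them into the source camera.
    cam = (torch.inverse(K) @ pix.expand(B, -1, -1)) * depth_tgt.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_src = (T_tgt_to_src @ cam_h)[:, :3]
    # Project into the source image plane and sample the source tensor there.
    proj = K @ cam_src
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src, grid, padding_mode="border", align_corners=True)

def self_supervised_step(seg_net, depth_net, pose_net, frame_t, frame_s, K, opt):
    # One step on an unlabeled video pair (target frame t, nearby source frame s),
    # run after the networks were bootstrapped with supervised training.
    seg_t = seg_net(frame_t)          # class logits for the target frame
    seg_s = seg_net(frame_s)          # class logits for the source frame
    depth_t = depth_net(frame_t)      # per-pixel depth for the target frame
    T = pose_net(frame_t, frame_s)    # relative camera motion, target -> source (4x4)

    # Temporal consistency: source-frame predictions and pixels, warped into the
    # target view, should agree with the target frame's predictions and pixels.
    seg_s_warped = inverse_warp(seg_s, depth_t, T, K)
    img_s_warped = inverse_warp(frame_s, depth_t, T, K)

    loss_seg = F.kl_div(F.log_softmax(seg_t, dim=1),
                        F.softmax(seg_s_warped, dim=1), reduction="batchmean")
    loss_photo = (img_s_warped - frame_t).abs().mean()

    loss = loss_seg + loss_photo
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

The core signal matches the abstract's description: predictions and pixels from one frame, warped into an adjacent frame using the estimated depth and camera motion, should be consistent with that frame's own predictions, yielding a training loss from unlabeled monocular video alone.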

        Information & Contributors

        Information

        Published In

        cover image Guide Proceedings
        2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
        Sep 2021
        7915 pages

        Publisher

        IEEE Press

        Publication History

        Published: 27 September 2021

        Qualifiers

        • Research-article

        Contributors
