No more shortcuts: realizing the potential of temporal self-supervision
Article No.: 165, Pages 1481 - 1491
Abstract
Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) the tasks are too simple, resulting in saturated training performance, and 2) shortcuts based on local appearance statistics hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. Our extensive experiments show state-of-the-art performance across 10 video understanding datasets, illustrating the generalization ability and robustness of our learned video representations. Project Page: https://daveishan.github.io/nms-webpage.
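The difference between clip-level and frame-level supervision can be made concrete with a small sketch. The abstract does not specify the exact pretext tasks, so the example below assumes a hypothetical "displaced frame" task: a few frames of a clip are moved out of their original position, and the target is either one label for the whole clip (clip-level) or one label per frame (frame-level). The function name `make_frame_level_example` and its parameters are illustrative and are not taken from the paper.

```python
import random


def make_frame_level_example(num_frames, num_shuffled, rng):
    """Build targets for a hypothetical displaced-frame pretext task.

    Returns (frame_order, frame_labels, clip_label) where
      frame_order[i]  is the index of the source frame shown at position i,
      frame_labels[i] is 1 if position i shows a displaced frame, else 0,
      clip_label      is 1 if any frame in the clip was displaced, else 0.
    """
    order = list(range(num_frames))
    if num_shuffled >= 2:
        # Pick distinct positions and rotate their contents by one step,
        # so every chosen position ends up showing a different frame.
        positions = rng.sample(range(num_frames), num_shuffled)
        sources = positions[1:] + positions[:1]
        for pos, src in zip(positions, sources):
            order[pos] = src
    # Frame-level target: one binary label per position.
    frame_labels = [int(order[i] != i) for i in range(num_frames)]
    # Clip-level target: a single binary label for the whole clip.
    clip_label = int(any(frame_labels))
    return order, frame_labels, clip_label


if __name__ == "__main__":
    rng = random.Random(0)
    order, frame_labels, clip_label = make_frame_level_example(8, 3, rng)
    print("frame order :", order)
    print("frame labels:", frame_labels)
    print("clip label  :", clip_label)
```

In this sketch the clip-level target collapses to a single bit, whereas the frame-level target requires locating every displaced frame, which is one way to read the abstract's claim that frame-level formulations are more challenging.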
Information
Published In
February 2024
23861 pages
ISBN:978-1-57735-887-9
Copyright © 2024 Association for the Advancement of Artificial Intelligence.
Sponsors
- Association for the Advancement of Artificial Intelligence
Publisher
AAAI Press
Publication History
Published: 20 February 2024
Qualifiers
- Research-article
- Research
- Refereed limited