Self-Supervised Video Representation Learning via Capturing Semantic Changes Indicated by Saccades

Published: 30 June 2023

Abstract

In this paper, we propose a self-supervised video representation learning (video SSL) method inspired by findings in cognitive science and neuroscience on human visual perception. Unlike previous methods that focus on the inherent properties of videos, we argue that humans learn to perceive the world, in the absence of labels, through self-awareness of semantic changes or consistency in the input stimuli, accompanied by representation reorganization during post-learning rest periods. To this end, we first exploit the presence of saccades as an indicator of semantic changes within a contrastive learning framework, mimicking self-awareness in human representation learning. The saccades are generated by alternating fixations along a predicted scanpath. Second, we model the semantic consistency within an eye fixation by minimizing the error between the predicted and true states at another time point. Finally, we incorporate prototypical contrastive learning to reorganize the learned representations and strengthen the associations among perceptually similar ones. Compared with previous video SSL solutions, our method captures finer-grained semantics from video instances and further associates similar instances with one another. Experiments show that the proposed bio-inspired video SSL method significantly improves Top-1 video retrieval accuracy on UCF101 and achieves superior performance on downstream tasks such as action recognition under comparable settings.
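To give a concrete picture of the objective described above, the sketch below illustrates in PyTorch how the saccade-indicated contrastive term and the prototypical term could be written. It is a minimal sketch under assumed choices: the names (info_nce, z_same_fixation, z_after_saccade, prototypes) and the plain InfoNCE and prototype-classification formulations are illustrative assumptions, not the authors' released code or exact losses.

import torch
import torch.nn.functional as F


def info_nce(anchor, positive, negatives, temperature=0.07):
    # Standard InfoNCE: pull the anchor toward its positive, push it away from negatives.
    # anchor, positive: (B, D); negatives: (B, K, D).
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)     # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)         # index 0 is the positive
    return F.cross_entropy(logits, labels)


# Hypothetical usage: clips from the same fixation act as positives (semantic
# consistency), clips separated by a saccade act as negatives (semantic change),
# and a prototypical term pulls each clip toward its assigned cluster centroid.
B, K, D, C = 8, 16, 128, 32
z_anchor = torch.randn(B, D)             # embedding of a clip at one fixation
z_same_fixation = torch.randn(B, D)      # clip from the same fixation (positive)
z_after_saccade = torch.randn(B, K, D)   # clips after saccades (negatives)
prototypes = torch.randn(C, D)           # centroids from offline clustering
assignments = torch.randint(0, C, (B,))  # each clip's assigned prototype

loss_saccade = info_nce(z_anchor, z_same_fixation, z_after_saccade)
proto_logits = F.normalize(z_anchor, dim=-1) @ F.normalize(prototypes, dim=-1).T / 0.07
loss_prototype = F.cross_entropy(proto_logits, assignments)
loss = loss_saccade + loss_prototype

In this reading, the saccade-driven term captures fine-grained semantic changes between fixations, while the prototype term reorganizes the learned embeddings around perceptually similar clusters, mirroring the three ingredients listed in the abstract.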


Published In

IEEE Transactions on Circuits and Systems for Video Technology, Volume 34, Issue 8
Aug. 2024
1186 pages

Publisher

IEEE Press

Publication History

Published: 30 June 2023

Qualifiers

  • Research-article
