ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency

Huang, Deng; Wu, Wenhao; Hu, Weiwen; Liu, Xu; He, Dongliang; Wu, Zhihua; Wu, Xiangmiao; Tan, Mingkui; Ding, Errui

Computer Science > Computer Vision and Pattern Recognition

arXiv:2106.02342v1 (cs)

[Submitted on 4 Jun 2021 (this version), latest version 17 Aug 2021 (v2)]

Abstract: We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information. Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other, but they require careful treatment of negative pairs, relying on large batch sizes, memory banks, extra modalities, or customized mining strategies, which inevitably include noisy data. In this paper, we observe that consistency between positive samples is the key to learning robust video representations. Specifically, we propose two tasks to learn appearance consistency and speed consistency, respectively. The appearance consistency task aims to maximize the similarity between two clips of the same video with different playback speeds. The speed consistency task aims to maximize the similarity between two clips with the same playback speed but different appearance information. We show that joint optimization of the two tasks consistently improves performance on downstream tasks, e.g., action recognition and video retrieval. Remarkably, for action recognition on the UCF-101 dataset, we achieve 90.8% accuracy without using any additional modalities or negative pairs for unsupervised pretraining, outperforming the ImageNet supervised pre-trained model. Code and models will be made available.
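The two consistency tasks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so the `1 - cosine similarity` form, the `weight` parameter, modeling playback speed as a frame-sampling stride, and all function names below are assumptions made for illustration. Embeddings are plain lists of floats standing in for the output of a hypothetical video encoder.

```python
import math
import random


def sample_clip(num_frames, clip_len, speed, rng):
    """Pick frame indices for one clip; `speed` is the stride between frames.

    Assumption: playback speed is modeled as a sampling stride, so speed=2
    skips every other frame of the source video.
    """
    span = (clip_len - 1) * speed + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip length and speed")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * speed for i in range(clip_len)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def consistency_loss(u, v):
    """Assumed loss: 1 - cosine similarity, minimal when u and v align."""
    return 1.0 - cosine_similarity(u, v)


def joint_loss(app_a, app_b, spd_a, spd_b, weight=1.0):
    """Sum of the two positive-pair losses; no negative pairs are involved.

    app_a, app_b: embeddings of two clips from the same video at different
                  speeds (appearance consistency).
    spd_a, spd_b: embeddings of two clips at the same speed with different
                  appearance (speed consistency).
    weight:       hypothetical balancing factor, not taken from the paper.
    """
    return consistency_loss(app_a, app_b) + weight * consistency_loss(spd_a, spd_b)


# Usage: two clips of one 64-frame video at different playback speeds.
rng = random.Random(0)
slow_clip = sample_clip(num_frames=64, clip_len=8, speed=1, rng=rng)
fast_clip = sample_clip(num_frames=64, clip_len=8, speed=2, rng=rng)
```

In this sketch both terms pull positive pairs together and nothing pushes other instances apart, which mirrors the abstract's claim that the method does not need negative pairs. How the actual model avoids collapsed representations under such an objective is not stated in the abstract and is not addressed here.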

Comments:	Technical report
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2106.02342 [cs.CV]
	(or arXiv:2106.02342v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2106.02342

Submission history

From: Wenhao Wu
[v1] Fri, 4 Jun 2021 08:44:50 UTC (1,854 KB)
[v2] Tue, 17 Aug 2021 09:11:37 UTC (1,848 KB)
