Self-supervised video pretraining yields human-aligned visual representations

Parthasarathy, Nikhil; Eslami, S. M. Ali; Carreira, João; Hénaff, Olivier J.

arXiv:2210.06433 (cs)

[Submitted on 12 Oct 2022 (v1), last revised 25 Jul 2023 (this version, v2)]

Abstract: Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.

Comments:	Technical report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2210.06433 [cs.CV]
	(or arXiv:2210.06433v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.06433

Submission history

From: Olivier Hénaff
[v1] Wed, 12 Oct 2022 17:30:12 UTC (2,785 KB)
[v2] Tue, 25 Jul 2023 16:43:33 UTC (6,896 KB)
