DOI: 10.1145/3503161.3547909

In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Published: 10 October 2022

Abstract

In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled features of a ResNet) learns robust high-level semantics but is sub-optimal because pixel-level features are only optimized implicitly. Pixel-level optimization, by contrast, is more explicit, but it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two levels of optimization complementarily in a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and exploits the naturally designed class tokens and patch tokens of the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on the class tokens, which helps capture high-level semantics; we call this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on the patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we call this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
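The abstract describes three training signals: an image-level "out-generative" objective on class tokens, a pixel-level "in-generative" (masked image modeling) objective on patch tokens, and an inter-frame consistency constraint at the feature and affinity-matrix levels. The sketch below is a simplified, hedged illustration of how such losses could be written in PyTorch; the function names, tensor shapes, temperature, and the student/teacher (momentum-encoder) split are assumptions made for illustration and are not taken from the paper or its released code.

```python
# Illustrative sketch only (not the authors' implementation) of the three
# kinds of losses described in the abstract. All names and shapes are assumed.
import torch
import torch.nn.functional as F


def out_generative_loss(cls_local, cls_global, temp=0.1):
    """Image-level term: the class token of a local crop predicts the
    (detached) class-token distribution of the global view, encouraging
    high-level semantics ("out-view imagination")."""
    log_p = F.log_softmax(cls_local / temp, dim=-1)
    q = F.softmax(cls_global.detach() / temp, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


def in_generative_loss(pred_patches, target_patches, mask):
    """Pixel-level term: masked image modeling, i.e. a reconstruction loss
    computed only on the masked patch positions ("in-view" recovery)."""
    err = (pred_patches - target_patches) ** 2      # (B, N, C)
    err = err.mean(dim=-1)                          # per-patch error, (B, N)
    return (err * mask).sum() / mask.sum().clamp(min=1)


def affinity_consistency_loss(stu_t, stu_t1, tea_t, tea_t1):
    """Temporal term: the frame-t -> frame-(t+1) affinity matrix built from
    student features should match the one built from detached teacher
    features (consistency at the affinity-matrix level)."""
    def affinity(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        return torch.einsum("bnc,bmc->bnm", a, b).softmax(dim=-1)

    a_student = affinity(stu_t, stu_t1)
    a_teacher = affinity(tea_t, tea_t1).detach()
    return F.mse_loss(a_student, a_teacher)


if __name__ == "__main__":
    B, N, C = 2, 196, 256                       # batch, patches, channels
    cls_local, cls_global = torch.randn(B, C), torch.randn(B, C)
    pred, target = torch.randn(B, N, C), torch.randn(B, N, C)
    mask = (torch.rand(B, N) < 0.6).float()     # e.g. 60% of patches masked
    feats = [torch.randn(B, N, C) for _ in range(4)]

    loss = (out_generative_loss(cls_local, cls_global)
            + in_generative_loss(pred, target, mask)
            + affinity_consistency_loss(*feats))
    print(loss.item())
```

In practice these terms would be weighted and summed into a single training objective; the weights and the exact target (e.g. raw pixels, tokenized patches, or teacher features) are design choices the paper specifies and this sketch does not.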

Supplementary Material

MP4 File (MM22-fp0666.mp4)
Presentation video for "In-N-Out Generative Learning for Dense Unsupervised Video Segmentation".




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dense prediction
  2. generative learning
  3. self-supervised learning
  4. unsupervised video object segmentation

Qualifiers

  • Research-article

Funding Sources

  • the Fundamental Research Funds for the Central Universities

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

  • (2024) JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. 2024 IEEE Conference on Artificial Intelligence (CAI), 762-769. DOI: 10.1109/CAI59869.2024.00146. Online publication date: 25-Jun-2024.
  • (2024) Bridging spatiotemporal feature gap for video salient object detection. Knowledge-Based Systems, Vol. 304, 112505. DOI: 10.1016/j.knosys.2024.112505. Online publication date: Nov-2024.
  • (2024) VPE-WSVAD: Visual prompt exemplars for weakly-supervised video anomaly detection. Knowledge-Based Systems, Vol. 299, 111978. DOI: 10.1016/j.knosys.2024.111978. Online publication date: Sep-2024.
  • (2023) Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 22577-22588. DOI: 10.1109/ICCV51070.2023.02069. Online publication date: 1-Oct-2023.
  • (2023) TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3521-3532. DOI: 10.1109/ICCV51070.2023.00328. Online publication date: 1-Oct-2023.
