DOI: 10.1145/3503161.3547909

In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Published: 10 October 2022

Abstract

In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm and optimize at either the image level or the pixel level. Image-level optimization (e.g., on the spatially pooled features of a ResNet) learns robust high-level semantics but is sub-optimal because pixel-level features are only optimized implicitly. Pixel-level optimization, by contrast, is more explicit, but it is sensitive to the visual quality of the training data and is not robust to object deformation. To perform these two levels of optimization complementarily in a unified framework, we propose In-aNd-Out (INO) generative learning, which takes a purely generative perspective and exploits the naturally designed class tokens and patch tokens of the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on the class tokens, which helps capture high-level semantics; we call this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on the patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we call this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that INO outperforms previous state-of-the-art methods by significant margins.
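The abstract describes three training signals: an image-level "out-generative" objective on class tokens, a pixel-level "in-generative" (masked image modeling) objective on patch tokens, and an inter-frame consistency constraint at the feature and affinity-matrix levels. The sketch below is a simplified, hedged illustration of how such losses could be written in PyTorch; the function names, tensor shapes, temperature, and the student/teacher (momentum-encoder) split are assumptions made for illustration and are not taken from the paper or its released code.

```python
# Illustrative sketch only (not the authors' implementation) of the three
# kinds of losses described in the abstract. All names and shapes are assumed.
import torch
import torch.nn.functional as F


def out_generative_loss(cls_local, cls_global, temp=0.1):
    """Image-level term: the class token of a local crop predicts the
    (detached) class-token distribution of the global view, encouraging
    high-level semantics ("out-view imagination")."""
    log_p = F.log_softmax(cls_local / temp, dim=-1)
    q = F.softmax(cls_global.detach() / temp, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


def in_generative_loss(pred_patches, target_patches, mask):
    """Pixel-level term: masked image modeling, i.e. a reconstruction loss
    computed only on the masked patch positions ("in-view" recovery)."""
    err = (pred_patches - target_patches) ** 2      # (B, N, C)
    err = err.mean(dim=-1)                          # per-patch error, (B, N)
    return (err * mask).sum() / mask.sum().clamp(min=1)


def affinity_consistency_loss(stu_t, stu_t1, tea_t, tea_t1):
    """Temporal term: the frame-t -> frame-(t+1) affinity matrix built from
    student features should match the one built from detached teacher
    features (consistency at the affinity-matrix level)."""
    def affinity(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        return torch.einsum("bnc,bmc->bnm", a, b).softmax(dim=-1)

    a_student = affinity(stu_t, stu_t1)
    a_teacher = affinity(tea_t, tea_t1).detach()
    return F.mse_loss(a_student, a_teacher)


if __name__ == "__main__":
    B, N, C = 2, 196, 256                       # batch, patches, channels
    cls_local, cls_global = torch.randn(B, C), torch.randn(B, C)
    pred, target = torch.randn(B, N, C), torch.randn(B, N, C)
    mask = (torch.rand(B, N) < 0.6).float()     # e.g. 60% of patches masked
    feats = [torch.randn(B, N, C) for _ in range(4)]

    loss = (out_generative_loss(cls_local, cls_global)
            + in_generative_loss(pred, target, mask)
            + affinity_consistency_loss(*feats))
    print(loss.item())
```

In practice these terms would be weighted and summed into a single training objective; the weights and the exact target (e.g. raw pixels, tokenized patches, or teacher features) are design choices the paper specifies and this sketch does not.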

Supplementary Material

MP4 File (MM22-fp0666.mp4)
Presentation video for "In-N-Out Generative Learning for Dense Unsupervised Video Segmentation".




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dense prediction
  2. generative learning
  3. self-supervised learning
  4. unsupervised video object segmentation

Qualifiers

  • Research-article

Funding Sources

  • the Fundamental Research Funds for the Central Universities

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Cited By

  • (2024) JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. 2024 IEEE Conference on Artificial Intelligence (CAI), 762-769. DOI: 10.1109/CAI59869.2024.00146. Online publication date: 25-Jun-2024.
  • (2024) Bridging spatiotemporal feature gap for video salient object detection. Knowledge-Based Systems, Vol. 304, 112505. DOI: 10.1016/j.knosys.2024.112505. Online publication date: Nov-2024.
  • (2024) VPE-WSVAD: Visual prompt exemplars for weakly-supervised video anomaly detection. Knowledge-Based Systems, Vol. 299, 111978. DOI: 10.1016/j.knosys.2024.111978. Online publication date: Sep-2024.
  • (2023) Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 22577-22588. DOI: 10.1109/ICCV51070.2023.02069. Online publication date: 1-Oct-2023.
  • (2023) TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3521-3532. DOI: 10.1109/ICCV51070.2023.00328. Online publication date: 1-Oct-2023.
