
Complementary Coarse-to-Fine Matching for Video Object Segmentation

Published: 12 July 2023

Abstract

Semi-supervised Video Object Segmentation (VOS) needs to establish pixel-level correspondences between a video frame and preceding segmented frames to leverage their segmentation clues. Most works rely on features at a single scale to establish those correspondences, e.g., by performing dense matching with Convolutional Neural Network (CNN) features from a deep layer. In contrast, this work explores complementary features at different scales to pursue more robust feature matching. A coarse feature from a deep layer is first adopted to obtain coarse pixel-level correspondences. We then evaluate the quality of those correspondences and select the pixels with low-quality correspondences for fine-scale feature matching. Segmentation clues from previous frames are propagated through both the coarse and the fine-scale correspondences, and are fused with appearance features for object segmentation. Compared with previous works, this coarse-to-fine matching scheme is more robust to distractions by similar objects and better preserves object details. Because the fine-scale matching is sparse, inference remains fast. On popular VOS datasets including DAVIS and YouTube-VOS, the proposed method shows promising performance compared with recent works.
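
The coarse-to-fine scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how correspondence quality is evaluated, so the cosine similarity, the softmax temperature, the use of the largest matching weight as a quality score, the threshold value, and all function names here are assumptions made for illustration. The sketch also assumes that coarse and fine features are indexed by the same pixel positions, and it omits the fusion with appearance features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature is an assumed value."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match(query_feats, memory_feats, memory_labels, indices=None):
    """Propagate memory labels to query pixels by soft feature matching.

    Returns {query index: (propagated label, quality)}, where quality is
    the largest matching weight (a stand-in for the paper's quality measure).
    """
    if indices is None:
        indices = range(len(query_feats))
    result = {}
    for i in indices:
        weights = softmax([cosine(query_feats[i], m) for m in memory_feats])
        label = sum(w * l for w, l in zip(weights, memory_labels))
        result[i] = (label, max(weights))
    return result


def coarse_to_fine(q_coarse, m_coarse, q_fine, m_fine, memory_labels,
                   threshold=0.6):
    """Dense coarse matching, then fine matching only on low-quality pixels."""
    coarse = match(q_coarse, m_coarse, memory_labels)
    uncertain = [i for i, (_, quality) in coarse.items() if quality < threshold]
    fine = match(q_fine, m_fine, memory_labels, indices=uncertain)
    return coarse, fine


# Toy data: two memory pixels (object = 1.0, background = 0.0).
memory_labels = [1.0, 0.0]
m_coarse = [[1.0, 0.0], [0.0, 1.0]]
m_fine = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Query pixel 0 is unambiguous at the coarse scale; pixel 1 is not.
q_coarse = [[1.0, 0.0], [1.0, 1.0]]
q_fine = [[1.0, 0.0, 0.0], [0.0, 0.1, 1.0]]

coarse, fine = coarse_to_fine(q_coarse, m_coarse, q_fine, m_fine, memory_labels)
print("pixels re-matched at the fine scale:", sorted(fine))
print("coarse labels:", {i: round(label, 3) for i, (label, _) in coarse.items()})
print("fine labels:", {i: round(label, 3) for i, (label, _) in fine.items()})
```

In this toy case only the ambiguous query pixel is sent to fine-scale matching, which is the property the abstract credits for keeping inference fast: the expensive fine-scale comparison runs on a sparse subset of pixels rather than on the whole frame.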

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6
      November 2023
      858 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3599695
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 July 2023
      Online AM: 16 May 2023
      Accepted: 29 April 2023
      Revised: 15 March 2023
      Received: 31 July 2022
      Published in TOMM Volume 19, Issue 6

      Author Tags

      1. Video object segmentation
      2. coarse-to-fine matching
      3. label propagation

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Science Foundation of China
      • The National Key Research and Development Program of China

      Article Metrics

      • Downloads (Last 12 months): 101
      • Downloads (Last 6 weeks): 8
      Reflects downloads up to 15 Oct 2024

      Cited By

      • (2024) Dual Dynamic Threshold Adjustment Strategy. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-18). DOI: 10.1145/3656047. Online publication date: 15-May-2024
      • (2024) Learning Nighttime Semantic Segmentation the Hard Way. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-23). DOI: 10.1145/3650032. Online publication date: 16-May-2024
      • (2024) Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-21). DOI: 10.1145/3633781. Online publication date: 11-Jan-2024
      • (2024) Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-24). DOI: 10.1145/3633334. Online publication date: 11-Jan-2024
      • (2024) Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia 26 (2367-2379). DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024
      • (2024) Dual-Adversarial Representation Disentanglement for Visible Infrared Person Re-Identification. IEEE Transactions on Information Forensics and Security 19 (2186-2200). DOI: 10.1109/TIFS.2023.3344289. Online publication date: 1-Jan-2024
      • (2023) Prediction With Visual Evidence: Sketch Classification Explanation via Stroke-Level Attributions. IEEE Transactions on Image Processing 32 (4393-4406). DOI: 10.1109/TIP.2023.3297404. Online publication date: 1-Jan-2023
      • (2023) Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8 (11509-11526). DOI: 10.1007/s11063-023-11386-y. Online publication date: 25-Aug-2023
      • (2023) VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29:5 (2575-2589). DOI: 10.1007/s00530-023-01124-8. Online publication date: 13-Jun-2023
