Complementary Coarse-to-Fine Matching for Video Object Segmentation

Authors:

Shiliang Zhang

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6

Article No.: 203, Pages 1–21

https://doi.org/10.1145/3596496

Published: 12 July 2023

Abstract

Semi-supervised Video Object Segmentation (VOS) needs to establish pixel-level correspondences between a video frame and the preceding segmented frames in order to leverage their segmentation clues. Most works rely on features at a single scale to establish those correspondences, e.g., by performing dense matching with Convolutional Neural Network (CNN) features from a deep layer. In contrast, this work explores complementary features at different scales to pursue more robust feature matching. A coarse feature from a deep layer is first adopted to obtain coarse pixel-level correspondences. We then evaluate the quality of those correspondences and select the pixels with low-quality correspondences for fine-scale feature matching. Segmentation clues from previous frames are propagated through both the coarse and the fine-scale correspondences, and are fused with appearance features for object segmentation. Compared with previous works, this coarse-to-fine matching scheme is more robust to distractions from similar objects and better preserves object details. Because the fine-scale matching is sparse, inference also remains fast. On popular VOS datasets, including DAVIS and YouTube-VOS, the proposed method shows promising performance compared with recent works.
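
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the coarse and fine features have already been resampled to the same set of pixels, uses cosine similarity followed by a softmax as the matching function, and takes the largest matching weight of each pixel as a stand-in for correspondence quality. The function names, the `refine_ratio` and `temperature` parameters, and the quality measure are all hypothetical. The fusion with appearance features and the segmentation decoder are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def propagate(query, memory, mask, temperature=0.1):
    """Propagate memory mask values to query pixels by soft matching.

    query:  (Nq, C) features of the current frame
    memory: (Nm, C) features of previously segmented frames
    mask:   (Nm,)   foreground probability of each memory pixel
    Returns the propagated mask (Nq,) and a per-pixel confidence (Nq,).
    """
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    weights = softmax((q @ m.T) / temperature, axis=1)  # (Nq, Nm)
    # Assumption: a peaked weight distribution signals a reliable match.
    return weights @ mask, weights.max(axis=1)


def coarse_to_fine(q_coarse, m_coarse, q_fine, m_fine, mask, refine_ratio=0.25):
    """Coarse matching for all pixels, fine matching only for uncertain ones."""
    coarse_pred, confidence = propagate(q_coarse, m_coarse, mask)

    # Pick the fraction of pixels with the lowest coarse confidence.
    k = max(1, int(round(refine_ratio * len(confidence))))
    low_quality = np.argsort(confidence)[:k]
    selected = np.zeros(len(confidence), dtype=bool)
    selected[low_quality] = True

    # Sparse fine-scale matching: only the selected rows are computed.
    fine_pred = np.zeros_like(coarse_pred)
    fine_values, _ = propagate(q_fine[selected], m_fine, mask)
    fine_pred[selected] = fine_values
    return coarse_pred, fine_pred, selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_query, n_memory = 16, 12
    coarse, fine, selected = coarse_to_fine(
        rng.normal(size=(n_query, 8)), rng.normal(size=(n_memory, 8)),
        rng.normal(size=(n_query, 4)), rng.normal(size=(n_memory, 4)),
        (rng.random(n_memory) > 0.5).astype(float),
    )
    print(coarse.shape, fine.shape, int(selected.sum()))
```

In this sketch only the selected fraction of pixels enters the second matching pass, so the fine-scale affinity matrix has far fewer rows than the coarse one. This is one plausible reading of why the abstract credits sparse fine-scale matching with fast inference.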

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision problems
        Tracking
        Video segmentation

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 19, Issue 6

November 2023

858 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3599695

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2023

Online AM: 16 May 2023

Accepted: 29 April 2023

Revised: 15 March 2023

Received: 31 July 2022

Published in TOMM Volume 19, Issue 6

Qualifiers

Research-article

Funding Sources

Natural Science Foundation of China
The National Key Research and Development Program of China

Citations

Cited By

Jiang X, Yao Y, Liu S, Shen F, Nie L, Hua X. (2024). Dual Dynamic Threshold Adjustment Strategy. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-18). DOI: 10.1145/3656047. Online publication date: 15-May-2024
https://dl.acm.org/doi/10.1145/3656047
Liu W, Cai J, Li Q, Liao C, Cao J, He S, Yu Y. (2024). Learning Nighttime Semantic Segmentation the Hard Way. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-23). DOI: 10.1145/3650032. Online publication date: 16-May-2024
https://dl.acm.org/doi/10.1145/3650032
Feng Z, Xu J, Ma L, Zhang S. (2024). Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-21). DOI: 10.1145/3633781. Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3633781
Fu F, Fang S, Chen W, Mao Z. (2024). Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-24). DOI: 10.1145/3633334. Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3633334
Jing S, Zhang H, Zeng P, Gao L, Song J, Shen H. (2024). Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia 26 (2367-2379). DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TMM.2023.3295098
Wei Z, Yang X, Wang N, Gao X. (2024). Dual-Adversarial Representation Disentanglement for Visible Infrared Person Re-Identification. IEEE Transactions on Information Forensics and Security 19 (2186-2200). DOI: 10.1109/TIFS.2023.3344289. Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1109/TIFS.2023.3344289
Liu S, Li J, Zhang H, Xu L, Cao X. (2023). Prediction With Visual Evidence: Sketch Classification Explanation via Stroke-Level Attributions. IEEE Transactions on Image Processing 32 (4393-4406). DOI: 10.1109/TIP.2023.3297404. Online publication date: 1-Jan-2023
https://dl.acm.org/doi/10.1109/TIP.2023.3297404
Chen S, Yang L, Hu Y. (2023). Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8 (11509-11526). DOI: 10.1007/s11063-023-11386-y. Online publication date: 25-Aug-2023
https://dl.acm.org/doi/10.1007/s11063-023-11386-y
Yang X, Wang X, Ye X, Li T. (2023). VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29:5 (2575-2589). DOI: 10.1007/s00530-023-01124-8. Online publication date: 13-Jun-2023
https://dl.acm.org/doi/10.1007/s00530-023-01124-8
