Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Published: 13 April 2023

Abstract

In this article, we present a fast real-time tangled memory network that segments objects effectively and efficiently for semi-supervised video object segmentation (VOS). We propose a tangled reference encoder and a memory bank organization mechanism based on a state estimator to fully utilize the mask features and to alleviate the memory overhead and computational burden brought by the unlimited memory bank used in many memory-based methods. First, the tangled memory network exploits the mask features, which uncover abundant object information such as edges and contours but are not fully explored in existing methods. Specifically, a tangled two-stream reference encoder is designed to extract and fuse the features from both the RGB frames and the predicted masks. Second, to indicate the quality of the predicted mask and feed back the online prediction state for organizing the memory bank, we devise a target state estimator that learns the IoU score between the predicted mask and the ground truth. Moreover, to accelerate the forward process and avoid memory overflow, we use a memory bank of fixed size to store historical features, organized by a new efficient mechanism based on the mask state score provided by the state estimator. We conduct comprehensive experiments on the public benchmarks DAVIS and YouTube-VOS, demonstrating that our method obtains competitive results while running at high speed (66 FPS on the DAVIS16-val set).
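The abstract does not spell out how the memory bank is organized, so the following is only a minimal sketch of the general idea: compute an IoU-style quality score for a mask and use it to decide what a fixed-size memory keeps. Everything named here (`mask_iou`, `MemoryEntry`, `FixedSizeMemoryBank`, the `pinned` flag, and the "evict the lowest-scoring entry" rule) is a hypothetical illustration, not the authors' mechanism. In the article the score is predicted by a learned state estimator at inference time; here it is simply passed in as a number.

```python
from dataclasses import dataclass
from typing import Any, List


def mask_iou(pred, gt) -> float:
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


@dataclass(eq=False)  # identity-based equality, so features are never compared
class MemoryEntry:
    frame_index: int
    score: float      # mask quality score (e.g. an estimated IoU)
    features: Any     # placeholder for the stored frame features
    pinned: bool = False  # assumption: pinned entries are never evicted


class FixedSizeMemoryBank:
    """Hypothetical fixed-capacity memory that keeps the highest-scoring frames."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries: List[MemoryEntry] = []

    def add(self, frame_index: int, score: float, features: Any,
            pinned: bool = False) -> bool:
        """Try to store a frame; return True if it was stored."""
        new = MemoryEntry(frame_index, score, features, pinned)
        if len(self.entries) < self.capacity:
            self.entries.append(new)
            return True
        # Bank is full: only unpinned entries are candidates for eviction.
        candidates = [e for e in self.entries if not e.pinned]
        if not candidates:
            return False
        weakest = min(candidates, key=lambda e: e.score)
        if score < weakest.score:
            return False  # the new frame is worse than everything stored
        self.entries.remove(weakest)
        self.entries.append(new)
        return True

    def frames(self) -> List[int]:
        return [e.frame_index for e in self.entries]


if __name__ == "__main__":
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
    bank = FixedSizeMemoryBank(capacity=2)
    bank.add(0, 1.0, "f0", pinned=True)  # annotated first frame, kept for good
    bank.add(1, 0.4, "f1")
    bank.add(2, 0.9, "f2")               # replaces frame 1
    bank.add(3, 0.2, "f3")               # rejected
    print(bank.frames())
```

Because the capacity is fixed, the cost of reading the memory stays bounded no matter how long the video is, which is the motivation the abstract gives for avoiding an unlimited memory bank.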

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 3
June 2023
451 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3587032
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 April 2023
Online AM: 23 February 2023
Accepted: 13 February 2023
Revised: 10 February 2023
Received: 02 August 2022
Published in TIST Volume 14, Issue 3

Author Tags

  1. Tangled memory network
  2. two stream
  3. state estimator
  4. memory organization mechanism

Qualifiers

  • Research-article

Funding Sources

  • Grant from The National Natural Science Foundation of China

Cited By

  • (2024) Real-time segmentation of short videos under VR technology in dynamic scenes. Journal of Intelligent Systems 33:1. DOI: 10.1515/jisys-2023-0289. Online publication date: 24-May-2024
  • (2024) Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking via Memory Networks. 2024 International Conference on 3D Vision (3DV), 842-851. DOI: 10.1109/3DV62453.2024.00050. Online publication date: 18-Mar-2024
  • (2024) LiDAR video object segmentation with dynamic kernel refinement. Pattern Recognition Letters 178, 21-27. DOI: 10.1016/j.patrec.2023.12.013. Online publication date: Feb-2024
  • (2024) Space–time recurrent memory network. Computer Vision and Image Understanding 241:C. DOI: 10.1016/j.cviu.2024.103943. Online publication date: 2-Jul-2024
  • (2024) Learning spatiotemporal relationships with a unified framework for video object segmentation. Applied Intelligence 54:8, 6138-6153. DOI: 10.1007/s10489-024-05486-y. Online publication date: 7-May-2024
