DOI: 10.1145/3474085.3475192

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Published: 17 October 2021

Abstract

Location and appearance are the key cues for video object segmentation, and many sources, such as RGB, depth, optical flow, and static saliency, can provide useful information about the objects. However, existing approaches utilize only RGB, or RGB together with optical flow. In this paper, we propose a novel multi-source fusion network for zero-shot video object segmentation. With the help of an interoceptive spatial attention module (ISAM), the spatial importance of each source is highlighted. Furthermore, we design a feature purification module (FPM) to filter out inter-source incompatible features. Through the ISAM and FPM, the multi-source features are effectively fused. In addition, we put forward an automatic predictor selection network (APS) that selects the better prediction from either the static saliency predictor or the moving object predictor, preventing over-reliance on failed results caused by low-quality optical flow maps. Extensive experiments on three challenging public benchmarks (i.e., DAVIS16, YouTube-Objects, and FBMS) show that the proposed model achieves compelling performance against the state-of-the-art methods. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS
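
The abstract names three components (ISAM, FPM, APS) without spelling out the data flow. Below is a minimal, hypothetical PyTorch sketch of the two underlying ideas, per-source spatial attention followed by gated fusion, and learned selection between two candidate predictions; all module names, layer choices, and shapes are illustrative assumptions of ours, not the authors' released implementation (see the linked repository for that).

    import torch
    import torch.nn as nn

    class SpatialAttentionFusion(nn.Module):
        # Toy analogue of the interoceptive spatial attention idea: each
        # source (e.g. RGB, depth, flow, static saliency) gets a per-pixel
        # importance map before summation, and a gate (loosely mirroring
        # the FPM) suppresses incompatible responses in the fused feature.
        def __init__(self, channels, num_sources=4):
            super().__init__()
            self.attn = nn.ModuleList(
                nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
                for _ in range(num_sources)
            )
            self.gate = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid()
            )

        def forward(self, sources):
            # sources: list of (B, C, H, W) feature maps, one per modality
            weighted = [a(f) * f for a, f in zip(self.attn, sources)]
            fused = torch.stack(weighted).sum(dim=0)
            return self.gate(fused) * fused

    class PredictorSelector(nn.Module):
        # Toy analogue of APS: score the two candidate masks jointly and
        # keep the one the selector judges more reliable for this frame.
        def __init__(self):
            super().__init__()
            self.score = nn.Sequential(
                nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2)
            )

        def forward(self, static_mask, moving_mask):
            # masks: (B, 1, H, W); logit 0 -> static, logit 1 -> moving
            logits = self.score(torch.cat([static_mask, moving_mask], dim=1))
            pick_moving = logits.argmax(dim=1).view(-1, 1, 1, 1).bool()
            return torch.where(pick_moving, moving_mask, static_mask)

    if __name__ == "__main__":
        feats = [torch.randn(2, 64, 32, 32) for _ in range(4)]
        fused = SpatialAttentionFusion(64)(feats)   # (2, 64, 32, 32)
        m1 = torch.sigmoid(torch.randn(2, 1, 32, 32))
        m2 = torch.sigmoid(torch.randn(2, 1, 32, 32))
        out = PredictorSelector()(m1, m2)           # (2, 1, 32, 32)
        print(fused.shape, out.shape)

In the paper, the selector is trained to guard against low-quality optical flow; this sketch only shows the mask-level selection mechanism, not any flow-quality features the authors may use.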

Supplementary Material

MP4 File (paper196.mp4)
This presentation covers, in order, the task background, existing methods, our solutions, and the experimental results. Through this detailed introduction, we gradually reveal the importance and necessity of multi-source fusion and automatic predictor selection for the video object segmentation task. We summarize our main contributions: we are the first to utilize multi-source information for static/moving object segmentation, the first to evaluate the quality of optical flow, and the first to achieve automatic predictor selection.




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. feature purification
    2. interoceptive spatial attention
    3. multi-source information
    4. predictor selection
    5. video object segmentation

    Qualifiers

    • Research-article

    Conference

    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall acceptance rate: 995 of 4,171 submissions (24%)



    Article Metrics

    • Downloads (last 12 months): 52
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 03 Oct 2024


    Cited By

    • Referring Image Segmentation With Fine-Grained Semantic Funneling Infusion. IEEE Transactions on Neural Networks and Learning Systems 35(10): 14727-14738 (Oct 2024). DOI: 10.1109/TNNLS.2023.3281372
    • Region Aware Video Object Segmentation With Deep Motion Modeling. IEEE Transactions on Image Processing 33: 2639-2651 (2024). DOI: 10.1109/TIP.2024.3381445
    • Visual Semantic Segmentation Based on Few/Zero-Shot Learning: An Overview. IEEE/CAA Journal of Automatica Sinica 11(5): 1106-1126 (May 2024). DOI: 10.1109/JAS.2023.123207
    • Fuzzy Boundary-Guided Network for Camouflaged Object Detection. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6 (Jul 2024). DOI: 10.1109/ICME57554.2024.10687409
    • Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22020-22029 (Jun 2024). DOI: 10.1109/CVPR52733.2024.02079
    • Salient object detection in egocentric videos. IET Image Processing 18(8): 2028-2037 (Mar 2024). DOI: 10.1049/ipr2.13080
    • SCPMan: Shape context and prior constrained multi-scale attention network for pancreatic segmentation. Expert Systems with Applications 252: 124070 (Oct 2024). DOI: 10.1016/j.eswa.2024.124070
    • Towards imbalanced motion: part-decoupling network for video portrait segmentation. Science China Information Sciences 67(7) (Jun 2024). DOI: 10.1007/s11432-023-4030-y
    • Adaptive Multi-Source Predictor for Zero-Shot Video Object Segmentation. International Journal of Computer Vision 132(8): 3232-3250 (Mar 2024). DOI: 10.1007/s11263-024-02024-8
    • Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks. In Proceedings of the 31st ACM International Conference on Multimedia, 8598-8607 (Oct 2023). DOI: 10.1145/3581783.3611827
