research-article

Multi-Source Fusion and Automatic Predictor Selection for Zero-Shot Video Object Segmentation

Published: 17 October 2021 Publication History

Abstract

Location and appearance are the key cues for video object segmentation. Many sources, such as RGB, depth, optical flow and static saliency, can provide useful information about the objects. However, existing approaches utilize only RGB, or RGB together with optical flow. In this paper, we propose a novel multi-source fusion network for zero-shot video object segmentation. An interoceptive spatial attention module (ISAM) highlights the spatial importance of each source. Furthermore, we design a feature purification module (FPM) to filter out features that are incompatible across sources. Together, the ISAM and FPM fuse the multi-source features effectively. In addition, we put forward an automatic predictor selection network (APS) that selects the better prediction from either the static saliency predictor or the moving object predictor, in order to prevent over-reliance on failed results caused by low-quality optical flow maps. Extensive experiments on three challenging public benchmarks (i.e., DAVIS$_{16}$, Youtube-Objects and FBMS) show that the proposed model achieves compelling performance against state-of-the-art methods. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS
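
The selection step attributed to the APS can be pictured with a minimal sketch. The abstract does not say how the APS reaches its decision, so everything named here is an assumption made for illustration: the function `select_prediction`, the scalar `flow_quality` score in [0, 1], and the `threshold` default are hypothetical and are not taken from the paper or its repository. The sketch only shows the fallback idea, which is to use the moving object prediction when the optical flow is judged reliable and the static saliency prediction otherwise.

```python
from typing import List

# A mask is a 2-D grid of foreground probabilities in [0, 1].
Mask = List[List[float]]


def select_prediction(
    static_pred: Mask,
    moving_pred: Mask,
    flow_quality: float,
    threshold: float = 0.5,
) -> Mask:
    """Pick one of two candidate masks based on a flow-quality score.

    flow_quality is a hypothetical score in [0, 1]; in the paper the
    decision comes from a learned network, which is not reproduced here.
    """
    if not 0.0 <= flow_quality <= 1.0:
        raise ValueError("flow_quality must lie in [0, 1]")
    # Trust the motion-based prediction only when the flow looks reliable;
    # otherwise fall back to the appearance-only (static saliency) result.
    return moving_pred if flow_quality >= threshold else static_pred


if __name__ == "__main__":
    static_mask = [[0.9, 0.1], [0.8, 0.2]]
    moving_mask = [[0.2, 0.7], [0.1, 0.9]]
    # Low-quality flow: the static saliency prediction is returned.
    chosen = select_prediction(static_mask, moving_mask, flow_quality=0.2)
    print(chosen is static_mask)
```

A hard per-frame choice, rather than a blend of the two masks, matches the abstract's wording of selecting "the better prediction"; whether the actual network decides per frame or per video is not stated in the abstract.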

Supplementary Material

MP4 File (paper196.mp4)
This presentation covers, in order, the task background, existing methods, our solutions, and experimental results. Through a detailed introduction, we gradually reveal the importance and necessity of multi-source fusion and automatic predictor selection for the video segmentation task. We summarize our main contributions: we are the first to utilize multi-source information to achieve static / moving object segmentation, the first to evaluate the quality of optical flow, and the first to achieve automatic predictor selection.




    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature purification
    2. interoceptive spatial attention
    3. multi-source information
    4. predictor selection
    5. video object segmentation

    Qualifiers

    • Research-article

    Conference

    MM '21
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 46
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2025) Multi-scale and contrastive learning for pediatric chest radiograph classification tasks. Displays 87 (102951). DOI: 10.1016/j.displa.2024.102951. Online publication date: Apr-2025
    • (2024) ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12 (9205-9220). DOI: 10.1109/TPAMI.2024.3417329. Online publication date: Dec-2024
    • (2024) Referring Image Segmentation With Fine-Grained Semantic Funneling Infusion. IEEE Transactions on Neural Networks and Learning Systems 35:10 (14727-14738). DOI: 10.1109/TNNLS.2023.3281372. Online publication date: Oct-2024
    • (2024) Region Aware Video Object Segmentation With Deep Motion Modeling. IEEE Transactions on Image Processing 33 (2639-2651). DOI: 10.1109/TIP.2024.3381445. Online publication date: 2024
    • (2024) Visual Semantic Segmentation Based on Few/Zero-Shot Learning: An Overview. IEEE/CAA Journal of Automatica Sinica 11:5 (1106-1126). DOI: 10.1109/JAS.2023.123207. Online publication date: May-2024
    • (2024) Fuzzy Boundary-Guided Network for Camouflaged Object Detection. 2024 IEEE International Conference on Multimedia and Expo (ICME) (1-6). DOI: 10.1109/ICME57554.2024.10687409. Online publication date: 15-Jul-2024
    • (2024) Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (22020-22029). DOI: 10.1109/CVPR52733.2024.02079. Online publication date: 16-Jun-2024
    • (2024) Salient object detection in egocentric videos. IET Image Processing 18:8 (2028-2037). DOI: 10.1049/ipr2.13080. Online publication date: 13-Mar-2024
    • (2024) SCPMan: Shape context and prior constrained multi-scale attention network for pancreatic segmentation. Expert Systems with Applications 252 (124070). DOI: 10.1016/j.eswa.2024.124070. Online publication date: Oct-2024
    • (2024) Towards imbalanced motion: part-decoupling network for video portrait segmentation. Science China Information Sciences 67:7. DOI: 10.1007/s11432-023-4030-y. Online publication date: 25-Jun-2024
