DOI: 10.1145/3503161.3548039
research-article

Bidirectionally Learning Dense Spatio-temporal Feature Propagation Network for Unsupervised Video Object Segmentation

Published: 10 October 2022
    Abstract

    Spatio-temporal feature representation is essential for accurate unsupervised video object segmentation, which requires a feature propagation paradigm that lets appearance and motion features fully exchange information across frames. However, existing solutions mainly focus on forward feature propagation from the preceding frame to the current one, using either the previous segmentation mask or frame-by-frame motion propagation. This ignores the bidirectional temporal feature interactions across all frames (including backward propagation from future frames to the current frame), which can enhance the spatio-temporal feature representation for segmentation prediction. To this end, this paper presents a novel Dense Bidirectional Spatio-temporal feature propagation Network (DBSNet) that fully integrates forward and backward propagation across all frames. Specifically, a dense bi-ConvLSTM module is first developed to propagate features across all frames in both the forward and backward directions. This captures multi-level spatio-temporal contextual information across all frames, producing a feature representation with a strong capability to discriminate the foreground from noisy backgrounds. Next, a spatio-temporal Transformer refinement module is designed to further enhance the propagated features by capturing spatio-temporal long-range dependencies among all frames. Afterwards, a Co-operative Direction-aware Graph Attention (Co-DGA) module is designed to integrate the propagated appearance-motion cues, yielding a strong spatio-temporal feature representation for segmentation mask prediction. The Co-DGA assigns proper attentional weights to neighboring points along the coordinate axis, so that the segmentation model selectively focuses on the most relevant neighbors. Extensive evaluations on four mainstream challenging benchmarks, DAVIS16, FBMS, DAVSOD, and MCL, demonstrate that the proposed DBSNet achieves favorable performance against state-of-the-art methods in terms of all evaluation metrics.
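
The bidirectional propagation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' dense bi-ConvLSTM: a toy `tanh` recurrence stands in for the ConvLSTM cell, each frame is reduced to a short feature vector, and the names `recurrent_pass`, `bidirectional_propagate`, `w_in`, and `w_state` are hypothetical. It only shows how a forward pass (past to current) and a backward pass (future to current) can be combined so that every frame sees context from both directions.

```python
import math

def recurrent_pass(frames, w_in=0.5, w_state=0.5):
    """Run a toy recurrence over per-frame feature vectors in the given order."""
    state = [0.0] * len(frames[0])
    outputs = []
    for feat in frames:
        # Toy update standing in for a ConvLSTM cell (hypothetical weights).
        state = [math.tanh(w_in * x + w_state * h) for x, h in zip(feat, state)]
        outputs.append(state)
    return outputs

def bidirectional_propagate(frames):
    """Concatenate forward (past -> current) and backward (future -> current) states."""
    forward = recurrent_pass(frames)
    # Reverse the sequence, propagate, then restore the original frame order.
    backward = recurrent_pass(frames[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Three frames, each with a 2-dimensional feature vector (illustrative values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = bidirectional_propagate(frames)
```

In this sketch the first half of each fused vector depends only on earlier frames and the second half only on later frames, so changing the last frame alters the backward half of the first frame's features but not its forward half.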

    Supplementary Material

    MP4 File (MM22-fp1209.mp4)
    We propose DBSNet, a framework for Unsupervised Video Object Segmentation (UVOS) that densely propagates cross-frame spatio-temporal features in both temporal directions. To capture long-range dependencies along both the spatial and temporal dimensions, we design a spatio-temporal Transformer refinement module that aggregates all positions over the current frame and its neighboring frames. Furthermore, we design a Co-DGA module to integrate appearance and motion cues, so that the model learns mutual knowledge from static and dynamic contexts. The Co-DGA can also extract the implicit structural information of foreground areas, contributing to a more reliable and fine-grained representation for UVOS. Extensive evaluations on four benchmark datasets demonstrate the advantage and effectiveness of the proposed approach, which substantially outperforms state-of-the-art methods.
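
Aggregating "all positions over the current frame and its neighboring frames" can be pictured as self-attention over tokens pooled from several frames. The sketch below is an assumption-laden illustration, not the paper's refinement module: it uses plain scaled dot-product attention with no learned projections, assumes every frame has the same number of positions, and the function names `softmax` and `spatiotemporal_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatiotemporal_attention(frames):
    """frames: list of frames, each a list of per-position feature vectors."""
    # Flatten all positions of all frames into one token sequence.
    tokens = [tok for frame in frames for tok in frame]
    dim = len(tokens[0])
    refined = []
    for query in tokens:
        # Scaled dot-product score against every position in every frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens.
        refined.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                        for d in range(dim)])
    # Restore the per-frame grouping (assumes equal positions per frame).
    per_frame = len(frames[0])
    return [refined[i:i + per_frame] for i in range(0, len(refined), per_frame)]

# Two frames with two positions each (illustrative values).
example = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
refined_example = spatiotemporal_attention(example)
```

Because every query attends to tokens from every frame, each refined feature mixes spatial context from its own frame with temporal context from the others, which is the long-range dependency the description refers to.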

    Cited By

    • (2024) Towards imbalanced motion: part-decoupling network for video portrait segmentation. Science China Information Sciences 67:7. DOI: 10.1007/s11432-023-4030-y. Online publication date: 25-Jun-2024
    • (2023) Hierarchical Graph Pattern Understanding for Zero-Shot Video Object Segmentation. IEEE Transactions on Image Processing 32 (5909-5920). DOI: 10.1109/TIP.2023.3326395. Online publication date: 1-Jan-2023
    • (2023) Hierarchical Co-Attention Propagation Network for Zero-Shot Video Object Segmentation. IEEE Transactions on Image Processing 32 (2348-2359). DOI: 10.1109/TIP.2023.3267244. Online publication date: 2023

            Published In

            MM '22: Proceedings of the 30th ACM International Conference on Multimedia
            October 2022
            7537 pages
            ISBN:9781450392037
            DOI:10.1145/3503161
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Author Tags

            1. bidirectional feature propagation
            2. deep learning
            3. direction-aware graph attention
            4. feature refinement
            5. unsupervised video object segmentation

            Qualifiers

            • Research-article

            Conference

            MM '22
            Acceptance Rates

            Overall Acceptance Rate 995 of 4,171 submissions, 24%

            Article Metrics

            • Downloads (Last 12 months): 79
            • Downloads (Last 6 weeks): 1
            Reflects downloads up to 27 Jul 2024

            Citations

            Cited By

            • (2023) Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (688-698). DOI: 10.1109/ICCV51070.2023.00070. Online publication date: 1-Oct-2023
