Towards imbalanced motion: part-decoupling network for video portrait segmentation

Yu, Tianshu; Xia, Changqun; Li, Jia

doi:10.1007/s11432-023-4030-y

Research Paper
Published: 25 June 2024

Volume 67, article number 172104, (2024)

Science China Information Sciences

Tianshu Yu¹,
Changqun Xia² &
Jia Li¹,²

Abstract

Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, existing VPS datasets are simple, which limits extensive research on the task. In this work, we propose MVPS, a new, intricate, large-scale multi-scene video portrait segmentation dataset consisting of 101 video clips in 7 scenario categories, in which 10843 sampled frames are finely annotated at the pixel level. The dataset has diverse scenes and complicated background environments and is, to the best of our knowledge, the most complex dataset in VPS. By observing a large number of portrait videos during dataset construction, we find that, owing to the joint structure of the human body, the motion of portraits is part-associated, so that different parts move relatively independently. That is, the motion of different parts of a portrait is imbalanced. To address this imbalance, an intuitive and reasonable idea is to decouple portraits into parts so that their different motion states can be better exploited. To this end, we propose a part-decoupling network (PDNet) for VPS. Specifically, we propose an inter-frame part-discriminated attention (IPDA) module that segments the portrait into parts without supervision and applies different attentiveness to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with imbalanced motion to extract part-discriminated correlations, so that portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance compared with state-of-the-art methods.
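
The idea of part-decoupled inter-frame attention described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' IPDA module, whose details are not given on this page. It is a hypothetical toy version under stated assumptions: parts are obtained by a softmax over similarities to K prototype vectors (`prototypes`), and "different attentiveness" per part is modelled as a separate softmax temperature per part (`temperatures`). All function and parameter names are invented for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_part_assignment(feats, prototypes):
    """Assign each pixel softly to K parts (hypothetical, label-free).

    feats: (N, C) pixel features; prototypes: (K, C) part prototypes.
    Returns an (N, K) matrix whose rows sum to 1.
    """
    return softmax(feats @ prototypes.T, axis=1)


def part_decoupled_attention(cur, ref, prototypes, temperatures):
    """Toy inter-frame attention computed separately for each part.

    cur: (N, C) features of the current frame.
    ref: (M, C) features of a reference frame.
    temperatures: (K,) assumed per-part attention sharpness.
    Returns the aggregated features (N, C) and the current-frame part masks (N, K).
    """
    c = cur.shape[1]
    cur_parts = soft_part_assignment(cur, prototypes)  # (N, K)
    ref_parts = soft_part_assignment(ref, prototypes)  # (M, K)
    sim = cur @ ref.T / np.sqrt(c)                     # (N, M) scaled dot product

    out = np.zeros_like(cur)
    for k, tau in enumerate(temperatures):
        # Bias attention toward reference pixels that belong to part k,
        # and sharpen or flatten it with the part-specific temperature.
        logits = sim / tau + np.log(ref_parts[:, k] + 1e-8)[None, :]
        attn = softmax(logits, axis=1)                 # (N, M), rows sum to 1
        # Each current pixel receives part k's result in proportion to
        # its own membership in part k.
        out += cur_parts[:, [k]] * (attn @ ref)
    return out, cur_parts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cur = rng.normal(size=(12, 8))      # 12 pixels, 8 channels
    ref = rng.normal(size=(10, 8))      # 10 reference pixels
    protos = rng.normal(size=(3, 8))    # 3 hypothetical parts
    out, parts = part_decoupled_attention(cur, ref, protos, np.array([0.5, 1.0, 2.0]))
    print(out.shape, parts.shape)
```

In this sketch a part with a small temperature attends sharply to a few reference pixels, while a part with a large temperature averages over many, which is one simple way to treat fast-moving and slow-moving parts differently.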

Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant Nos. 62132002, 62102206) and Major Key Project of PCL (Grant No. PCL2023A10-1).

Author information

Authors and Affiliations

State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Tianshu Yu & Jia Li
Peng Cheng Laboratory, Shenzhen, 518055, China
Changqun Xia & Jia Li

Corresponding authors

Correspondence to Changqun Xia or Jia Li.

About this article

Cite this article

Yu, T., Xia, C. & Li, J. Towards imbalanced motion: part-decoupling network for video portrait segmentation. Sci. China Inf. Sci. 67, 172104 (2024). https://doi.org/10.1007/s11432-023-4030-y

Received: 10 October 2023
Revised: 24 January 2024
Accepted: 12 April 2024
Published: 25 June 2024
DOI: https://doi.org/10.1007/s11432-023-4030-y
