Abstract
Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, existing VPS datasets are simple, which limits in-depth research on the task. In this work, we propose MVPS, a new, intricate, large-scale multi-scene video portrait segmentation dataset consisting of 101 video clips in 7 scenario categories, in which 10843 sampled frames are finely annotated at the pixel level. The dataset covers diverse scenes and complicated background environments and is, to the best of our knowledge, the most complex dataset in VPS. By observing a large number of portrait videos during dataset construction, we find that, owing to the jointed structure of the human body, the motion of portraits is part-associated: different parts move relatively independently. That is, motion is imbalanced across the different parts of a portrait. To address this imbalance, an intuitive and reasonable idea is to decouple portraits into parts so that their different motion states can be better exploited. To achieve this, we propose a part-decoupling network (PDNet) for VPS. Specifically, we propose an inter-frame part-discriminated attention (IPDA) module that segments the portrait into parts without supervision and applies different attention to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with imbalanced motion to extract part-discriminated correlations, so that portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance compared with state-of-the-art methods.
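The idea of attending to each part separately across frames can be illustrated with a minimal sketch. This is not the IPDA module itself, whose design the abstract does not specify. It assumes that parts are obtained by softly assigning each feature vector to a small set of part prototypes (a hypothetical stand-in for the unsupervised part segmentation), and that each part gets its own normalized attention over a reference frame. The names `soft_parts`, `part_attention`, and `prototypes` are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_parts(feats, prototypes):
    """Softly assign each feature vector to K parts (each row sums to 1).

    Assumption: parts are defined by similarity to K prototype vectors.
    """
    return [softmax([dot(f, p) for p in prototypes]) for f in feats]


def part_attention(cur, ref, prototypes):
    """Aggregate reference-frame features into each current-frame position,
    using one separately normalized attention map per part."""
    scale = math.sqrt(len(cur[0]))
    a = soft_parts(cur, prototypes)  # part membership, current frame
    b = soft_parts(ref, prototypes)  # part membership, reference frame
    out = []
    for i, q in enumerate(cur):
        # Inter-frame similarity between position i and every reference position.
        sim = softmax([dot(q, r) / scale for r in ref])
        agg = [0.0] * len(q)
        for k in range(len(prototypes)):
            # Restrict attention to reference positions belonging to part k.
            w = [sim[j] * b[j][k] for j in range(len(ref))]
            z = sum(w) or 1.0
            for j, r in enumerate(ref):
                # Weight the part-k result by how much position i belongs to part k.
                coeff = a[i][k] * w[j] / z
                for c in range(len(agg)):
                    agg[c] += coeff * r[c]
        out.append(agg)
    return out


if __name__ == "__main__":
    cur = [[1.0, 0.0], [0.0, 1.0]]          # two positions in the current frame
    ref = [[0.9, 0.1], [0.1, 0.9]]          # two positions in the reference frame
    prototypes = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical part prototypes
    print(part_attention(cur, ref, prototypes))
```

Because the attention weights are normalized within each part, a part with small motion and a part with large motion each receive their own attention distribution instead of competing within a single global one. Each output is a convex combination of reference-frame features.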
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62132002, 62102206) and the Major Key Project of PCL (Grant No. PCL2023A10-1).
Cite this article
Yu, T., Xia, C. & Li, J. Towards imbalanced motion: part-decoupling network for video portrait segmentation. Sci. China Inf. Sci. 67, 172104 (2024). https://doi.org/10.1007/s11432-023-4030-y
DOI: https://doi.org/10.1007/s11432-023-4030-y