Towards imbalanced motion: part-decoupling network for video portrait segmentation

  • Research Paper
  • Science China Information Sciences

Abstract

Video portrait segmentation (VPS), which aims to segment prominent foreground portraits from video frames, has received much attention in recent years. However, existing VPS datasets are relatively simple, which limits extensive research on the task. In this work, we propose MVPS, a new large-scale multi-scene video portrait segmentation dataset consisting of 101 video clips in 7 scenario categories, in which 10843 sampled frames are finely annotated at the pixel level. The dataset covers diverse scenes and complicated background environments and is, to the best of our knowledge, the most complex dataset in VPS. From observing a large number of portrait videos during dataset construction, we find that, owing to the jointed structure of the human body, portrait motion is part-associated, so different parts move relatively independently. That is, the motion of different portrait parts is imbalanced. To address this imbalance, an intuitive and reasonable idea is to decouple portraits into parts so that their different motion states can be better exploited. To this end, we propose a part-decoupling network (PDNet) for VPS. Specifically, we propose an inter-frame part-discriminated attention (IPDA) module, which segments the portrait into parts without supervision and applies different attention to the discriminative features specific to each part. In this way, appropriate attention can be imposed on portrait parts with imbalanced motion to extract part-discriminated correlations, so that portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance compared with state-of-the-art methods.
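
The idea of attending to each portrait part separately across frames can be illustrated with a small NumPy sketch. This is not the IPDA module from the paper, whose details are not given in the abstract. It is a minimal illustration under stated assumptions: parts are obtained by soft assignment to a set of hypothetical prototype vectors (`prototypes`), and each part gets its own attention map over a reference frame. The function name, parameters, and the prototype-based assignment are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def part_decoupled_attention(feat_cur, feat_ref, prototypes, temperature=1.0):
    """Illustrative per-part inter-frame attention (hypothetical, not the paper's IPDA).

    feat_cur:   (N, C) features of the current frame, N = H * W positions
    feat_ref:   (M, C) features of a reference frame
    prototypes: (K, C) assumed part prototypes (e.g. learned cluster centres)
    Returns the (N, C) aggregated features and the (N, K) soft part assignment.
    """
    # Soft, label-free assignment of every position to one of K parts.
    assign_cur = softmax(feat_cur @ prototypes.T / temperature, axis=1)  # (N, K)
    assign_ref = softmax(feat_ref @ prototypes.T / temperature, axis=1)  # (M, K)

    # Scaled dot-product similarity between current and reference positions.
    logits = feat_cur @ feat_ref.T / np.sqrt(feat_cur.shape[1])  # (N, M)

    out = np.zeros_like(feat_cur)
    for k in range(prototypes.shape[0]):
        # Bias attention toward reference positions that belong to part k,
        # so each part receives its own attention distribution.
        part_logits = logits + np.log(assign_ref[:, k] + 1e-8)[None, :]
        attn = softmax(part_logits, axis=1)  # (N, M), rows sum to 1
        # Weight the part-specific result by how much each current
        # position belongs to part k.
        out += assign_cur[:, [k]] * (attn @ feat_ref)
    return out, assign_cur


# Usage with random features: 6 current positions, 5 reference positions, 3 parts.
rng = np.random.default_rng(0)
cur = rng.normal(size=(6, 4))
ref = rng.normal(size=(5, 4))
protos = rng.normal(size=(3, 4))
fused, parts = part_decoupled_attention(cur, ref, protos)
```

Because every output row is a convex combination of reference features, the per-part weighting only changes where each position looks, not the range of the values it can return. A trained network would typically learn the prototypes and the feature projections rather than use raw features as done here.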

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62132002, 62102206) and the Major Key Project of PCL (Grant No. PCL2023A10-1).

Author information

Corresponding authors

Correspondence to Changqun Xia or Jia Li.

About this article

Cite this article

Yu, T., Xia, C. & Li, J. Towards imbalanced motion: part-decoupling network for video portrait segmentation. Sci. China Inf. Sci. 67, 172104 (2024). https://doi.org/10.1007/s11432-023-4030-y

  • DOI: https://doi.org/10.1007/s11432-023-4030-y