research-article

Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning

Published: 05 January 2023

Abstract

In real-world scenarios, a video commonly contains multiple actors, each performing its own action. Selectively localizing one specific actor and its action, both spatially and temporally, from a language query is therefore an important and challenging task. Existing fully supervised methods require large amounts of carefully annotated data and are sensitive to the class labels, so they cannot meet the needs of real-world applications. We therefore introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), in which only video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS in a weakly supervised setting. The task is extremely challenging for two reasons: the model must learn the complex interactions between two heterogeneous modalities, and it must learn a fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network that first follows the guidance of the sentence to localize the candidate region and then segments that region, yielding a selective segmentation. Specifically, we propose a novel tracker-based clip-level multiple instance learning paradigm to learn the matches between regions and sentences, which makes our two-stage network robust to the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are used in conjunction with the weak supervision to facilitate region-query matching. In extensive experiments, the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
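The multiple instance learning idea mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' model: it only assumes that a clip is a bag of region features, that the best-matching region stands in for the whole bag, and that a margin ranking loss separates a paired sentence from a mismatched one. The function names, the cosine similarity, and the margin value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bag_score(region_feats, sentence_feat):
    """MIL bag score: the best-matching region (instance) represents the clip (bag)."""
    return max(cosine(r, sentence_feat) for r in region_feats)


def mil_ranking_loss(region_feats, pos_sentence, neg_sentence, margin=0.2):
    """Hinge loss that wants the paired sentence to outscore a mismatched one by `margin`.

    Only clip-level pairing is used as supervision; no region is labeled.
    """
    pos = bag_score(region_feats, pos_sentence)
    neg = bag_score(region_feats, neg_sentence)
    return max(0.0, margin - pos + neg)


# Toy data (assumed): two candidate regions in one clip, 2-D features.
clip_regions = [[1.0, 0.0], [0.0, 1.0]]
paired = [1.0, 0.0]      # sentence that describes the clip
unpaired = [-1.0, 0.0]   # sentence taken from another clip

loss = mil_ranking_loss(clip_regions, paired, unpaired)
# Index of the region the paired sentence selects.
best = max(range(len(clip_regions)), key=lambda i: cosine(clip_regions[i], paired))
print(loss, best)
```

In this toy setup the region selected by the paired sentence is the one that a second stage would then segment; how the actual method builds regions with a tracker and uses temporal consistency and motion is described in the full article, not here.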

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 19, Issue 1
    January 2023
    505 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3572858
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 05 January 2023
    Online AM: 18 July 2022
    Accepted: 28 January 2022
    Revised: 25 January 2022
    Received: 05 November 2021
    Published in TOMM Volume 19, Issue 1

    Author Tags

    1. Multiple instance learning
    2. weakly supervised learning
    3. video actor-action segmentation
    4. cross-modal learning

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Italy-China Collaboration Project TALENT
    • National Natural Science Foundation of China
    • Youth Innovation Promotion Association CAS
    • Fundamental Research Funds for Central Universities
    • China Postdoctoral Science Foundation Funded Project

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 303
    • Downloads (Last 6 weeks): 17
    Reflects downloads up to 13 Sep 2024

    Cited By

    • (2024) Dual Dynamic Threshold Adjustment Strategy. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-18). DOI: 10.1145/3656047. Online publication date: 15-May-2024
    • (2024) Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9 (1-23). DOI: 10.1145/3649465. Online publication date: 24-Feb-2024
    • (2024) Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-24). DOI: 10.1145/3633334. Online publication date: 11-Jan-2024
    • (2024) Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia 26 (2367-2379). DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024
    • (2024) Dimensionality Reduction for Partial Label Learning: A Unified and Adaptive Approach. IEEE Transactions on Knowledge and Data Engineering 36:8 (3765-3782). DOI: 10.1109/TKDE.2024.3367721. Online publication date: Aug-2024
    • (2024) Dual-Adversarial Representation Disentanglement for Visible Infrared Person Re-Identification. IEEE Transactions on Information Forensics and Security 19 (2186-2200). DOI: 10.1109/TIFS.2023.3344289. Online publication date: 1-Jan-2024
    • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19:6 (1-22). DOI: 10.1145/3568312. Online publication date: 12-Jul-2023
    • (2023) Prediction With Visual Evidence: Sketch Classification Explanation via Stroke-Level Attributions. IEEE Transactions on Image Processing 32 (4393-4406). DOI: 10.1109/TIP.2023.3297404. Online publication date: 1-Jan-2023
    • (2023) Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8 (11509-11526). DOI: 10.1007/s11063-023-11386-y. Online publication date: 25-Aug-2023
    • (2023) VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29:5 (2575-2589). DOI: 10.1007/s00530-023-01124-8. Online publication date: 13-Jun-2023
