Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning

Authors:

Qingming Huang

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 19, Issue 1

Article No.: 12, Pages 1 - 22

https://doi.org/10.1145/3514250

Published: 05 January 2023

Abstract

Real-world videos commonly contain multiple actors performing different activities. Selectively localizing one specific actor and its action, both spatially and temporally, from a language query is therefore a vital and challenging task. Existing fully supervised methods require large amounts of elaborately annotated data and are sensitive to the class labels, so they cannot meet the needs of real-world applications. We therefore introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), in which only video-sentence pairs are provided. To the best of our knowledge, this work is the first to perform AAVSS under weak supervision. The task is extremely challenging for two reasons: the model must learn the complex interactions between two heterogeneous modalities, and it must learn fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network that first follows the sentence guidance to localize the candidate region and then segments that region, achieving selective segmentation. Specifically, we propose a novel tracker-based clip-level multiple instance learning paradigm to learn the matches between regions and sentences, which makes the two-stage network robust to the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are used together with the weak supervision to facilitate region-query matching. In extensive experiments, the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
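To make the idea of clip-level multiple instance learning more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each clip yields a bag of tracked region tubes, that a tube is summarized by averaging its per-frame features, that tubes are scored against a sentence embedding by cosine similarity, that the bag score is the maximum tube score, and that training uses a hinge ranking loss between a matched and a mismatched sentence. The abstract specifies none of these choices; all function names, the margin value, and the toy feature vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tube_feature(frame_features):
    """Average a tube's per-frame region features into one clip-level vector (assumed pooling)."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def bag_score(tubes, sentence):
    """Score a clip (a bag of tubes) against a sentence.

    Returns the highest tube-sentence similarity and the index of that tube.
    Max pooling over instances is an assumption made for this sketch.
    """
    scores = [cosine(tube_feature(t), sentence) for t in tubes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best

def mil_ranking_loss(tubes, matched, mismatched, margin=0.2):
    """Hinge loss that wants the matched sentence to outscore the mismatched one by `margin`."""
    pos, _ = bag_score(tubes, matched)
    neg, _ = bag_score(tubes, mismatched)
    return max(0.0, margin - pos + neg)

# Toy clip with two tubes, each tracked over two frames (2-D features for readability).
tubes = [
    [[1.0, 0.0], [0.9, 0.1]],  # tube 0
    [[0.0, 1.0], [0.1, 0.9]],  # tube 1
]
matched_sentence = [0.0, 1.0]      # hypothetical embedding of the paired sentence
mismatched_sentence = [-1.0, 0.0]  # hypothetical embedding of an unrelated sentence

best_score, best_index = bag_score(tubes, matched_sentence)
loss = mil_ranking_loss(tubes, matched_sentence, mismatched_sentence)
swapped_loss = mil_ranking_loss(tubes, mismatched_sentence, matched_sentence)
```

In this toy setup only the video-sentence pairing supervises the model: no tube is labeled as the correct one, and the tube selected by the maximum would be the candidate region handed to a later segmentation stage.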

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks → Activity recognition and understanding

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 19, Issue 1

January 2023

505 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3572858

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 January 2023

Online AM: 18 July 2022

Accepted: 28 January 2022

Revised: 25 January 2022

Received: 05 November 2021

Published in TOMM Volume 19, Issue 1

Qualifiers

Research-article
Refereed

Funding Sources

Italy-China Collaboration Project TALENT
National Natural Science Foundation of China
Youth Innovation Promotion Association CAS
Fundamental Research Funds for Central Universities
China Postdoctoral Science Foundation Funded Project

Cited By

Jiang X, Yao Y, Liu S, Shen F, Nie L, Hua X. (2024). Dual Dynamic Threshold Adjustment Strategy. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-18). Online publication date: 15-May-2024. https://dl.acm.org/doi/10.1145/3656047
Marques G, Boaro J, Busson A, Guedes A, Duarte J, Colcher S. (2024). Action Segmentation through Self-Supervised Video Features and Positional-Encoded Embeddings. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9 (1-23). Online publication date: 24-Feb-2024. https://dl.acm.org/doi/10.1145/3649465
Fu F, Fang S, Chen W, Mao Z. (2024). Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4 (1-24). Online publication date: 11-Jan-2024. https://dl.acm.org/doi/10.1145/3633334
Jing S, Zhang H, Zeng P, Gao L, Song J, Shen H. (2024). Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia 26 (2367-2379). Online publication date: 1-Jan-2024. https://dl.acm.org/doi/10.1109/TMM.2023.3295098
Yu X, Wang D, Zhang M. (2024). Dimensionality Reduction for Partial Label Learning: A Unified and Adaptive Approach. IEEE Transactions on Knowledge and Data Engineering 36:8 (3765-3782). Online publication date: Aug-2024. https://doi.org/10.1109/TKDE.2024.3367721
Wei Z, Yang X, Wang N, Gao X. (2024). Dual-Adversarial Representation Disentanglement for Visible Infrared Person Re-Identification. IEEE Transactions on Information Forensics and Security 19 (2186-2200). Online publication date: 1-Jan-2024. https://dl.acm.org/doi/10.1109/TIFS.2023.3344289
Zhang W, Qi Z, Wang S, Su C, Su L, Huang Q. (2023). Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19:6 (1-22). Online publication date: 12-Jul-2023. https://dl.acm.org/doi/10.1145/3568312
Liu S, Li J, Zhang H, Xu L, Cao X. (2023). Prediction With Visual Evidence: Sketch Classification Explanation via Stroke-Level Attributions. IEEE Transactions on Image Processing 32 (4393-4406). Online publication date: 1-Jan-2023. https://dl.acm.org/doi/10.1109/TIP.2023.3297404
Chen S, Yang L, Hu Y. (2023). Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion. Neural Processing Letters 55:8 (11509-11526). Online publication date: 25-Aug-2023. https://dl.acm.org/doi/10.1007/s11063-023-11386-y
Yang X, Wang X, Ye X, Li T. (2023). VMSG: a video caption network based on multimodal semantic grouping and semantic attention. Multimedia Systems 29:5 (2575-2589). Online publication date: 13-Jun-2023. https://dl.acm.org/doi/10.1007/s00530-023-01124-8
