DOI: 10.1145/3581783.3612096
Research Article · Open Access

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Published: 27 October 2023

Abstract

As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect salient events in given videos by recognizing the predicate-argument structures of events and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, most suffer from two key drawbacks: a lack of fine-grained spatial scene perception and insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, built on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event graph structure (termed ICE graph). We further perform iterative structure refinement to optimize the ICE graph, e.g., pruning noisy branches and building new informative connections, so that the overall structural representation best matches the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses offer a deeper understanding of the advantages of our method. The HostSG representation also shows great potential to facilitate a broader range of video understanding tasks.
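
To make the proposed representation concrete, below is a minimal Python sketch of what a HostSG-style structure could look like: object nodes carrying spatial relations within each frame, temporal links tracking entities across frames, and event nodes whose semantic roles are grounded in scene nodes (the scene-event mapping). This is an illustrative assumption based only on the abstract; all class and method names (HostSG, ObjectNode, map_scene_to_event, etc.) are hypothetical and not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    # NOTE: a hypothetical sketch of a HostSG-style graph, not the authors' code.

    @dataclass(frozen=True)
    class ObjectNode:
        node_id: int
        frame: int       # index of the frame the object appears in
        label: str       # e.g. "person", "ball"

    @dataclass
    class EventNode:
        verb: str                                                   # e.g. "throw"
        roles: Dict[str, ObjectNode] = field(default_factory=dict)  # role -> grounded scene node

    @dataclass
    class HostSG:
        objects: List[ObjectNode] = field(default_factory=list)
        spatial_edges: List[Tuple[ObjectNode, str, ObjectNode]] = field(default_factory=list)
        temporal_edges: List[Tuple[ObjectNode, ObjectNode]] = field(default_factory=list)
        events: List[EventNode] = field(default_factory=list)

        def add_object(self, frame: int, label: str) -> ObjectNode:
            node = ObjectNode(len(self.objects), frame, label)
            self.objects.append(node)
            return node

        def add_spatial(self, a: ObjectNode, rel: str, b: ObjectNode) -> None:
            assert a.frame == b.frame, "spatial relations hold within a single frame"
            self.spatial_edges.append((a, rel, b))

        def add_temporal(self, earlier: ObjectNode, later: ObjectNode) -> None:
            assert earlier.frame < later.frame, "temporal links point forward in time"
            self.temporal_edges.append((earlier, later))

        def map_scene_to_event(self, verb: str, roles: Dict[str, ObjectNode]) -> EventNode:
            # Scene-event mapping: ground each semantic role of an event in a scene node.
            event = EventNode(verb, dict(roles))
            self.events.append(event)
            return event

    # Toy example: "a person throws a ball" across two frames.
    g = HostSG()
    p0 = g.add_object(frame=0, label="person")
    b0 = g.add_object(frame=0, label="ball")
    p1 = g.add_object(frame=1, label="person")
    b1 = g.add_object(frame=1, label="ball")
    g.add_spatial(p0, "holding", b0)
    g.add_temporal(p0, p1)
    g.add_temporal(b0, b1)
    g.map_scene_to_event("throw", {"Arg0 (agent)": p1, "Arg1 (thing thrown)": b1})
    print(f"{len(g.objects)} objects, {len(g.events)} event(s)")

The point the sketch illustrates is the two-level hierarchy suggested by the abstract: the scene level (objects plus spatial and temporal edges) stays fine-grained, while the event level attaches role-labeled arguments directly to grounded scene nodes.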



Index Terms

  1. Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

    Recommendations

    Comments

Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. event extraction
    2. scene graph
    3. semantic role labeling
    4. video understanding

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 446
    • Downloads (Last 6 weeks): 70
    Reflects downloads up to 13 Jan 2025


    Cited By

    View all
    • (2024) Video-of-thought. Proceedings of the 41st International Conference on Machine Learning, 13109-13125. DOI: 10.5555/3692070.3692596. Online publication date: 21-Jul-2024.
    • (2024) The 2nd International Workshop on Deep Multi-modal Generation and Retrieval. Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 1-6. DOI: 10.1145/3689091.3690093. Online publication date: 28-Oct-2024.
    • (2024) Fine-grained Structural Hallucination Detection for Unified Visual Comprehension and Generation in Multimodal LLM. Proceedings of the 1st ACM Multimedia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models, 13-22. DOI: 10.1145/3689090.3689388. Online publication date: 28-Oct-2024.
    • (2024) XFashion: Character Animation Generation via Facial-enhanced and Granularly Controlling. Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 7-12. DOI: 10.1145/3688865.3689480. Online publication date: 28-Oct-2024.
    • (2024) Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Muti-modal Coreference Resolution. Proceedings of the 32nd ACM International Conference on Multimedia, 8576-8585. DOI: 10.1145/3664647.3680966. Online publication date: 28-Oct-2024.
    • (2024) PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis. Proceedings of the 32nd ACM International Conference on Multimedia, 7667-7676. DOI: 10.1145/3664647.3680705. Online publication date: 28-Oct-2024.
    • (2024) SpeechEE: A Novel Benchmark for Speech Event Extraction. Proceedings of the 32nd ACM International Conference on Multimedia, 10449-10458. DOI: 10.1145/3664647.3680669. Online publication date: 28-Oct-2024.
    • (2024) Action Scene Graphs for Long-Form Understanding of Egocentric Videos. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18622-18632. DOI: 10.1109/CVPR52733.2024.01762. Online publication date: 16-Jun-2024.
