DOI: 10.1145/3581783.3612096
Research Article · Open Access

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Published: 27 October 2023

Abstract

As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect salient events in given videos by recognizing the predicate-argument structures of events and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, most suffer from two key drawbacks: a lack of fine-grained spatial scene perception and insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, built on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event graph structure (termed ICE graph). We further perform iterative structure refinement to optimize the ICE graph, e.g., pruning noisy branches and building new informative connections, so that the overall structural representation best matches the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses offer a deeper understanding of the advantages of our method. The HostSG representation also shows great potential to facilitate a broader range of video understanding tasks.
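
To make the proposed representation concrete, below is a minimal Python sketch of what a HostSG-style structure could look like: object nodes carrying spatial relations within each frame, temporal links tracking entities across frames, and event nodes whose semantic roles are grounded in scene nodes (the scene-event mapping). This is an illustrative assumption based only on the abstract; all class and method names (HostSG, ObjectNode, map_scene_to_event, etc.) are hypothetical and not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    # NOTE: a hypothetical sketch of a HostSG-style graph, not the authors' code.

    @dataclass(frozen=True)
    class ObjectNode:
        node_id: int
        frame: int       # index of the frame the object appears in
        label: str       # e.g. "person", "ball"

    @dataclass
    class EventNode:
        verb: str                                                   # e.g. "throw"
        roles: Dict[str, ObjectNode] = field(default_factory=dict)  # role -> grounded scene node

    @dataclass
    class HostSG:
        objects: List[ObjectNode] = field(default_factory=list)
        spatial_edges: List[Tuple[ObjectNode, str, ObjectNode]] = field(default_factory=list)
        temporal_edges: List[Tuple[ObjectNode, ObjectNode]] = field(default_factory=list)
        events: List[EventNode] = field(default_factory=list)

        def add_object(self, frame: int, label: str) -> ObjectNode:
            node = ObjectNode(len(self.objects), frame, label)
            self.objects.append(node)
            return node

        def add_spatial(self, a: ObjectNode, rel: str, b: ObjectNode) -> None:
            assert a.frame == b.frame, "spatial relations hold within a single frame"
            self.spatial_edges.append((a, rel, b))

        def add_temporal(self, earlier: ObjectNode, later: ObjectNode) -> None:
            assert earlier.frame < later.frame, "temporal links point forward in time"
            self.temporal_edges.append((earlier, later))

        def map_scene_to_event(self, verb: str, roles: Dict[str, ObjectNode]) -> EventNode:
            # Scene-event mapping: ground each semantic role of an event in a scene node.
            event = EventNode(verb, dict(roles))
            self.events.append(event)
            return event

    # Toy example: "a person throws a ball" across two frames.
    g = HostSG()
    p0 = g.add_object(frame=0, label="person")
    b0 = g.add_object(frame=0, label="ball")
    p1 = g.add_object(frame=1, label="person")
    b1 = g.add_object(frame=1, label="ball")
    g.add_spatial(p0, "holding", b0)
    g.add_temporal(p0, p1)
    g.add_temporal(b0, b1)
    g.map_scene_to_event("throw", {"Arg0 (agent)": p1, "Arg1 (thing thrown)": b1})
    print(f"{len(g.objects)} objects, {len(g.events)} event(s)")

The point the sketch illustrates is the two-level hierarchy suggested by the abstract: the scene level (objects plus spatial and temporal edges) stays fine-grained, while the event level attaches role-labeled arguments directly to grounded scene nodes.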



Index Terms

  1. Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

    Recommendations

    Comments

Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. event extraction
    2. scene graph
    3. semantic role labeling
    4. video understanding

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 446
    • Downloads (Last 6 weeks): 70
    Reflects downloads up to 13 Jan 2025


    Cited By

    View all
    • (2024) Video-of-thought. Proceedings of the 41st International Conference on Machine Learning, 13109-13125. DOI: 10.5555/3692070.3692596. Online publication date: 21-Jul-2024.
    • (2024) The 2nd International Workshop on Deep Multi-modal Generation and Retrieval. Proceedings of the 2nd International Workshop on Deep Multimodal Generation and Retrieval, 1-6. DOI: 10.1145/3689091.3690093. Online publication date: 28-Oct-2024.
    • (2024) Fine-grained Structural Hallucination Detection for Unified Visual Comprehension and Generation in Multimodal LLM. Proceedings of the 1st ACM Multimedia Workshop on Multi-modal Misinformation Governance in the Era of Foundation Models, 13-22. DOI: 10.1145/3689090.3689388. Online publication date: 28-Oct-2024.
    • (2024) XFashion: Character Animation Generation via Facial-enhanced and Granularly Controlling. Proceedings of the 5th International Workshop on Human-centric Multimedia Analysis, 7-12. DOI: 10.1145/3688865.3689480. Online publication date: 28-Oct-2024.
    • (2024) Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Muti-modal Coreference Resolution. Proceedings of the 32nd ACM International Conference on Multimedia, 8576-8585. DOI: 10.1145/3664647.3680966. Online publication date: 28-Oct-2024.
    • (2024) PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis. Proceedings of the 32nd ACM International Conference on Multimedia, 7667-7676. DOI: 10.1145/3664647.3680705. Online publication date: 28-Oct-2024.
    • (2024) SpeechEE: A Novel Benchmark for Speech Event Extraction. Proceedings of the 32nd ACM International Conference on Multimedia, 10449-10458. DOI: 10.1145/3664647.3680669. Online publication date: 28-Oct-2024.
    • (2024) Action Scene Graphs for Long-Form Understanding of Egocentric Videos. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18622-18632. DOI: 10.1109/CVPR52733.2024.01762. Online publication date: 16-Jun-2024.
