Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Zhao, Yu; Fei, Hao; Cao, Yixin; Li, Bobo; Zhang, Meishan; Wei, Jianguo; Zhang, Min; Chua, Tat-Seng

arXiv:2308.05081 (cs)

[Submitted on 9 Aug 2023 (v1), last revised 12 Aug 2023 (this version, v2)]

Abstract: Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent work has put forth methods for VidSRL, most suffer from two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation best coincides with the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses are provided for a better understanding of the advances of our method.
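
For readers unfamiliar with the idea, the following is a minimal sketch of what a spatio-temporal scene graph can look like as a data structure: per-frame relation triples form spatial edges, and edges across frames connect appearances of the same object. It is not the HostSG construction from the paper. The class names `Node` and `SceneGraph`, the `track` identifier, and the rule of linking consecutive appearances of a track are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    frame: int  # index of the video frame the object appears in
    label: str  # object category, e.g. "person"
    track: int  # hypothetical id shared by the same object across frames


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    # (subject, relation, object) triples within a single frame
    spatial_edges: list = field(default_factory=list)
    # (earlier node, later node) pairs for the same track
    temporal_edges: list = field(default_factory=list)

    def add_frame(self, frame, triples):
        """Add one frame; triples are ((label, track), relation, (label, track))."""
        for (s_label, s_track), rel, (o_label, o_track) in triples:
            subj = Node(frame, s_label, s_track)
            obj = Node(frame, o_label, o_track)
            for node in (subj, obj):
                if node not in self.nodes:
                    self.nodes.append(node)
            self.spatial_edges.append((subj, rel, obj))

    def link_tracks(self):
        """Connect consecutive appearances of each track (assumed linking rule)."""
        by_track = {}
        for node in self.nodes:
            by_track.setdefault(node.track, []).append(node)
        self.temporal_edges = []
        for track_nodes in by_track.values():
            track_nodes.sort(key=lambda n: n.frame)
            for earlier, later in zip(track_nodes, track_nodes[1:]):
                self.temporal_edges.append((earlier, later))


if __name__ == "__main__":
    graph = SceneGraph()
    graph.add_frame(0, [(("person", 1), "holds", ("cup", 2))])
    graph.add_frame(1, [(("person", 1), "drinks_from", ("cup", 2))])
    graph.link_tracks()
    print(len(graph.nodes), len(graph.spatial_edges), len(graph.temporal_edges))
```

In this toy form, the spatial edges carry what the abstract calls fine-grained spatial semantics, while the temporal edges carry how the same objects persist over time. The scene-event mapping and iterative refinement described in the abstract are not modeled here.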

Comments:	Accepted by ACM MM 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2308.05081 [cs.CV]
	(or arXiv:2308.05081v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.05081

Submission history

From: Hao Fei
[v1] Wed, 9 Aug 2023 17:20:14 UTC (16,214 KB)
[v2] Sat, 12 Aug 2023 06:02:02 UTC (16,178 KB)
