research-article

Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference

Authors:

Lianggangxu Chen,

Gaoqi He

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 4877 - 4885

https://doi.org/10.1145/3581783.3612249

Published: 27 October 2023

Abstract

The task of dynamic scene graph generation (DSGG) aims at constructing a set of frame-level scene graphs for a given video. It suffers from two kinds of spurious correlation. First, the spurious correlation between the input object pair and the predicate label is caused by the biased distribution of predicate samples in the dataset. Second, the spurious correlation between contextual information and the predicate label arises from interference caused by background content in both the current frame and adjacent frames of the video sequence. To alleviate these spurious correlations, our work is formulated as two sub-tasks: video-specific commonsense graph generation (VsCG) and causal inference (CI). The VsCG module aims to alleviate the first correlation by integrating prior knowledge into prediction. Information from all frames of the current video is used to enhance the commonsense graph constructed from co-occurrence patterns of all training samples, so that the commonsense graph is augmented with video-specific temporal dependencies. Then, a CI strategy with both intervention and counterfactual components is used. The intervention component further eliminates the first correlation by forcing the model to consider all possible predicate categories fairly, while the counterfactual component resolves the second correlation by removing the harmful effect of context. Comprehensive experiments on the Action Genome dataset show that the proposed method achieves state-of-the-art performance.
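The abstract gives no formulas, so the following is only a minimal sketch of two ideas it names, not the authors' implementation. It assumes that a commonsense prior can be estimated as the conditional frequency of each predicate given an object pair, and that the counterfactual step can be approximated by subtracting a context-only prediction from the factual one. The triplets, predicate names, logit values, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotations: (subject, object, predicate) triplets.
TRAIN_TRIPLETS = [
    ("person", "cup", "holding"),
    ("person", "cup", "holding"),
    ("person", "cup", "drinking_from"),
    ("person", "sofa", "sitting_on"),
    ("person", "sofa", "sitting_on"),
    ("person", "sofa", "lying_on"),
]
PREDICATES = ["holding", "drinking_from", "sitting_on", "lying_on"]


def build_cooccurrence_prior(triplets, predicates):
    """Estimate P(predicate | subject, object) from raw co-occurrence counts."""
    counts = defaultdict(Counter)
    for subj, obj, pred in triplets:
        counts[(subj, obj)][pred] += 1
    prior = {}
    for pair, counter in counts.items():
        total = sum(counter.values())
        prior[pair] = [counter[p] / total for p in predicates]
    return prior


def counterfactual_debias(factual_logits, context_only_logits):
    """Subtract the scores obtained from context alone (assumed: the object
    features are masked out), keeping only what the object pair contributes."""
    return [f - c for f, c in zip(factual_logits, context_only_logits)]


prior = build_cooccurrence_prior(TRAIN_TRIPLETS, PREDICATES)

# Hypothetical model outputs for one (person, cup) pair in one frame.
factual = [2.0, 1.8, 0.1, 0.0]        # prediction from objects plus context
context_only = [1.5, 0.2, 0.1, 0.0]   # prediction from context alone
debiased = counterfactual_debias(factual, context_only)

# Pick the predicate with the highest debiased score.
best = PREDICATES[max(range(len(PREDICATES)), key=debiased.__getitem__)]
print(prior[("person", "cup")], best)
```

In this toy case the factual scores favor the frequent predicate "holding", while the debiased scores favor "drinking_from", because most of the evidence for "holding" came from context alone. How the paper enhances the prior with video-specific information and how it performs the intervention are not specified in the abstract and are not modeled here.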

Supplementary Material

MP4 File (2396-video.mp4)

Presentation video for "Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference"


Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks → Scene understanding

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Shanghai Science and Technology Commission
the Open Project Program of the State Key Lab of CADCG
Natural Science Foundation Project of CQ
Shanghai Sailing Program

Conference

MM '23: The 31st ACM International Conference on Multimedia

Sponsor: SIGMM

October 29 - November 3, 2023

Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
