DOI: 10.1145/3581783.3612249
Research article

Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference

Published: 27 October 2023 Publication History

Abstract

Dynamic scene graph generation (DSGG) aims to construct a set of frame-level scene graphs for a given video. The task suffers from two kinds of spurious correlation. First, a spurious correlation between the input object pair and the predicate label is caused by the biased distribution of predicate samples in the dataset. Second, a spurious correlation between contextual information and the predicate label arises from interference by background content in both the current frame and adjacent frames of the video sequence. To alleviate these spurious correlations, our work is formulated as two sub-tasks: video-specific commonsense graph generation (VsCG) and causal inference (CI). The VsCG module aims to alleviate the first correlation by integrating prior knowledge into prediction. Information from all frames of the current video is used to enhance the commonsense graph constructed from the co-occurrence patterns of all training samples, so that the commonsense graph is augmented with video-specific temporal dependencies. Then, a CI strategy with both intervention and counterfactual components is applied. The intervention component further eliminates the first correlation by forcing the model to consider all possible predicate categories fairly, while the counterfactual component resolves the second correlation by removing the harmful effect of context. Comprehensive experiments on the Action Genome dataset show that the proposed method achieves state-of-the-art performance.
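
The two ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a toy example under stated assumptions: the commonsense prior is taken to be a smoothed conditional frequency table over (subject, object, predicate) training triples, and the counterfactual step is taken to be a plain subtraction of context-only logits from the factual logits. All function names, class indices, and numeric values are hypothetical.

```python
import numpy as np


def cooccurrence_prior(triples, num_objects, num_predicates, smoothing=1.0):
    """Estimate P(predicate | subject class, object class) from training triples.

    Assumption: the prior is a smoothed frequency table. The paper's
    commonsense graph may be built and refined differently.
    """
    counts = np.full(
        (num_objects, num_objects, num_predicates), smoothing, dtype=np.float64
    )
    for subj, obj, pred in triples:
        counts[subj, obj, pred] += 1.0
    # Normalise over the predicate axis so each (subject, object) row sums to 1.
    return counts / counts.sum(axis=-1, keepdims=True)


def counterfactual_debias(factual_logits, context_only_logits):
    """Remove the context-only contribution from the factual prediction.

    Assumption: the counterfactual logits come from a forward pass in which
    the object pair's own visual evidence is masked, leaving only context.
    """
    return np.asarray(factual_logits) - np.asarray(context_only_logits)


# Hypothetical label spaces: objects {0: person, 1: cup},
# predicates {0: holding, 1: looking_at, 2: drinking_from}.
train_triples = [(0, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 2)]
prior = cooccurrence_prior(train_triples, num_objects=2, num_predicates=3)

# Hypothetical logits for one person-cup pair in one frame.
factual = np.array([2.0, 0.5, 1.8])       # pair features plus context
context_only = np.array([1.5, 0.4, 0.6])  # pair features masked out
debiased = counterfactual_debias(factual, context_only)

factual_choice = int(np.argmax(factual))
debiased_choice = int(np.argmax(debiased))
```

In this toy case the frequent predicate (index 0) wins on the factual logits, while the less frequent predicate (index 2) wins once the context-only contribution is subtracted. The prior table shows how strongly the training distribution favours the frequent predicate for this object pair.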

Supplementary Material

MP4 File (2396-video.mp4)
Presentation video for "Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference"


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. causal inference
    2. dynamic scene graph generation
    3. multi-order graph attention network
    4. scene-specific knowledge

    Qualifiers

    • Research-article

    Funding Sources

    • Shanghai Science and Technology Commission
    • Open Project Program of the State Key Lab of CAD&CG
    • Natural Science Foundation Project of CQ
    • Shanghai Sailing Program

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%
