DOI: 10.1145/3581783.3612024

Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation

Published: 27 October 2023

Abstract

    Video-based scene graph generation (VidSGG) aims to represent video content as a dynamic graph by identifying visual entities and their relationships. Due to the inherently biased distribution and missing annotations in the training data, current VidSGG methods perform poorly on less-represented predicates. In this paper, we propose an explicit solution to this under-explored issue: supplementing the missing predicates that should appear in the ground-truth annotations. Dubbed Trico, our method mines three complementary spatio-temporal correlations to guide this supplementation. Guided by these correlations, the missing labels can be effectively recovered, yielding unbiased predicate predictions. We validate the effectiveness of Trico on the most widely used VidSGG datasets, i.e., VidVRD and VidOR. Extensive experiments demonstrate the state-of-the-art performance achieved by Trico, particularly on tail predicates. The code is available in the supplementary material.
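
    To make the general idea concrete: the sketch below is a minimal, hypothetical illustration of correlation-guided label supplementation, not the authors' actual Trico implementation (which is provided in their supplementary material). The predicate names, the correlation matrix, the single-matrix formulation, and the thresholding rule are all assumptions made for illustration; Trico itself combines three distinct spatio-temporal correlations.

```python
# Illustrative sketch only -- NOT the paper's actual Trico method.
# Assumption: a predicate-predicate correlation matrix (e.g., estimated from
# spatial, temporal, and semantic co-occurrence statistics) is used to turn a
# sparse one-hot predicate annotation into a soft multi-label training target.
import numpy as np

PREDICATES = ["in_front_of", "next_to", "chase", "bite", "watch"]  # hypothetical

# Hypothetical symmetric correlation scores in [0, 1].
CORRELATION = np.array([
    [1.0, 0.8, 0.1, 0.0, 0.2],
    [0.8, 1.0, 0.1, 0.1, 0.3],
    [0.1, 0.1, 1.0, 0.4, 0.5],
    [0.0, 0.1, 0.4, 1.0, 0.2],
    [0.2, 0.3, 0.5, 0.2, 1.0],
])

def supplement_labels(one_hot: np.ndarray,
                      correlation: np.ndarray,
                      threshold: float = 0.5) -> np.ndarray:
    """Add soft labels for predicates strongly correlated with an annotated one."""
    annotated = one_hot.astype(bool)
    # For each predicate, its strongest correlation with any annotated predicate.
    scores = correlation[annotated].max(axis=0)
    soft = np.where(scores >= threshold, scores, 0.0)
    soft[annotated] = 1.0  # ground-truth labels keep full confidence
    return soft

if __name__ == "__main__":
    # A sample annotated only with "chase"; "watch" (corr 0.5) gets supplemented.
    labels = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
    print(dict(zip(PREDICATES, supplement_labels(labels, CORRELATION))))
```

    In a training pipeline, such soft targets would typically replace the one-hot labels in a multi-label loss (e.g., binary cross-entropy), which is how supplemented tail predicates would receive a gradient signal.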


    Cited By

    • (2024) A New Training Data Organization Form and Training Mode for Unbiased Scene Graph Generation. IEEE Transactions on Circuits and Systems for Video Technology 34(7), 5295–5305. DOI: 10.1109/TCSVT.2023.3344569


      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 27 October 2023


      Author Tags

      1. long-tail problem
      2. missing label supplementation
      3. spatio-temporal correlations
      4. video scene graph generation

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research & Development Project of China
      • Fundamental Research Funds for the Central Universities
      • National Natural Science Foundation of China

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa, ON, Canada

      Acceptance Rates

      Overall acceptance rate: 995 of 4,171 submissions (24%)


