
Decomposed Prototype Learning for Few-Shot Scene Graph Generation

Published: 23 December 2024

Abstract

Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types, which makes them hard to apply in real-world settings with a massive number of uncommon predicate categories whose annotations are difficult to collect. In this article, we focus on Few-Shot SGG (FSSGG), which requires SGG models to quickly transfer previous knowledge and recognize unseen predicates from only a few examples. However, current methods for FSSGG are hindered by the high intra-class variance of predicate categories in SGG: on one hand, each predicate category commonly carries multiple semantic meanings under different contexts; on the other hand, the visual appearance of relation triplets with the same predicate differs greatly under different subject–object compositions. Such large input variance makes it hard to learn a generalizable representation for each predicate category with existing few-shot learning (FSL) methods. Fortunately, we find that this intra-class variance of predicates is highly related to the composed subjects and objects. To model the intra-class variance of predicates with subject–object context, we propose a novel Decomposed Prototype Learning (DPL) model for FSSGG. Specifically, we first construct a decomposable prototype space that captures the diverse semantics and visual patterns of subjects and objects for each predicate by decomposing it into multiple prototypes. We then integrate these prototypes with different weights to generate a query-adaptive predicate representation with more reliable semantics for each query sample. Extensive experiments and comparisons with various baseline methods demonstrate the effectiveness of our method.
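
The abstract only outlines the mechanism, so the snippet below is a minimal, hypothetical PyTorch-style sketch of the core idea it describes (decomposing each predicate class into several prototypes and aggregating them with query-dependent weights). It is not the authors' implementation; the module name DecomposedPrototypeClassifier, the number of prototypes, and the softmax-attention plus cosine-scoring details are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedPrototypeClassifier(nn.Module):
    # Hypothetical sketch: each predicate class owns K learnable prototypes; a query
    # relation feature attends over them to build a query-adaptive class representation.
    def __init__(self, num_predicates, num_prototypes, dim):
        super().__init__()
        # [C, K, D]: K prototypes per predicate class
        self.prototypes = nn.Parameter(torch.randn(num_predicates, num_prototypes, dim))

    def forward(self, query):
        # query: [B, D] fused subject-object feature of a relation proposal
        # attention weights of each query over every class's prototypes: [B, C, K]
        attn = torch.einsum('bd,ckd->bck', query, self.prototypes)
        attn = F.softmax(attn / query.size(-1) ** 0.5, dim=-1)
        # query-adaptive class representation: [B, C, D]
        class_repr = torch.einsum('bck,ckd->bcd', attn, self.prototypes)
        # score each query against its adapted representation of every class: [B, C]
        return F.cosine_similarity(query.unsqueeze(1), class_repr, dim=-1)

# toy usage: 4 relation queries, 50 predicate classes, 5 prototypes each, 256-d features
model = DecomposedPrototypeClassifier(num_predicates=50, num_prototypes=5, dim=256)
scores = model(torch.randn(4, 256))   # [4, 50] predicate scores

In this sketch, the weighted aggregation is what makes the class representation query-adaptive: a given predicate is compared against a mixture of its prototypes tailored to the current subject-object context rather than against a single fixed prototype.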




      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 21, Issue 1
      January 2025
      860 pages
      EISSN:1551-6865
      DOI:10.1145/3703004

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 December 2024
      Online AM: 21 October 2024
      Accepted: 15 September 2024
      Revised: 12 August 2024
      Received: 26 December 2023
      Published in TOMM Volume 21, Issue 1


      Author Tags

      1. Scene Graph Generation (SGG)
      2. Few-Shot Learning
      3. Prompt Learning
      4. Prototype Learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research and Development Project of China
      • National Natural Science Foundation of China
      • Fundamental Research Funds for the Central Universities
      • HKUST Special Support for Young Faculty
      • HKUST Sports Science and Technology Research
