
Decomposed Prototype Learning for Few-Shot Scene Graph Generation

Published: 23 December 2024

Abstract

Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types, which makes them hard to apply in real-world settings with a massive number of uncommon predicate categories whose annotations are difficult to collect. In this article, we focus on Few-Shot SGG (FSSGG), which requires SGG models to quickly transfer previous knowledge and recognize unseen predicates from only a few examples. However, current methods for FSSGG are hindered by the high intra-class variance of predicate categories in SGG: on one hand, each predicate category commonly carries multiple semantic meanings under different contexts; on the other hand, the visual appearance of relation triplets with the same predicate differs greatly under different subject–object compositions. Such large input variance makes it hard to learn a generalizable representation for each predicate category with existing few-shot learning (FSL) methods. Fortunately, we find that this intra-class variance of predicates is highly related to the composed subjects and objects. To model the intra-class variance of predicates with subject–object context, we propose a novel Decomposed Prototype Learning (DPL) model for FSSGG. Specifically, we first construct a decomposable prototype space that captures the diverse semantics and visual patterns of subjects and objects for each predicate by decomposing it into multiple prototypes. We then integrate these prototypes with different weights to generate a query-adaptive predicate representation with more reliable semantics for each query sample. Extensive experiments and comparisons with various baseline methods demonstrate the effectiveness of our method.
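
The abstract only outlines the mechanism, so the snippet below is a minimal, hypothetical PyTorch-style sketch of the core idea it describes (decomposing each predicate class into several prototypes and aggregating them with query-dependent weights). It is not the authors' implementation; the module name DecomposedPrototypeClassifier, the number of prototypes, and the softmax-attention plus cosine-scoring details are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedPrototypeClassifier(nn.Module):
    # Hypothetical sketch: each predicate class owns K learnable prototypes; a query
    # relation feature attends over them to build a query-adaptive class representation.
    def __init__(self, num_predicates, num_prototypes, dim):
        super().__init__()
        # [C, K, D]: K prototypes per predicate class
        self.prototypes = nn.Parameter(torch.randn(num_predicates, num_prototypes, dim))

    def forward(self, query):
        # query: [B, D] fused subject-object feature of a relation proposal
        # attention weights of each query over every class's prototypes: [B, C, K]
        attn = torch.einsum('bd,ckd->bck', query, self.prototypes)
        attn = F.softmax(attn / query.size(-1) ** 0.5, dim=-1)
        # query-adaptive class representation: [B, C, D]
        class_repr = torch.einsum('bck,ckd->bcd', attn, self.prototypes)
        # score each query against its adapted representation of every class: [B, C]
        return F.cosine_similarity(query.unsqueeze(1), class_repr, dim=-1)

# toy usage: 4 relation queries, 50 predicate classes, 5 prototypes each, 256-d features
model = DecomposedPrototypeClassifier(num_predicates=50, num_prototypes=5, dim=256)
scores = model(torch.randn(4, 256))   # [4, 50] predicate scores

In this sketch, the weighted aggregation is what makes the class representation query-adaptive: a given predicate is compared against a mixture of its prototypes tailored to the current subject-object context rather than against a single fixed prototype.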




      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 21, Issue 1
      January 2025
      860 pages
      EISSN:1551-6865
      DOI:10.1145/3703004

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 December 2024
      Online AM: 21 October 2024
      Accepted: 15 September 2024
      Revised: 12 August 2024
      Received: 26 December 2023
      Published in TOMM Volume 21, Issue 1


      Author Tags

      1. Scene Graph Generation (SGG)
      2. Few-Shot Learning
      3. Prompt Learning
      4. Prototype Learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Key Research and Development Project of China
      • National Natural Science Foundation of China
      • Fundamental Research Funds for the Central Universities
      • HKUST Special Support for Young Faculty
      • HKUST Sports Science and Technology Research
