Article
DOI:10.1007/978-3-031-06981-9_6

Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning

Published: 29 May 2022

Abstract

Scene graph generation aims to capture the semantic elements of an image by modelling objects and their relationships in a structured manner. Such structured representations are essential for visual understanding and reasoning tasks, including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. Existing scene graph generation approaches offer limited performance and expressiveness for higher-level visual understanding and reasoning. This limitation can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose infusing diverse commonsense knowledge about these semantic elements to generate rich and expressive scene graphs, using a heterogeneous knowledge source that consolidates commonsense knowledge from seven different knowledge bases. Graph embeddings of the object nodes capture their structural patterns in the knowledge source and are used to compute similarity metrics for graph refinement and enrichment. We performed an experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) than the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). Qualitative results in a downstream image generation task showed that more realistic images are generated from the commonsense knowledge-based scene graphs. These results demonstrate the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.
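
To make the R@K figures above concrete, the following is a minimal sketch of how a recall-at-K score can be computed for one image. It assumes a simplified setting in which a predicted (subject, predicate, object) triple counts as correct if its labels exactly match a ground-truth triple; the paper's actual evaluation protocol (for example, any bounding-box overlap requirement) is not described in this abstract. The function name `recall_at_k` and the toy triples are hypothetical.

```python
from typing import List, Set, Tuple

# A relationship triple: (subject, predicate, object). Label-only matching
# is an assumption made for this sketch, not the paper's stated protocol.
Triple = Tuple[str, str, str]


def recall_at_k(ranked_predictions: List[Triple],
                ground_truth: Set[Triple],
                k: int) -> float:
    """Fraction of ground-truth triples found among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Hypothetical toy data: predictions sorted by descending confidence.
predictions = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("hat", "on", "horse"),
]
truth = {("man", "riding", "horse"), ("horse", "on", "grass")}

r_at_2 = recall_at_k(predictions, truth, 2)
r_at_4 = recall_at_k(predictions, truth, 4)
print(r_at_2, r_at_4)
```

In this toy case only one of the two ground-truth triples appears in the top two predictions, while both appear in the top four, so recall grows with K. Benchmark figures such as those quoted above are typically averaged over all test images and reported as percentages.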

Cited By

  • (2023) Towards Multimodal Knowledge Graphs for Data Spaces. Companion Proceedings of the ACM Web Conference 2023, pp. 1494-1499. DOI:10.1145/3543873.3587665. Online publication date: 30-Apr-2023
  • (2023) L-TReiD: Logic Tensor Transformer for Re-identification. Advances in Visual Computing, pp. 345-357. DOI:10.1007/978-3-031-47966-3_27. Online publication date: 16-Oct-2023

Published In

The Semantic Web: 19th International Conference, ESWC 2022, Hersonissos, Crete, Greece, May 29 – June 2, 2022, Proceedings
May 2022
516 pages
ISBN:978-3-031-06980-2
DOI:10.1007/978-3-031-06981-9
  • Editors: Paul Groth, Maria-Esther Vidal, Fabian Suchanek, Pedro Szekely, Pavan Kapanipathi, Catia Pesquita, Hala Skaf-Molli, Minna Tamper

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. scene graph
  2. image representation
  3. commonsense knowledge
  4. visual reasoning
  5. image generation
