DOI: 10.1145/3394171.3413575

HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation

Published: 12 October 2020

Abstract

Scene graph generation aims to produce structured representations of images, which requires understanding the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier (SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings are learned via local graph-based message passing and then mapped to a global structure-based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation (HSA) module that reduces the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool for automatically searching a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves state-of-the-art performance on two popular benchmarks, Visual Genome and VRD.
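To make the SEC idea concrete, the sketch below is a minimal, illustrative PyTorch version of two steps named in the abstract: context embeddings produced by local graph-based message passing over object features, and an embedding-to-classifier mapping that turns a context embedding into the weights of a relation classifier. All module names, shapes, and the mean-pooled context are assumptions made for this example; this is not the authors' released implementation.

```python
# Illustrative sketch only, not the HOSE-Net implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalMessagePassing(nn.Module):
    """One round of mean-aggregation message passing over an object graph."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim) object features; adj: (N, N) 0/1 adjacency.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        messages = adj @ node_feats / deg  # mean of neighbor features
        return F.relu(self.update(torch.cat([node_feats, messages], dim=-1)))


class EmbeddingToClassifier(nn.Module):
    """Map a context embedding to the weights of a relation classifier."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.num_relations = num_relations
        self.gen = nn.Linear(dim, num_relations * dim)

    def forward(self, context_emb, pair_feats):
        # context_emb: (dim,) one context subspace; pair_feats: (P, dim).
        w = self.gen(context_emb).view(self.num_relations, -1)  # (R, dim)
        return pair_feats @ w.t()  # (P, R) relation logits


if __name__ == "__main__":
    N, P, dim, R = 5, 3, 64, 50
    mp = LocalMessagePassing(dim)
    e2c = EmbeddingToClassifier(dim, R)
    objs = torch.randn(N, dim)
    adj = (torch.rand(N, N) > 0.5).float()
    ctx = mp(objs, adj).mean(0)             # pooled context embedding (toy choice)
    logits = e2c(ctx, torch.randn(P, dim))  # relation scores for P object pairs
    print(logits.shape)                     # torch.Size([3, 50])
```

In the paper's terms, HSA would additionally cluster object categories into a hierarchy so that only a manageable number of such context-specific classifiers need to be learned; that step is omitted here.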

Supplementary Material

MP4 File (3394171.3413575.mp4)
Presentation video for the paper "HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation".




    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. context embedding
    2. graph convolution network
    3. knowledge graph
    4. scene graph generation

    Qualifiers

    • Research-article

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%



    Cited By

    • In Defense of Clip-Based Video Relation Detection. IEEE Transactions on Image Processing 33 (2024), 2759-2769. DOI: 10.1109/TIP.2024.3379935
    • Scene Graph Generation: A comprehensive survey. Neurocomputing 566 (2024), 127052. DOI: 10.1016/j.neucom.2023.127052
    • A Comprehensive Survey of Scene Graphs: Generation and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 1-26. DOI: 10.1109/TPAMI.2021.3137605
    • Scene Graph Generation using Depth-based Multimodal Network. In 2023 IEEE International Conference on Multimedia and Expo (ICME), 1139-1144. DOI: 10.1109/ICME55011.2023.00199
    • Addressing Predicate Overlap in Scene Graph Generation with Semantic Granularity Controller. In 2023 IEEE International Conference on Multimedia and Expo (ICME), 78-83. DOI: 10.1109/ICME55011.2023.00022
    • Compositional Feature Augmentation for Unbiased Scene Graph Generation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21628-21638. DOI: 10.1109/ICCV51070.2023.01982
    • RSSGG_CS: Remote Sensing Image Scene Graph Generation by Fusing Contextual Information and Statistical Knowledge. Remote Sensing 14, 13 (2022), 3118. DOI: 10.3390/rs14133118
    • Exploring correlation of relationship reasoning for scene graph generation. International Journal of Machine Learning and Cybernetics 13, 9 (2022), 2479-2493. DOI: 10.1007/s13042-022-01538-2
    • Recovering the Unbiased Scene Graphs from the Biased Ones. In Proceedings of the 29th ACM International Conference on Multimedia (2021), 1581-1590. DOI: 10.1145/3474085.3475297
